Honda: 2 years of ML vs. 1 month of prompting — here's what we learned

Traditional ML vs LLM Approaches

  • The original system used TF‑IDF (1‑gram) plus XGBoost and reportedly beat multiple vectorization/embedding approaches on heavily imbalanced data.
  • Several commenters are surprised the team didn't try a BERT‑style encoder classifier, noting such models were state‑of‑the‑art for text classification, and multilingual, by 2023.
  • Others point out encoder models (BERT/CLIP) can work very well but are underused because they require more ML expertise and GPU capacity.
  • A related thread references modern retrieval stacks (BM25/TF‑IDF + embeddings + reranking + augmentation) as powerful but complex, “taped‑together” systems.
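The 1‑gram TF‑IDF step the original system relied on is simple enough to sketch in stdlib Python. A hedged sketch follows: the real pipeline fed these vectors into XGBoost, which is swapped here for cosine nearest‑neighbour matching purely to keep the example dependency‑free, and the claim strings are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """1-gram TF-IDF: term frequency scaled by inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # number of docs each term appears in
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0
```

In the reported setup, vectors like these were classifier features; in this stripped‑down sketch, an unlabeled claim would simply be matched to the most similar labeled one.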

LLMs’ Strengths, Limits, and Process

  • LLMs are praised for making strong ML available to non‑experts: a small team can get good classification by prompt engineering instead of full pipelines.
  • Commenters stress this case is text classification on existing unstructured input, with minimal direct risk to customers—exactly where LLMs do well.
  • A key nuance: the “1 month of prompting” was enabled by years of prior work creating labeled data and evaluation frameworks.
  • Several warn against misreading this as endorsement of “zero‑shot, prompt and pray”; you still need labeled data and rigorous evals to know performance is acceptable.
  • Some suggest hybrid designs: LLM outputs and/or embeddings as features into XGBoost, likely improving results further.
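The "rigorous evals" the commenters insist on boil down to scoring predicted labels against a held‑out gold set. A minimal sketch of per‑class precision/recall, the metrics that matter most under class imbalance (the label values in the test are hypothetical):

```python
from collections import Counter

def per_class_metrics(gold, pred):
    """Per-label precision and recall from parallel lists of
    gold labels and predicted labels."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1          # predicted p, but it was wrong
            fn[g] += 1          # missed an actual g
    return {
        lab: {
            "precision": tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0,
            "recall":    tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0,
        }
        for lab in labels
    }
```

Whether the classifier behind `pred` is XGBoost or a prompted LLM, this harness is what lets you say the swap didn't regress performance.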

Data, Labeling, and Model Performance

  • Multiple practitioners say the main bottleneck in ML projects is not models but collecting, annotating, and validating high‑quality data (especially negative examples and handling class imbalance).
  • There’s discussion on how bias in datasets and poor negative sampling can permanently cap classifier quality, regardless of algorithm.
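One standard mitigation for the class imbalance raised above is inverse‑frequency class weighting, so rare failure modes aren't drowned out by the majority class. A generic sketch of the heuristic (not the article's code; the label values in the test are made up):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """weight_c = total / (n_classes * count_c): rarer classes get
    proportionally larger weights in the training loss."""
    counts = Counter(labels)
    total = len(labels)
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}
```

Weights like these are what gradient-boosting libraries and neural losses accept as per-class sample weights; they rebalance training but don't fix a dataset whose negatives were badly sampled in the first place.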

Cost, Infrastructure, and Practicality

  • The original TF‑IDF/XGBoost models could run on CPU; LLMs typically require GPUs or paid APIs.
  • For warranty claims, people argue even relatively expensive per‑request LLM calls are cheap compared with technician labor and claim costs.
  • Some lament being “forced” into overpowered LLM APIs rather than lean encoder models because execs want fast, impressive demos.

Domain‑Specific and Linguistic Aspects

  • Warranty data is seen as inherently noisy (technician behavior, multiple parts replaced, messy text) but critical due to safety and regulatory requirements.
  • LLMs are viewed as well‑suited to triage and classification here, but critics worry that automation could hide safety signals and weaken human oversight.
  • The reported improvement from translating French/Spanish claims into German fascinates people; there’s speculation that some languages align better with certain technical domains, but the mechanism remains unclear.
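The translation point describes a pivot‑language pipeline: normalize all claims into one language before classification. A structural sketch only, with `translate` and `classify` as caller‑supplied callables (both hypothetical, as is the choice of German as the pivot in the stub below):

```python
def classify_claim(text, source_lang, translate, classify):
    """Route non-pivot-language claims through translation before
    classification; claims already in the pivot language pass straight
    through. `translate` and `classify` are injected dependencies."""
    pivot = "de"  # pivot language reported in the discussion
    if source_lang != pivot:
        text = translate(text, source=source_lang, target=pivot)
    return classify(text)
```

Keeping translation and classification as separate injected steps makes the pivot choice an experiment you can A/B, which is presumably how the improvement was observed in the first place.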

Writing Style and Meta‑Discussion

  • Several readers think parts of the blog post sound LLM‑generated or “LinkedIn‑style,” spurring a side debate over AI‑authored prose, formulaic corporate writing, and methods to remove “slop” from model outputs.