Honda: 2 years of ML vs 1 month of prompting - here's what we learned
Traditional ML vs LLM Approaches
- The original system used TF‑IDF (1‑gram) plus XGBoost and reportedly beat multiple vectorization/embedding approaches on heavily imbalanced data.
- Several commenters are surprised the team didn’t try a BERT‑style encoder classifier, noting that such models were state‑of‑the‑art for text classification, and multilingual, by 2023.
- Others point out encoder models (BERT/CLIP) can work very well but are underused because they require more ML expertise and GPU capacity.
- A related thread references modern retrieval stacks (BM25/TF‑IDF + embeddings + reranking + augmentation) as powerful but complex, “taped‑together” systems.
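As a rough illustration of the baseline featurization discussed above, here is a minimal pure‑Python sketch of unigram TF‑IDF. The corpus, whitespace tokenization, and smoothing choice are assumptions for illustration only; the original system fed vectors like these into XGBoost, which is not shown here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Unigram TF-IDF with smoothed IDF (illustrative; naive whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed inverse document frequency: rarer terms get higher weight
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vectors, idf

# Hypothetical warranty-claim snippets, not from the article
claims = [
    "engine warning light after cold start",
    "brake noise when braking downhill",
    "engine stalls after cold start",
]
vecs, idf = tfidf_vectors(claims)
```

Terms concentrated in few documents ("brake") end up with a higher IDF than terms spread across the corpus ("engine"), which is what lets a downstream classifier latch onto discriminative vocabulary.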
LLMs’ Strengths, Limits, and Process
- LLMs are praised for making strong ML available to non‑experts: a small team can get good classification quality through prompt engineering instead of building full training pipelines.
- Commenters stress this case is text classification on existing unstructured input, with minimal direct risk to customers—exactly where LLMs do well.
- A key nuance: the “1 month of prompting” was enabled by years of prior work creating labeled data and evaluation frameworks.
- Several warn against misreading this as endorsement of “zero‑shot, prompt and pray”; you still need labeled data and rigorous evals to know performance is acceptable.
- Some suggest hybrid designs: LLM outputs and/or embeddings as features into XGBoost, likely improving results further.
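The "labeled data plus rigorous evals" point above is concrete: once gold labels exist, grading a prompted classifier is a few lines of scoring code. A minimal per‑class precision/recall/F1 sketch, with made‑up labels for illustration:

```python
from collections import defaultdict

def per_class_prf(gold, pred):
    """Per-class precision, recall, and F1 from gold labels vs. predictions."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed the true label g
    scores = {}
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = {"precision": prec, "recall": rec, "f1": f1}
    return scores

# Hypothetical claim categories and LLM outputs
gold = ["brakes", "engine", "engine", "electrical"]
pred = ["brakes", "engine", "electrical", "electrical"]
scores = per_class_prf(gold, pred)
```

This is the kind of harness that makes "1 month of prompting" measurable: each prompt revision gets scored against the same held‑out labels rather than eyeballed.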
Data, Labeling, and Model Performance
- Multiple practitioners say the main bottleneck in ML projects is not models but collecting, annotating, and validating high‑quality data (especially negative examples and handling class imbalance).
- There’s discussion on how bias in datasets and poor negative sampling can permanently cap classifier quality, regardless of algorithm.
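One standard mitigation for the class imbalance mentioned above is inverse‑frequency class weighting (the same "balanced" heuristic scikit‑learn uses for its `class_weight` option). A stdlib‑only sketch with made‑up labels:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count(class)).
    Rare classes get proportionally larger weights in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Heavily imbalanced, as warranty data tends to be (hypothetical counts)
labels = ["ok"] * 90 + ["defect"] * 10
weights = balanced_class_weights(labels)
```

Weighting counteracts imbalance during training, but as the thread notes it cannot fix biased sampling: if the negatives were collected badly, reweighting just amplifies a skewed signal.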
Cost, Infrastructure, and Practicality
- The original TF‑IDF + XGBoost models could run on CPU; LLMs typically require GPUs or paid APIs.
- For warranty claims, people argue even relatively expensive per‑request LLM calls are cheap compared with technician labor and claim costs.
- Some lament being “forced” into overpowered LLM APIs rather than lean encoder models because execs want fast, impressive demos.
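The cost argument is easy to sanity‑check with back‑of‑the‑envelope arithmetic. All figures below are hypothetical assumptions for illustration, not numbers from the article or the discussion:

```python
# Hypothetical figures -- assumptions, not reported data
llm_cost_per_claim = 0.02        # assumed per-request API cost in USD
claims_per_month = 50_000        # assumed claim volume
monthly_llm_cost = llm_cost_per_claim * claims_per_month

# Value of saved human effort: a few technician-minutes per claim
technician_rate_per_hour = 60.0  # assumed fully loaded labor cost in USD
minutes_saved_per_claim = 3      # assumed time saved by automated triage
monthly_labor_value = (
    claims_per_month * (minutes_saved_per_claim / 60) * technician_rate_per_hour
)
```

Under these assumptions the API bill is on the order of $1,000/month against roughly $150,000/month of technician time, which is why per‑request LLM pricing barely registers in this domain.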
Domain‑Specific and Linguistic Aspects
- Warranty data is seen as inherently noisy (technician behavior, multiple parts replaced, messy text) but critical due to safety and regulatory requirements.
- LLMs are viewed as well‑suited to triage and classification here, but critics worry that automation could hide safety signals and weaken human oversight.
- The reported improvement from translating French/Spanish claims into German fascinates people; there’s speculation that some languages align better with certain technical domains, but the mechanism remains unclear.
Writing Style and Meta‑Discussion
- Several readers think parts of the blog post sound LLM‑generated or “LinkedIn‑style,” spurring a side debate over AI‑authored prose, formulaic corporate writing, and methods to remove “slop” from model outputs.