Honda: 2 years of ML vs 1 month of prompting - here's what we learned
Traditional ML vs LLM Approaches
- The original system used TF‑IDF (1‑gram) plus XGBoost and reportedly beat multiple vectorization/embedding approaches on heavily imbalanced data.
- Several commenters are surprised the team didn’t try a BERT‑style encoder classifier, noting that such models were state‑of‑the‑art for text classification, and multilingual, by 2023.
- Others point out encoder models (BERT/CLIP) can work very well but are underused because they require more ML expertise and GPU capacity.
- A related thread references modern retrieval stacks (BM25/TF‑IDF + embeddings + reranking + augmentation) as powerful but complex, “taped‑together” systems.
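As a rough illustration of the baseline featurization discussed above, here is a minimal pure‑Python sketch of unigram TF‑IDF. The corpus, whitespace tokenization, and smoothing choice are assumptions for illustration only; the original system fed vectors like these into XGBoost, which is not shown here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Unigram TF-IDF with smoothed IDF (illustrative; naive whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed inverse document frequency: rarer terms get higher weight
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vectors, idf

# Hypothetical warranty-claim snippets, not from the article
claims = [
    "engine warning light after cold start",
    "brake noise when braking downhill",
    "engine stalls after cold start",
]
vecs, idf = tfidf_vectors(claims)
```

Terms concentrated in few documents ("brake") end up with a higher IDF than terms spread across the corpus ("engine"), which is what lets a downstream classifier latch onto discriminative vocabulary.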
LLMs’ Strengths, Limits, and Process
- LLMs are praised for making strong ML available to non‑experts: a small team can get good classification quality through prompt engineering instead of building full training pipelines.
- Commenters stress this case is text classification on existing unstructured input, with minimal direct risk to customers—exactly where LLMs do well.
- A key nuance: the “1 month of prompting” was enabled by years of prior work creating labeled data and evaluation frameworks.
- Several warn against misreading this as endorsement of “zero‑shot, prompt and pray”; you still need labeled data and rigorous evals to know performance is acceptable.
- Some suggest hybrid designs: LLM outputs and/or embeddings as features into XGBoost, likely improving results further.
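The "labeled data plus rigorous evals" point above is concrete: once gold labels exist, grading a prompted classifier is a few lines of scoring code. A minimal per‑class precision/recall/F1 sketch, with made‑up labels for illustration:

```python
from collections import defaultdict

def per_class_prf(gold, pred):
    """Per-class precision, recall, and F1 from gold labels vs. predictions."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed the true label g
    scores = {}
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = {"precision": prec, "recall": rec, "f1": f1}
    return scores

# Hypothetical claim categories and LLM outputs
gold = ["brakes", "engine", "engine", "electrical"]
pred = ["brakes", "engine", "electrical", "electrical"]
scores = per_class_prf(gold, pred)
```

This is the kind of harness that makes "1 month of prompting" measurable: each prompt revision gets scored against the same held‑out labels rather than eyeballed.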
Data, Labeling, and Model Performance
- Multiple practitioners say the main bottleneck in ML projects is not models but collecting, annotating, and validating high‑quality data (especially negative examples and handling class imbalance).
- There’s discussion on how bias in datasets and poor negative sampling can permanently cap classifier quality, regardless of algorithm.
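One standard mitigation for the class imbalance mentioned above is inverse‑frequency class weighting (the same "balanced" heuristic scikit‑learn uses for its `class_weight` option). A stdlib‑only sketch with made‑up labels:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count(class)).
    Rare classes get proportionally larger weights in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Heavily imbalanced, as warranty data tends to be (hypothetical counts)
labels = ["ok"] * 90 + ["defect"] * 10
weights = balanced_class_weights(labels)
```

Weighting counteracts imbalance during training, but as the thread notes it cannot fix biased sampling: if the negatives were collected badly, reweighting just amplifies a skewed signal.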
Cost, Infrastructure, and Practicality
- The original TF‑IDF + XGBoost models could run on CPU; LLMs typically require GPUs or paid APIs.
- For warranty claims, people argue even relatively expensive per‑request LLM calls are cheap compared with technician labor and claim costs.
- Some lament being “forced” into overpowered LLM APIs rather than lean encoder models because execs want fast, impressive demos.
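The cost argument is easy to sanity‑check with back‑of‑the‑envelope arithmetic. All figures below are hypothetical assumptions for illustration, not numbers from the article or the discussion:

```python
# Hypothetical figures -- assumptions, not reported data
llm_cost_per_claim = 0.02        # assumed per-request API cost in USD
claims_per_month = 50_000        # assumed claim volume
monthly_llm_cost = llm_cost_per_claim * claims_per_month

# Value of saved human effort: a few technician-minutes per claim
technician_rate_per_hour = 60.0  # assumed fully loaded labor cost in USD
minutes_saved_per_claim = 3      # assumed time saved by automated triage
monthly_labor_value = (
    claims_per_month * (minutes_saved_per_claim / 60) * technician_rate_per_hour
)
```

Under these assumptions the API bill is on the order of $1,000/month against roughly $150,000/month of technician time, which is why per‑request LLM pricing barely registers in this domain.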
Domain‑Specific and Linguistic Aspects
- Warranty data is seen as inherently noisy (technician behavior, multiple parts replaced, messy text) but critical due to safety and regulatory requirements.
- LLMs are viewed as well‑suited to triage and classification here, but critics worry that automation could hide safety signals and weaken human oversight.
- The reported improvement from translating French/Spanish claims into German fascinates people; there’s speculation that some languages align better with certain technical domains, but the mechanism remains unclear.
Writing Style and Meta‑Discussion
- Several readers think parts of the blog post sound LLM‑generated or “LinkedIn‑style,” spurring a side debate over AI‑authored prose, formulaic corporate writing, and methods to remove “slop” from model outputs.