Lessons learned from fine-tuning 20+ models across industries — what works, what doesn't, and how to avoid the most common pitfalls.
Why Fine-Tune?
Off-the-shelf LLMs are impressive, but they often fall short for enterprise use cases. They hallucinate domain-specific facts, miss industry jargon, and can't follow your company's specific output formats. Fine-tuning bridges that gap.
Choosing Your Approach
After fine-tuning 20+ models across healthcare, finance, and logistics, we've found that the right approach depends on three factors:
- **Data volume**: Under 1,000 examples? Start with few-shot prompting and RAG. Between 1K and 50K? Use LoRA fine-tuning. Over 50K? Full fine-tuning may be justified.
- **Task specificity**: Classification and extraction tasks benefit enormously from fine-tuning. Open-ended generation tasks often do better with RAG plus prompting.
- **Latency requirements**: Smaller fine-tuned models (7B-13B) can outperform prompted 70B+ models on specific tasks while being 10x cheaper to serve.

Data Quality Is Everything
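The data-volume rule of thumb above can be sketched as a small triage helper. The function name, thresholds, and labels are illustrative (taken from the guidance above), not a real API:

```python
def choose_approach(n_examples: int) -> str:
    """Map training-set size to a starting approach, per the rules of thumb above."""
    if n_examples < 1_000:
        return "few-shot prompting + RAG"
    if n_examples <= 50_000:
        return "LoRA fine-tuning"
    return "full fine-tuning"
```

Data volume is only the first filter; task specificity and latency budget can override it in either direction.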
The single biggest predictor of fine-tuning success is training data quality. We spend 60-70% of every engagement on data curation:
- Remove duplicates and near-duplicates
- Validate outputs with domain experts
- Balance the dataset across edge cases
- Create adversarial examples for robustness

Evaluation Framework
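As one example of the dedup step, a minimal near-duplicate filter using the standard library's `difflib`. The 0.9 similarity threshold and whitespace/case normalization are assumptions; the quadratic comparison is fine at curation scale, and MinHash/LSH is the usual substitute for larger corpora:

```python
from difflib import SequenceMatcher

def drop_near_duplicates(examples: list[str], threshold: float = 0.9) -> list[str]:
    """Keep an example only if it isn't too similar to any already-kept one.

    Assumed threshold of 0.9; O(n^2), so suited to curation-sized datasets.
    """
    kept: list[str] = []
    for ex in examples:
        norm = " ".join(ex.lower().split())  # collapse case and whitespace
        if all(
            SequenceMatcher(None, norm, " ".join(k.lower().split())).ratio() < threshold
            for k in kept
        ):
            kept.append(ex)
    return kept
```

For example, `drop_near_duplicates(["Extract the invoice date.", "Extract the invoice  date.", "Summarize the contract."])` keeps only one of the two invoice prompts.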
Don't rely on loss curves alone. We build custom evaluation suites for every engagement:
- **Automated metrics**: Task-specific metrics (F1, BLEU, ROUGE)
- **LLM-as-judge**: GPT-4 evaluation with structured rubrics
- **Human evaluation**: Domain expert blind reviews on a 100-sample subset
- **Regression testing**: Ensure the model didn't lose capabilities

Key Lessons
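For the automated-metrics layer, a sketch of the token-overlap F1 commonly used for extraction tasks (the normalization choices here are assumptions; SQuAD-style evaluation also strips punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference extraction."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Running such metrics on a frozen holdout set before and after each training run is also the cheapest form of regression testing.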
- Start with the smallest model that could work
- Invest in data quality over quantity
- LoRA is almost always the right starting point
- Build evaluation before training
- Plan for continuous retraining from day one
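As a concrete starting point for the LoRA recommendation, a minimal configuration sketch using Hugging Face's `peft` library. The hyperparameters and target modules are typical starting values, not prescriptions, and the base model name is just an example:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Typical starting hyperparameters; tune rank and alpha per task.
config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # usually well under 1% of base parameters
```

Because only the adapter weights train, the same base model can serve multiple tasks by swapping adapters at load time.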