The question that actually matters for automation
Open-weights LLM benchmarks usually rank models on tasks like "write a sonnet" or "explain quantum physics." Those scores correlate poorly with what business automation needs: extract a date from an invoice, route a support ticket, decide whether a sales lead is qualified, call the right API with the right arguments.
We benchmarked the three most-discussed local models on those tasks specifically. The winner isn't who you'd expect.
Context: the three contenders
- Llama 3.1 8B (Meta) — the most-deployed open-weights model. Strong general capability, a large 128K-token context window, and mature tool-calling support. License: Llama Community License (commercial use OK with caveats).
- Mistral Small 3 (Mistral AI) — French-built model with excellent French/English bilingual capability and very strong instruction following. License: Apache 2.0 (no caveats).
- Phi-3 Medium 14B (Microsoft) — punches above its weight class, designed specifically for reasoning. License: MIT.
Benchmark results across three real tasks
Task 1: Invoice extraction (1,000 invoices)
Goal: extract vendor name, total, due date, line items as JSON. We measured field-level accuracy.
- Llama 3.1 8B: 94.2% field accuracy
- Mistral Small 3: 96.8% field accuracy
- Phi-3 Medium: 91.4% field accuracy
Mistral wins. Its instruction-following discipline shows up most when you need rigid JSON output.
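To make the setup concrete, here is a minimal sketch of the kind of extraction call we mean, using the Ollama Python client. The field names, prompt wording, and model tag are illustrative placeholders, not our exact benchmark harness.

```python
import json
import ollama  # pip install ollama

EXTRACTION_PROMPT = """Extract the following fields from the invoice below and
return ONLY valid JSON with exactly these keys: vendor_name (string),
total (number), due_date (YYYY-MM-DD), and line_items (a list of objects
with description, quantity, and unit_price).

Invoice text:
{invoice_text}"""

def extract_invoice(invoice_text: str, model: str = "mistral-small") -> dict:
    """Ask the model for rigid JSON and parse it; raises if the output isn't valid JSON."""
    response = ollama.chat(
        model=model,  # Ollama model tag; adjust to whatever you have pulled locally
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(invoice_text=invoice_text)}],
        format="json",               # ask Ollama to constrain output to valid JSON
        options={"temperature": 0},  # extraction wants determinism, not creativity
    )
    return json.loads(response["message"]["content"])
```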
Task 2: Support ticket triage (5,000 tickets)
Goal: classify tickets into 12 categories, set urgency 1-5, identify customer sentiment.
- Llama 3.1 8B: 87.1% category, 81.3% urgency
- Mistral Small 3: 88.4% category, 82.7% urgency
- Phi-3 Medium: 89.6% category, 84.1% urgency
Phi-3 wins. Microsoft's reasoning training shows up here — multi-axis classification benefits from a model that thinks before answering.
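For reference, a triage call looks roughly like the sketch below. The category list is a placeholder, not our actual 12-category taxonomy, and the model tag is whatever you have pulled locally.

```python
import json
import ollama  # pip install ollama

# Placeholder taxonomy -- substitute your own 12 categories.
CATEGORIES = ["billing", "bug_report", "feature_request", "account_access", "other"]

TRIAGE_PROMPT = """Classify the support ticket below. Return ONLY valid JSON with keys:
category (one of: {categories}), urgency (integer 1-5, 5 = most urgent),
sentiment (one of: positive, neutral, negative).

Ticket:
{ticket}"""

def triage_ticket(ticket: str, model: str = "phi3:medium") -> dict:
    """Single call that classifies all three axes at once."""
    prompt = TRIAGE_PROMPT.format(categories=", ".join(CATEGORIES), ticket=ticket)
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0},
    )
    return json.loads(response["message"]["content"])
```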
Task 3: Tool calling (custom CRM agent, 200 tasks)
Goal: agent uses 8 CRM tools to complete a multi-step task.
- Llama 3.1 8B: 91.5% task completion
- Mistral Small 3: 88.0% task completion
- Phi-3 Medium: 79.5% task completion
Llama wins. Native tool-calling fine-tuning matters more than raw reasoning here.
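The agent tasks used Ollama's native tool calling. Here is a trimmed sketch with one made-up CRM tool; the schema is the OpenAI-style function spec that Ollama accepts, and response details may vary slightly between client versions.

```python
import ollama  # pip install ollama

# One illustrative CRM tool (the benchmark agent had 8 of these).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_contact",
        "description": "Find a CRM contact record by email address",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
}]

def agent_step(messages: list, model: str = "llama3.1"):
    """One agent turn: the model either answers or requests tool calls."""
    response = ollama.chat(model=model, messages=messages, tools=TOOLS)
    tool_calls = response["message"].get("tool_calls") or []
    for call in tool_calls:
        # In a real loop you would execute the tool here, append the result to
        # `messages` as a role="tool" message, and call the model again.
        print(call["function"]["name"], call["function"]["arguments"])
    return response
```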
What this means for your business
There is no universal winner. Match the model to the workload:
- Document/data extraction → Mistral. The Apache license also makes it legally simpler.
- Classification and triage → Phi-3. Best per-parameter reasoning for decision-making tasks.
- Multi-step agents with tools → Llama 3.1. Its tool-calling support is more mature than either alternative's.
What to do now
If you're standardizing on one model: pick Llama 3.1 8B. It's the most flexible across workloads and has the largest ecosystem of fine-tunes, integrations, and operational tooling. The two to three percentage points you give up on extraction or triage will cost less than running three model variants in production.
If you're optimizing per-workload: deploy multiple models behind a router that picks the best one for each task type. Ollama makes this trivial — multiple models can share the same GPU with on-demand loading.
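A per-workload router can be as simple as a lookup table in front of Ollama. A minimal sketch, assuming the model tags below match what you have pulled locally:

```python
import ollama  # pip install ollama

# Routing table based on the results above -- adjust tags to your local installs.
MODEL_FOR_TASK = {
    "extraction": "mistral-small",
    "triage": "phi3:medium",
    "agent": "llama3.1",
}

def route(task_type: str, messages: list):
    """Send the request to whichever model won that workload; fall back to the generalist."""
    model = MODEL_FOR_TASK.get(task_type, "llama3.1")
    return ollama.chat(model=model, messages=messages)
```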
FAQ
Will the gap to frontier models keep narrowing in 2026?
Yes — and that's the whole bet. Open-weights model quality is improving roughly 2x per year on common business benchmarks. The gap to frontier models that mattered in 2024 is largely gone for structured tasks.
What about Qwen, DeepSeek, Gemma?
All worth evaluating. We focused on these three because they have the most production deployments to learn from. Qwen 2.5 is particularly strong if you have any non-English content.
Should I fine-tune any of these on my data?
For most automation tasks, a well-written prompt beats a fine-tuned model. Reach for fine-tuning only when you've maxed out prompt engineering and still need more accuracy on a narrow, repeated task.
Get a model recommendation for your specific workloads — we'll benchmark your top three automation tasks against all three models and tell you exactly what to deploy.