Choose a language model
A decision process for picking the right Berget AI language model for your use case
Berget AI offers language models across a range of sizes and price points. This guide walks you through a repeatable process for picking the right language model and knowing when to switch. It covers language models only; for embeddings, reranking, speech-to-text, and OCR, see the models overview.
Before you start
Before choosing a model, have answers to these three questions:
- Capability: What must the model reliably do? Complex reasoning, multilingual output, code generation, structured extraction?
- Speed: How quickly must it respond? Real-time applications have different constraints than batch processing.
- Cost: What's your budget per million tokens, and at what volume?
These answers determine where you start. You can always adjust after testing.
Step 1: Pick a starting model
Use this table to find the best fit for your use case.
| Use case | Start with |
|---|---|
| Customer-facing chatbots, complex persona or policy instructions | GPT-OSS 120B or Llama 3.3 70B |
| Code generation, debugging, code review | GPT-OSS 120B, Llama 3.3 70B, or GLM 4.7 |
| Analytical reasoning, structured extraction, classification | GPT-OSS 120B; GLM 4.7 with reasoning mode for lower cost |
| Content moderation, policy classification | Llama 3.3 70B or GPT-OSS 120B |
| Simple Q&A, FAQ systems, focused single-turn tasks | Mistral Small 3.2 24B |
| Multilingual applications | GLM 4.7 with an explicit language directive |
| Extremely narrow, single-turn tasks where cost is the primary constraint | Llama 3.1 8B |
If your use case isn't listed or spans multiple categories, start with Llama 3.3 70B. It handles a wide range of tasks reliably and is a good baseline for benchmarking.
Step 2: Consider your starting strategy
There are two valid approaches depending on how well-defined your requirements are.
Option 1: Start cost-efficient, upgrade if needed
If you have a clear, focused task (a FAQ bot, a classifier, a summariser), start with Mistral Small 3.2 24B. Test it against your actual prompts. Upgrade to a frontier model only if you hit a capability ceiling.
Option 2: Start capable, optimise down
If your task is complex or the failure cost is high (customer-facing systems, multi-constraint instructions, anything involving nuanced reasoning), start with GPT-OSS 120B or Llama 3.3 70B. Once you understand the task's demands, consider whether a smaller model can meet the same bar at lower cost.
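Whichever option you pick, it helps to make the choice explicit in code so that moving between tiers is a one-line change. A minimal sketch of this idea; the model identifiers below are illustrative placeholders, not confirmed Berget AI model IDs:

```python
# Tier-based model selection. The model identifiers are illustrative
# placeholders -- substitute the actual IDs from the Berget AI models overview.
MODEL_TIERS = {
    "cost_efficient": "mistral-small-3.2-24b",  # placeholder ID
    "capable": "gpt-oss-120b",                  # placeholder ID
}

def pick_starting_model(task_is_complex: bool, failure_cost_is_high: bool) -> str:
    """Option 2 (start capable) when the task is complex or failures are
    expensive; otherwise Option 1 (start cost-efficient)."""
    if task_is_complex or failure_cost_is_high:
        return MODEL_TIERS["capable"]
    return MODEL_TIERS["cost_efficient"]

# Example: a FAQ bot is a focused task with low failure cost.
print(pick_starting_model(task_is_complex=False, failure_cost_is_high=False))
```

Keeping the tier decision in one place also makes the later upgrade/downgrade step a configuration change rather than a refactor.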
Step 3: Test with your actual prompts
Run your system prompt and representative inputs against the model you chose. Use real data: synthetic or simplified inputs won't surface the failure modes you'll see in production.

- Instruction fidelity: Does the model follow all constraints, including edge cases? Multi-constraint system prompts, where rules interact or conflict, are the most reliable stress test.
- Consistency: Does output quality hold across multiple turns and varied inputs? A model that performs well on the first turn but drifts over a conversation isn't production-ready.
- Format compliance: If you need structured output, does the model produce it reliably? Test with inputs that are ambiguous or malformed, as these are the cases most likely to break format adherence.
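Format compliance in particular is easy to score mechanically rather than by eye. A minimal sketch of such a check, assuming structured output as JSON; the schema and helper names here are illustrative, not part of any Berget AI API:

```python
import json

def check_json_output(raw: str, required_keys: set) -> bool:
    """Format-compliance check: output must be valid JSON (an object)
    containing every required key. Strict on purpose -- near-misses,
    like JSON wrapped in prose, count as failures."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def pass_rate(outputs: list, required_keys: set) -> float:
    """Fraction of model outputs that pass the format check."""
    if not outputs:
        return 0.0
    passed = sum(check_json_output(o, required_keys) for o in outputs)
    return passed / len(outputs)

# Example: one compliant output, one missing a key, one not valid JSON.
samples = [
    '{"intent": "refund", "confidence": 0.9}',
    '{"intent": "refund"}',                      # missing "confidence" -> fail
    'Sure! Here is the JSON: {"intent": ...}',   # prose wrapper -> fail
]
print(pass_rate(samples, {"intent", "confidence"}))  # 1 of 3 pass
```

Run the same check over responses to your ambiguous and malformed test inputs; the resulting pass rate gives you a concrete number to compare across models in Step 4.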
Step 4: Upgrade or downgrade based on results
If the model fails on instruction fidelity or consistency, move up a tier. If it passes comfortably on all criteria and cost is a concern, test the next tier down.
Signals that suggest upgrading
- The model drops instructions in multi-turn conversations
- Structured output format breaks on edge-case inputs
- The model can't handle the full complexity of your system prompt
Signals that suggest downgrading
- The model passes all your benchmarks with headroom to spare
- Response latency is higher than your application requires
- Cost is a constraint and you haven't yet tested a smaller model
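When cost is the deciding signal, it helps to translate per-million-token prices into cost per request and per month before comparing tiers. A minimal sketch of that arithmetic; the prices used in the example are made-up placeholders, not Berget AI's actual rates:

```python
def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, given per-million-token prices for
    input (prompt) and output (completion) tokens."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000

def monthly_cost(requests_per_day: int, avg_request_cost: float) -> float:
    """Rough monthly cost assuming a 30-day month."""
    return requests_per_day * 30 * avg_request_cost

# Example with placeholder prices: 1,500 prompt + 400 completion tokens
# at $0.50 / $1.50 per million input/output tokens.
per_req = cost_per_request(1_500, 400, 0.50, 1.50)
print(per_req)                      # cost of a single request
print(monthly_cost(10_000, per_req))  # at 10k requests/day
```

Comparing this number across tiers, alongside the pass rates from Step 3, turns the upgrade/downgrade decision into a concrete trade-off rather than a guess.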