Choose a language model
A decision process for picking the right Berget AI language model for your use case
Berget AI offers language models across a range of sizes and price points. This guide walks you through a repeatable process for picking the right language model and knowing when to switch. It covers language models only; for embeddings, reranking, speech-to-text, and OCR, see the models overview.
Before you start
Before choosing a model, have answers to these three questions:
- Capability: What must the model reliably do? Complex reasoning, multilingual output, code generation, structured extraction?
- Speed: How quickly must it respond? Real-time applications have different constraints than batch processing.
- Cost: What's your budget per million tokens, and at what volume?
These answers determine where you start. You can always adjust after testing.
Step 1: Pick a starting model
Use this table to find the best fit for your use case.
| Use case | Start with |
|---|---|
| Customer-facing chatbots, complex persona or policy instructions | GPT-OSS 120B or Llama 3.3 70B |
| Code generation, debugging, code review | GPT-OSS 120B, Llama 3.3 70B, or GLM 4.7 |
| Analytical reasoning, structured extraction, classification | GPT-OSS 120B; GLM 4.7 with reasoning mode for lower cost |
| Content moderation, policy classification | Llama 3.3 70B or GPT-OSS 120B |
| Simple Q&A, FAQ systems, focused single-turn tasks | Mistral Small 3.2 24B |
| Multilingual applications | GLM 4.7 with an explicit language directive |
| Extremely narrow, single-turn tasks where cost is the primary constraint | Llama 3.1 8B |
If your use case isn't listed or spans multiple categories, start with Llama 3.3 70B. It handles a wide range of tasks reliably and is a good baseline for benchmarking.
Step 2: Consider your starting strategy
There are two valid approaches depending on how well-defined your requirements are.
Option 1: Start cost-efficient, upgrade if needed
If you have a clear, focused task (a FAQ bot, a classifier, a summariser), start with Mistral Small 3.2 24B. Test it against your actual prompts. Upgrade to a frontier model only if you hit a capability ceiling.
Option 2: Start capable, optimise down
If your task is complex or the failure cost is high (customer-facing systems, multi-constraint instructions, anything involving nuanced reasoning), start with GPT-OSS 120B or Llama 3.3 70B. Once you understand the task's demands, consider whether a smaller model can meet the same bar at lower cost.
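Whichever option you pick, it helps to make the choice explicit in code so that moving between tiers is a one-line change. A minimal sketch of this idea; the model identifiers below are illustrative placeholders, not confirmed Berget AI model IDs:

```python
# Tier-based model selection. The model identifiers are illustrative
# placeholders -- substitute the actual IDs from the Berget AI models overview.
MODEL_TIERS = {
    "cost_efficient": "mistral-small-3.2-24b",  # placeholder ID
    "capable": "gpt-oss-120b",                  # placeholder ID
}

def pick_starting_model(task_is_complex: bool, failure_cost_is_high: bool) -> str:
    """Option 2 (start capable) when the task is complex or failures are
    expensive; otherwise Option 1 (start cost-efficient)."""
    if task_is_complex or failure_cost_is_high:
        return MODEL_TIERS["capable"]
    return MODEL_TIERS["cost_efficient"]

# Example: a FAQ bot is a focused task with low failure cost.
print(pick_starting_model(task_is_complex=False, failure_cost_is_high=False))
```

Keeping the tier decision in one place also makes the later upgrade/downgrade step a configuration change rather than a refactor.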
Step 3: Test with your actual prompts
Run your system prompt and representative inputs against the model you chose. Use real data: synthetic or simplified inputs won't surface the failure modes you'll see in production.

- Instruction fidelity: Does the model follow all constraints, including edge cases? Multi-constraint system prompts, where rules interact or conflict, are the most reliable stress test.
- Consistency: Does output quality hold across multiple turns and varied inputs? A model that performs well on the first turn but drifts over a conversation isn't production-ready.
- Format compliance: If you need structured output, does the model produce it reliably? Test with inputs that are ambiguous or malformed, as these are the cases most likely to break format adherence.
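Format compliance in particular is easy to score mechanically rather than by eye. A minimal sketch of such a check, assuming structured output as JSON; the schema and helper names here are illustrative, not part of any Berget AI API:

```python
import json

def check_json_output(raw: str, required_keys: set) -> bool:
    """Format-compliance check: output must be valid JSON (an object)
    containing every required key. Strict on purpose -- near-misses,
    like JSON wrapped in prose, count as failures."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def pass_rate(outputs: list, required_keys: set) -> float:
    """Fraction of model outputs that pass the format check."""
    if not outputs:
        return 0.0
    passed = sum(check_json_output(o, required_keys) for o in outputs)
    return passed / len(outputs)

# Example: one compliant output, one missing a key, one not valid JSON.
samples = [
    '{"intent": "refund", "confidence": 0.9}',
    '{"intent": "refund"}',                      # missing "confidence" -> fail
    'Sure! Here is the JSON: {"intent": ...}',   # prose wrapper -> fail
]
print(pass_rate(samples, {"intent", "confidence"}))  # 1 of 3 pass
```

Run the same check over responses to your ambiguous and malformed test inputs; the resulting pass rate gives you a concrete number to compare across models in Step 4.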
Step 4: Upgrade or downgrade based on results
If the model fails on instruction fidelity or consistency, move up a tier. If it passes comfortably on all criteria and cost is a concern, test the next tier down.
Signals that suggest upgrading
- The model drops instructions in multi-turn conversations
- Structured output format breaks on edge-case inputs
- The model can't handle the full complexity of your system prompt
Signals that suggest downgrading
- The model passes all your benchmarks with headroom to spare
- Response latency is higher than your application requires
- Cost is a constraint and you haven't yet tested a smaller model
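When cost is the deciding signal, it helps to translate per-million-token prices into cost per request and per month before comparing tiers. A minimal sketch of that arithmetic; the prices used in the example are made-up placeholders, not Berget AI's actual rates:

```python
def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, given per-million-token prices for
    input (prompt) and output (completion) tokens."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000

def monthly_cost(requests_per_day: int, avg_request_cost: float) -> float:
    """Rough monthly cost assuming a 30-day month."""
    return requests_per_day * 30 * avg_request_cost

# Example with placeholder prices: 1,500 prompt + 400 completion tokens
# at $0.50 / $1.50 per million input/output tokens.
per_req = cost_per_request(1_500, 400, 0.50, 1.50)
print(per_req)                      # cost of a single request
print(monthly_cost(10_000, per_req))  # at 10k requests/day
```

Comparing this number across tiers, alongside the pass rates from Step 3, turns the upgrade/downgrade decision into a concrete trade-off rather than a guess.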