Model chains

How combining specialised models in sequence can outperform a single large model, and when to use this pattern

Reaching for the largest model and asking it to do everything is the obvious move, but it's often the wrong one. A better approach is to chain specialised models, with each model handling the part of the task it's best at.

Why chains outperform single models

Large general-purpose models are trained to handle a wide range of tasks, but that breadth comes with trade-offs. A model optimised for general reasoning isn't necessarily the best at retrieving relevant context, reranking documents, or generating syntactically correct code. It's a generalist.

Specialised models, by contrast, are trained or fine-tuned for a narrow task. An embedding model like Multilingual E5 is specifically designed to produce vector representations that capture semantic similarity. A reranker like BGE Reranker is trained to score document relevance. A code model like DeepCoder is fine-tuned on programming tasks. Each of these models is better at its specific job than a general-purpose model of any size.

When you chain them, you get the best of each. The general-purpose model handles reasoning and generation. The embedding model handles retrieval. The reranker improves precision. The code model handles implementation.

Running a 70B-parameter model for every step of a pipeline is expensive; using smaller specialised models for the steps that don't need general reasoning cuts cost without giving up output quality.

Model chains in practice

Consider a pipeline that takes a user's question, retrieves relevant documentation, and generates a code answer. Here's what each stage does in a typical Berget AI model chain.

Understand the query (Mistral Small 3.2 24B)

The user's question arrives in natural language. Mistral Small reformulates it into a clean, unambiguous search query. This step is lightweight and doesn't require a large model; Mistral Small handles instruction following well at low cost.
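A minimal sketch of this step, assuming an OpenAI-compatible chat completions API; the model id and system prompt here are illustrative, not exact production values:

```python
def build_reformulation_request(question: str) -> dict:
    """Request body for the query-understanding step (OpenAI-compatible schema)."""
    return {
        "model": "mistralai/Mistral-Small-3.2-24B-Instruct",  # assumed model id
        "messages": [
            {
                "role": "system",
                "content": (
                    "Rewrite the user's question as a short, unambiguous "
                    "search query. Return only the query."
                ),
            },
            {"role": "user", "content": question},
        ],
        "temperature": 0.0,  # deterministic rewrites
    }
```

The body would be POSTed to the chat completions endpoint; temperature 0 keeps the rewrite stable across runs.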

Retrieve candidate documents (Multilingual E5)

The reformulated query is embedded using intfloat/multilingual-e5-large-instruct. The embedding is compared against a vector index of your documentation to retrieve the top candidates. E5 is trained specifically for this task and handles multilingual queries well, which matters if your users write in Swedish or other Nordic languages.
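Retrieval itself reduces to nearest-neighbour search over the embedding space. A pure-Python sketch of the scoring (a real deployment would use a vector database for the index lookup):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]],
          k: int = 5) -> list[str]:
    """Return the ids of the k documents whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Note that E5 instruct models expect a short task instruction prepended to the query text before embedding; check the model card for the exact format.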

Rerank for precision (BGE Reranker)

The candidate documents from the previous step are scored by BAAI/bge-reranker-v2-m3. Reranking is a separate task from retrieval: the reranker reads each document in the context of the query and assigns a relevance score. This step filters out documents that matched on surface similarity but aren't actually useful, improving the quality of context passed to the next step.
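The reranking step can be sketched as a filter over the candidates. The scorer is injected as a callable so the sketch stays model-agnostic; in production it would wrap a call to bge-reranker-v2-m3, and the threshold and cutoff values are illustrative:

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],  # in production: a bge-reranker-v2-m3 call
    threshold: float = 0.0,
    keep: int = 3,
) -> list[str]:
    """Score each candidate against the query, drop weak matches, keep the best."""
    scored = [(score(query, doc), doc) for doc in candidates]
    relevant = [(s, doc) for s, doc in scored if s >= threshold]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in relevant[:keep]]
```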

Generate the answer (DeepCoder 14B)

The top-ranked documents and the original question are passed to agentica-org/DeepCoder-14B-Preview. DeepCoder is fine-tuned for code generation, which makes it more reliable at producing syntactically correct code than a general-purpose model trained across broader tasks. The retrieved context grounds the answer in your actual documentation rather than the model's training data.
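Grounding comes from how the prompt is assembled: the reranked excerpts are placed ahead of the question with an instruction to rely on them. A hypothetical prompt builder (the wording is an assumption, not a required format for DeepCoder):

```python
def build_generation_prompt(question: str, documents: list[str]) -> str:
    """Combine reranked documentation excerpts with the original question."""
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer with working code, using only the documentation excerpts below. "
        "If the excerpts don't cover the question, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```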

Validate and format (Mistral Small 3.2 24B)

The generated code is passed back to Mistral Small for a final check: does the answer address the question, are there obvious errors, and does the output match the required format? This step catches issues before the response reaches the user.
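This check can be expressed as a simple gate: put the three questions to the validator model and only release the answer if it passes. A sketch with the model call injected; the "OK" reply protocol is an assumed convention, not a fixed API:

```python
from typing import Callable

def validate_answer(
    question: str,
    answer: str,
    check: Callable[[str], str],  # in production: a Mistral Small call (assumption)
) -> tuple[bool, str]:
    """Return (passed, verdict). The validator replies OK or describes the problem."""
    prompt = (
        "Does this answer address the question, avoid obvious errors, and match "
        "the required format? Reply OK, or describe the problem.\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    verdict = check(prompt).strip()
    return verdict.upper() == "OK", verdict
```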

The full chain runs five steps across four distinct models (Mistral Small appears twice), and only one of them (DeepCoder) is doing the computationally expensive generation step. The others are fast and cheap. The design gives you specialised accuracy at each step without paying frontier-model prices throughout the pipeline.
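Composed end to end, the orchestration is just a sequence of handoffs. Each stage is injected as a callable so the sketch stays independent of any particular client library (the function names are illustrative):

```python
from typing import Callable

def answer_question(
    question: str,
    reformulate: Callable[[str], str],              # Mistral Small
    embed: Callable[[str], list[float]],            # Multilingual E5
    search: Callable[[list[float]], list[str]],     # vector index lookup
    rerank: Callable[[str, list[str]], list[str]],  # BGE Reranker
    generate: Callable[[str, list[str]], str],      # DeepCoder
    validate: Callable[[str, str], str],            # Mistral Small, final check
) -> str:
    query = reformulate(question)        # 1. understand the query
    candidates = search(embed(query))    # 2. retrieve candidate documents
    context = rerank(query, candidates)  # 3. rerank for precision
    draft = generate(question, context)  # 4. generate the answer
    return validate(question, draft)     # 5. validate and format
```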

When to use this pattern

Model chains are worth considering when your task has distinct phases that map to different capabilities. Retrieval-augmented generation is the most common case, but the pattern applies broadly: document processing pipelines, multi-step classification, translation followed by analysis, and any workflow where you can identify a clear handoff between retrieval, reasoning, and generation.

The pattern is less useful for simple, single-turn tasks. If a user asks a question and you need a direct answer, a single well-chosen model is the right approach. Chains add latency and complexity; they're worth it when the quality improvement is real and measurable.
