START is an open-source LLM designed for precise calculations and code verification. It addresses two major issues that most standard models face: hallucinations and errors in multi-step calculations. This article explains why these problems arise and how START solves them.

Why START is needed
Modern reasoning models can solve very complex tasks impressively well. However, they struggle with two key problems: hallucinations and an inability to perform precise calculations.
Evidence:
Anthropic published a widely discussed study showing that LLMs can "cheat": they simulate reasoning yet arrive at incorrect conclusions.
Asked to solve a complex math problem, an LLM may output a logically plausible but wrong answer, simply because it cannot verify its steps the way a human can with a calculator, a code interpreter, or specialized software.
The hardest challenges for models include:
Multi-step computations: integrals, combinatorics, optimization, and so on.
Code generation and debugging: without real execution, the model cannot detect syntax errors or logical bugs.
Data analysis: for example, testing statistical hypotheses requires exact calculations, not guesses.
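To see why this matters, here is a minimal illustration (using the sympy library, which is not part of START itself) of a multi-step computation that is trivial for a tool but error-prone when done "in the model's head":

```python
import sympy as sp

# A definite integral that is easy to state but error-prone to evaluate mentally.
x = sp.symbols("x")
result = sp.integrate(x**2 * sp.exp(-x), (x, 0, sp.oo))

print(result)  # 2 -- an exact answer, not a guess
```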
The idea behind START comes from two directions:
Long Chains of Thought (Long CoT): reasoning where the model decomposes a task and tries to spot errors in the solution, mimicking human cognitive strategies.
Tool-Integrated Reasoning (TIR): an approach where the LLM decides when to call external tools, e.g., running Python code to perform calculations.
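As a rough sketch of the TIR loop (the function and the regex-based extraction below are illustrative assumptions, not START's actual implementation), the harness around the model might work like this:

```python
import re
import subprocess

def run_tool_step(model_output: str) -> str:
    # If the model emitted a Python block, execute it and feed the output back.
    match = re.search(r"```python\n(.*?)```", model_output, re.DOTALL)
    if match is None:
        return model_output  # no tool call; keep the reasoning as-is
    result = subprocess.run(
        ["python", "-c", match.group(1)],
        capture_output=True, text=True, timeout=30,
    )
    return model_output + "\nExecution output:\n" + (result.stdout or result.stderr)
```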
How START works
START is an LLM capable of delegating tasks to external tools when needed. Its breakthrough is auto-generating its own tool-call training dataset without any ready-made examples, using only hint prompts injected into the reasoning process of the pretrained QwQ model.
These hints are inserted after words like "Alternatively" or "Wait", where the model typically pauses to reconsider. The Python code the model then generates is executed, and the results are integrated back into the reasoning.
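A minimal sketch of that hint-insertion idea (the trigger words follow the article; the hint text itself is an illustrative assumption, not the paper's exact wording):

```python
TRIGGERS = ("Alternatively", "Wait")
HINT = "\nWait, maybe I should write Python code to verify this step.\n```python\n"

def maybe_insert_hint(partial_reasoning: str) -> str:
    # Append a tool-use hint when generation pauses at a trigger word.
    if partial_reasoning.rstrip().endswith(TRIGGERS):
        return partial_reasoning + HINT
    return partial_reasoning
```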
For example, for the task "Find the sum of the digits of 29!", the model first computes the factorial with code and then reasons over the exact result.
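The tool call itself is trivial:

```python
import math

n = math.factorial(29)  # 8841761993739701954543616000000
print(sum(int(d) for d in str(n)))  # 126
```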
The collected traces where TIR improved the outcome are used to train START-0, an intermediate model that feeds the final training stage.
Although START-0 has learned to call tools when solving problems, it may still do so suboptimally. Rejection Sampling Fine-Tuning (RFT) therefore generates multiple reasoning paths per task, selects the best ones, removes duplicates, and trains the final START on them.
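A simplified sketch of that selection step (sample_trajectory and is_correct are hypothetical placeholders for the training harness, not a real START API):

```python
def collect_rft_data(tasks, sample_trajectory, is_correct, k=16):
    # Keep only verified, deduplicated reasoning traces for fine-tuning.
    dataset, seen = [], set()
    for task in tasks:
        for _ in range(k):  # sample k candidate reasoning paths per task
            trace = sample_trajectory(task)
            if is_correct(task, trace) and trace not in seen:
                seen.add(trace)
                dataset.append((task, trace))
    return dataset
```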
How START differs from GPT-4 and Gemini
Narrow specialization: START is focused on tasks requiring calculations or code verification, not general dialogue.
Open-source: based on the QwQ open model, unlike GPT-4 or Gemini.
Available tools
Currently only Python, but the architecture supports adding more: SQL for analytics, WolframAlpha for symbolic math, or database APIs.
For example, to detect anomalies in sales, START could:
Generate an SQL query.
Fetch data.
Analyze it with Pandas.
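A sketch of what that pipeline could look like (the database file, table, and column names are invented for illustration):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")  # hypothetical database

# 1. Run a generated SQL query to fetch the data.
df = pd.read_sql_query(
    "SELECT day, SUM(amount) AS revenue FROM sales GROUP BY day", conn
)

# 2. Flag days whose revenue deviates by more than 3 standard deviations.
mean, std = df["revenue"].mean(), df["revenue"].std()
print(df[(df["revenue"] - mean).abs() > 3 * std])
```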
Teaching START to think using tools
Training START resembles coaching an intern: give them the right tools, reinforce the best results, then ask them to replicate the approach on new tasks.
Stages:
Data collection: 50,000 tasks (math, code, science), including AIME olympiad problems and hard GPQA questions.
Hint-infer: the base QwQ-32B model generated solutions while researchers inserted hints at key points, saving the successful examples where code solved the task.
Hint-RFT: multiple reasoning trajectories were generated; the best ones were selected, duplicates removed, and the model fine-tuned on the result.
Results
In math (AMC23), START reached 95% accuracy versus 80% for the base QwQ.
On science questions (GPQA), it scored 63.6%, comparable to top closed models.
Code generation improved (+5.9%), with better bug detection thanks to real code execution.
Where START excels
START is most useful for data analysis, automatic hypothesis testing via SQL and Python, and generating working code blocks.
Real-life example
An analyst checking why sales dropped in November could ask START, which would:
Generate SQL to extract data.
Build plots using matplotlib.
Detect anomalies via statistics.
Create a full report from the data without external help.
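A sketch of the plotting and anomaly-detection steps (the CSV file and column names are hypothetical; in practice START would fetch the data via SQL, as in the earlier example):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily sales for November.
sales = pd.read_csv("november_sales.csv", parse_dates=["date"])

# Plot the revenue trend.
plt.plot(sales["date"], sales["revenue"])
plt.title("Daily revenue, November")
plt.savefig("trend.png")

# Flag statistically unusual days via z-scores.
z = (sales["revenue"] - sales["revenue"].mean()) / sales["revenue"].std()
print(sales[z.abs() > 3])
```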
Speed matters
Although code execution adds latency, START reduces the number of iterations needed to solve a task. On a math problem, for example, START can get the answer in one pass, whereas a typical LLM might need several attempts because of repeated arithmetic mistakes.
Postscript: you can get this commercially, too
OpenAI's relatively new o3 model is, in essence, a commercial take on the same idea.
Its main feature: the model is trained to use tools during reasoning. It can not only search the web but also run code and access other tools, and it is multimodal, able to use images within its chain of thought.
On benchmarks it outperforms even the recently released Gemini 2.5 Pro Experimental, which is no surprise given that o3's training consumed 10x more compute than o1's.