Tokenomic
Frontier

The Progress of Large Language Models

Understanding the frontier of intelligence by tracking LLM progress.

First published Dec 2025 · Updated 1d ago

Data is Unreasonably Effective

You've probably heard how expensive it is for the frontier labs to train Large Language Models, and at some point in the last three years you've probably been impressed by a Large Language Model (or an app that uses, wraps or harnesses one, like Antigravity, Claude Code or Manus AI) doing something you thought was genuinely challenging, or, to use a more daring word, intelligent (or maybe just drawing a unicorn a day).

Well, what do you know: a model's intelligence and its cost (training on a ton of GPU compute is expensive in terms of electricity, cooling and Nvidia's tax) rise together. As one of the (well-advised) men at the forefront put it:

Feb 10, 2025

Sam Altman

The intelligence of an AI model roughly equals the log of the resources used to train and run it. These resources are chiefly training compute, data, and inference compute. It appears that you can spend arbitrary amounts of money and get continuous and predictable gains; the scaling laws that predict this are accurate over many orders of magnitude.

If you're interested in a deeper and more thoughtful take on scaling laws, I recommend reading Gwern's brilliant essay on the "scaling hypothesis".

What I've done here is plot accuracy on the GPQA Diamond benchmark (y-axis), practically the only benchmark that every model reports, against the compute it took to train the model (x-axis; models that don't declare this figure aren't plotted). Note that the x-axis is a log scale.

GPQA Diamond is a Google-proof, graduate-to-PhD-level MCQ benchmark (multiple-choice questions, 4 choices). If you squint a bit at the average line I've drawn on the graph, you'll see that empirical model performance shows a roughly log-linear relationship between training compute and accuracy. Accuracy has climbed from below the 25% random-guessing baseline (it's 4-choice multiple choice) to over 90% as training compute has increased.

[Figure: GPQA Diamond scatter plot. General trend: more training compute, better model performance. Source: Epoch AI]
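To make that "average line" concrete, here's a minimal sketch of fitting a log-linear trend, accuracy ≈ a + b · log10(training FLOPs); the data points below are made-up placeholders, not Epoch AI's actual figures.

```python
import numpy as np

# Placeholder (training FLOPs, GPQA Diamond accuracy) pairs, for illustration only;
# the real values come from Epoch AI's data and model cards.
flops = np.array([1e23, 3e23, 1e24, 5e24, 2e25, 1e26])
accuracy = np.array([0.28, 0.35, 0.42, 0.55, 0.68, 0.80])

# Fit accuracy ~ a + b * log10(FLOPs): a straight line on a log-x axis,
# i.e. the "average line" drawn through the scatter plot.
b, a = np.polyfit(np.log10(flops), accuracy, deg=1)
print(f"accuracy ~ {a:.2f} + {b:.3f} * log10(training FLOPs)")

# Extrapolate to a model trained with ~3e26 FLOPs.
print(f"predicted accuracy at 3e26 FLOPs: {a + b * np.log10(3e26):.2f}")
```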

Now if we zoom in on three of Meta's Llama 2 and Llama 3 models in the graph below, all other things held constant, we see a huge jump in GPQA Diamond performance from Llama-2-70b to Llama-3.1-70b. The difference: roughly 7.5 times the training data (~2T tokens vs ~15T tokens). More data (better cleaned and filtered, but still largely internet data), the same unsupervised training, better accuracy: data is unreasonably effective.

[Figure: GPQA Diamond scatter plot. Training on more data proves highly effective (until we exhaust good data). Source: Epoch AI]

Now, looking at the same graph, training Llama-3.1-405b on that same 15T tokens but increasing the parameter count to 405B (~5.7x larger) also improves accuracy on the benchmark, but to a significantly lesser extent. Serving that model in production is also prohibitively expensive, so you probably won't find many providers hosting the 405B nowadays. It was used in a "distillation" for Llama-3.3-70b, though.
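For a rough sense of scale, the usual back-of-the-envelope estimate for training compute is FLOPs ≈ 6 × parameters × tokens. The sketch below applies it to the token and parameter counts quoted above; the resulting FLOP totals are approximations, not Meta's reported numbers.

```python
# Back-of-the-envelope training compute: FLOPs ~ 6 * params * tokens.
# Parameter/token counts are the figures quoted above; outputs are rough estimates.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

runs = {
    "Llama-2-70B    (~2T tokens)": train_flops(70e9, 2e12),
    "Llama-3.1-70B  (~15T tokens)": train_flops(70e9, 15e12),
    "Llama-3.1-405B (~15T tokens)": train_flops(405e9, 15e12),
}
for name, flops in runs.items():
    print(f"{name}: ~{flops:.1e} FLOPs")

# Going 70B -> 405B at the same 15T tokens multiplies compute by ~5.8x,
# yet the GPQA gain is much smaller than the jump from 2T -> 15T tokens.
```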

If you look above the line, there are a bunch of models that look remarkably efficient on training compute while achieving great scores on GPQA Diamond. This is in some sense due to "algorithmic advances" (such as successfully pre-training and post-training sparse mixture-of-experts models like DeepSeek-V3, Qwen3-235B-A22B or Llama 4) that improved training efficiency. Mistral had already demonstrated the power of such models with their open-weights Mixtral model back in Dec 2023, but DeepSeek's release of R1 in Jan 2025 (shortly after V3), combined with test-time compute / reasoning advances, was the watershed moment, delivering frontier-level performance at a fraction of the cost.

[Figure: GPQA Diamond scatter plot. Sparse mixture-of-experts models are an algorithmic advance in training efficiency. Source: Epoch AI]

Why Does More Data Work So Well?

Researchers from Meta have shown through scaling/empirical experiments that when training on natural text data (i.e. cleaned internet data), models prioritize memorization until their capacity is saturated; then a double-descent phenomenon kicks in and generalization emerges, with the model forced to learn general, reusable patterns instead of sample-level specifics. This is consistent with what we saw above going from Llama-2-70b to Llama-3.1-70b with 7.5x the data: LLMs trained with extremely high data-to-parameter ratios still generalize.

Interestingly, the specific dense GPT-style models they studied (like Llama 2 and Llama 3) consistently memorize approximately 3.6 bits per parameter when trained in bf16 precision (the norm for dense models on A100s and H100s, as fp8 can be less stable). That's roughly the capacity limit before models are forced to start generalizing.
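A quick back-of-the-envelope comparison makes the point. The ~3.6 bits/parameter capacity comes from that study; the ~4 bytes of raw text per token used to size the corpus is my own rough assumption.

```python
# Rough comparison of memorization capacity vs. training-corpus size.
# ~3.6 bits/parameter is from the Meta memorization study cited above;
# ~4 bytes of raw text per token is a rough assumption for illustration.
BITS_PER_PARAM = 3.6
BYTES_PER_TOKEN = 4

def capacity_gb(params: float) -> float:
    return params * BITS_PER_PARAM / 8 / 1e9

def corpus_gb(tokens: float) -> float:
    return tokens * BYTES_PER_TOKEN / 1e9

for name, params, tokens in [
    ("Llama-2-70B", 70e9, 2e12),
    ("Llama-3.1-70B", 70e9, 15e12),
]:
    print(f"{name}: capacity ~{capacity_gb(params):,.0f} GB, "
          f"corpus ~{corpus_gb(tokens):,.0f} GB, "
          f"ratio ~{tokens / params:,.0f} tokens/param")

# Capacity (tens of GB) is a tiny fraction of the corpus (many TB),
# so the model is pushed past memorization into generalization.
```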

The Rapid Pace of Model Advancement

Looking at the same GPQA Diamond benchmark but putting time on the x-axis instead (which lets us place models that didn't declare their training compute), we've gone from below the random-guessing baseline to effectively solving the benchmark (>90%) in just two years. Interestingly, if you take that as an imperfect proxy for the frontier of intelligence, open-weight models have rapidly closed the gap in performance and now lag the frontier by only about six months.

[Figure: Model progress over time. Open-weights models roughly lag the frontier of closed-weights models by six months. Source: Epoch AI]

The Rise of Reasoning Models

As Andrej Karpathy said back in May 2023, LLMs can think better and solve harder problems with more tokens.

May 25, 2023

Andrej Karpathy

You have to really spread out the reasoning across more and more tokens. For example, you can't give a transformer a very complicated question and expect it to get the answer in a single token. There's just not enough time for it. These transformers need tokens to think.

From Jason Wei's seminal Chain-of-Thought Prompting paper in Jan 2022 (back when he was at Google), to the release of OpenAI's o1 model, to DeepSeek's R1, which put out in the open the techniques for achieving "o1-level" reasoning without "o1-level" resources or secrecy: each was a landmark that shaped the shift from language models as fast, intuitive pattern matchers ("System 1") to deliberate reasoners that utilize test-time compute, spending more time and computation during inference to "think" before they speak ("System 2"). (Special mention to Tatsu Hashimoto's group's fast follow on o1, s1, which introduced us to "budget forcing" and ultra budget-friendly reasoning fine-tuning.)
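For intuition, here's a conceptual sketch roughly in the spirit of s1's budget forcing: suppress the end-of-thinking delimiter and append "Wait" to keep the model reasoning for longer. The generate() stub and the delimiter string are stand-ins for illustration, not any real inference API.

```python
# Conceptual sketch of budget forcing: enforce a minimum amount of "thinking"
# by stripping the end-of-thinking marker and appending "Wait" between rounds.
# `generate` is a stub standing in for a real reasoning-model call.
from typing import Callable

END_OF_THINKING = "</think>"  # placeholder delimiter


def generate(prompt: str, max_tokens: int) -> str:
    """Stub: stands in for a call to an actual reasoning model."""
    return "...some chain of thought..." + END_OF_THINKING


def budget_forced_think(prompt: str,
                        generate: Callable[[str, int], str],
                        min_rounds: int = 2,
                        max_tokens_per_round: int = 512) -> str:
    trace = ""
    for round_idx in range(min_rounds):
        chunk = generate(prompt + trace, max_tokens_per_round)
        # Remove the end-of-thinking marker so the model doesn't stop early,
        # then append "Wait" to nudge another round of reasoning.
        trace += chunk.replace(END_OF_THINKING, "")
        if round_idx < min_rounds - 1:
            trace += "\nWait"
    return trace + END_OF_THINKING


print(budget_forced_think("Q: a hard problem...", generate))
```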

2025 was the year that reasoning models really went big in traffic, usage and production settings. Just looking at OpenRouter's aggregate traffic (it processes trillions of tokens each day) shows the shift toward reasoning models. The chart below stacks the total tokens served each month by their top 20 models, split between reasoning and non-reasoning traffic. We see a sharp climb in the use of reasoning models in the second half of 2025, and an inflection point around August where reasoning traffic starts to overtake non-reasoning traffic. OpenRouter's own "state of AI" 100-trillion-token study is an interesting read (and has more accurate figures than mine, which has gaps because the Wayback Machine didn't archive any snapshots in November).

[Figure: OpenRouter traffic history, monthly tokens served split between reasoning and non-reasoning models.]
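Here's a small sketch of how such a stacked view can be assembled, assuming per-model monthly token counts and a hand-labelled reasoning flag; the rows and numbers below are placeholders, not OpenRouter's actual figures.

```python
import pandas as pd

# Placeholder monthly token counts (billions) for a couple of models; real figures
# come from archived OpenRouter rankings, and the reasoning flag is hand-labelled.
records = [
    {"month": "2025-06", "model": "model-a", "reasoning": True,  "tokens_b": 900},
    {"month": "2025-06", "model": "model-b", "reasoning": False, "tokens_b": 1200},
    {"month": "2025-09", "model": "model-a", "reasoning": True,  "tokens_b": 1800},
    {"month": "2025-09", "model": "model-b", "reasoning": False, "tokens_b": 1100},
]

df = pd.DataFrame(records)
# Sum tokens per month, split by reasoning vs. non-reasoning: the two stacked series.
stacked = df.pivot_table(index="month", columns="reasoning",
                         values="tokens_b", aggfunc="sum")
print(stacked)
```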

Reasoning Models Can Be Cost Efficient

If we take real cost-per-token data referenced from OpenRouter for each model (x-axis, log scale, from expensive to cheap) and combine it with performance on GPQA Diamond (y-axis), we can see that reasoning models truly dominate benchmarks like GPQA Diamond (PhD- and graduate-level questions) and that per-token cost-efficiency is possible with reasoning models (trading off a little performance). Models towards the upper-right quadrant are the most cost-efficient. The performance king at the moment is gemini-3-pro, which tops the charts and, surprisingly, sits somewhere in the middle on cost (relatively cost-efficient compared to GPT-5.2 on xhigh reasoning effort). At a similar cost-efficiency tier sit models like GPT-5.2 on medium reasoning effort or Claude's Sonnet-4.5, which report lower performance numbers than their flagship counterparts.

Open-weight reasoning models like kimi-k2-thinking-turbo and Qwen3-235B-A22B-Thinking offer great price-to-performance ratios and can, of course, be self-hosted. I calculate the blended cost per million tokens using the same methodology as Artificial Analysis, blending input and output token prices at a 3:1 input-to-output ratio.
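In code, that blended figure is just a weighted average over the 3:1 mix; the prices below are placeholder values, not live OpenRouter quotes.

```python
# Blended cost per million tokens with a 3:1 input:output token mix,
# following the Artificial Analysis convention described above.
# Prices are placeholders (USD per million tokens), not live OpenRouter quotes.
def blended_cost(input_price: float, output_price: float) -> float:
    return (3 * input_price + 1 * output_price) / 4

print(blended_cost(input_price=1.25, output_price=10.00))  # -> 3.4375 USD per 1M tokens
```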

[Figure: Blended cost per million tokens vs. GPQA Diamond accuracy, reasoning vs. non-reasoning models. Several reasoning-first models sit on the efficient frontier for price and GPQA performance. Source: OpenRouter]
