These days, large language models (LLMs) can handle increasingly complex tasks, from writing intricate code to engaging in sophisticated reasoning. But when it comes to 4-digit multiplication, a task taught in elementary school, even state-of-the-art systems fail. Why? A new paper by Computer Science PhD student Xiaoyan Bai and Faculty Co-Director of the Novel Intelligence Research Initiative Chenhao Tan, along with collaborators from MIT, Harvard, the University of Waterloo, and Google DeepMind, reverse-engineers both failure and success to find answers.

As you may remember (or have forgotten), multiplying larger numbers requires carrying over digits and mentally “holding on” to partial products so you can add them up to get your final answer. Dependencies like these, where information computed early must be stored for use much later, are called “long-range dependencies.”
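To make that bookkeeping concrete, here is a minimal sketch of schoolbook multiplication in Python (the function name and structure are our own illustration, not the paper’s). The partial products computed in the inner loop must be held on to until the carry pass at the end, which is exactly the kind of long-range dependency the article describes.

```python
def long_multiply(a, b):
    """Schoolbook multiplication: compute per-digit partial products,
    then resolve carries — bookkeeping a model must do internally."""
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant first
    b_digits = [int(d) for d in str(b)][::-1]
    # one accumulator slot per output digit position
    acc = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        for j, db in enumerate(b_digits):
            acc[i + j] += da * db              # store partial product for later
    carry = 0
    for k in range(len(acc)):                  # carry pass, least-significant first
        total = acc[k] + carry
        acc[k] = total % 10
        carry = total // 10
    return int("".join(map(str, acc[::-1])).lstrip("0") or "0")

assert long_multiply(1234, 5678) == 1234 * 5678
```

Note how no single step is hard; the difficulty is that early results must survive, unchanged, until much later steps need them.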

Standard models work by learning to recognize patterns in the data they’re trained on. But the more complex a problem gets, the less likely a model is to have seen it specifically. So how do you teach a model to not just memorize answers but learn a process?

Why Standard Training Fails

Models are often taught new tasks via supervised fine-tuning (SFT), which relies on scaling up the training data or adding more layers to the network. But even when the research team tested models ranging from 2 layers all the way up to 12 layers, all of them achieved less than 1% accuracy on 4×4-digit multiplication. Why were the standard approaches failing here?

The researchers found that under the SFT approach with gradient descent (an iterative optimization algorithm), models converge to a “local optimum”: a solution that beats its immediate neighbors but falls short of the best solution overall. Crucially, the solution the models settle into fails to capture those long-range dependencies.

The problem isn’t a lack of training, the team found. Rather, the model is trapped: without an architecture that lets it store and retrieve intermediate information, it can’t step beyond that local optimum, no matter how long it trains or how large it scales.

What ICoT Does Differently

Next, the researchers examined a model trained using a different method: Implicit Chain of Thought (ICoT). Where SFT achieved less than 1% accuracy, the ICoT model achieved 100% accuracy. To understand what this approach was doing differently, the team took both models apart to uncover some fundamental insights:

The ICoT model learns to remember what matters. Unlike the SFT model, the ICoT model learned to track those “long-range dependencies.” The team verified this by testing whether they could decode intermediate values (like running sums) from the models’ internal states. In the ICoT model, they could; in the standard model, they couldn’t. The ICoT method gradually removes intermediate reasoning steps during training, in a sense forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.
The ICoT model organizes its attention into branches across time. Think of it like a well-organized filing system: in early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the “cached” products needed for each output digit. This creates an efficient directed graph for implementing the multiplication algorithm, a structure the standard model never develops.
Mathematics rendered in geometric form: Perhaps most remarkably, the model represents digits and their operations using elegant mathematical structures. Digits are encoded using wave-like patterns (Fourier bases) that form a pentagonal prism shape in the model’s internal representation space. When multiplying digit pairs, the model uses a natural geometric operation called a Minkowski sum, which notably wasn’t programmed by the researchers, but rather emerged naturally during training in the ICoT model. It’s as if the successful model derived its own efficient mathematical language for arithmetic.
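A toy sketch can show why wave-like encodings are so convenient for arithmetic, though it is only an illustration of the general idea, not the paper’s actual pentagonal-prism representation or its Minkowski-sum operation. If each digit is placed on a circle by its Fourier angle, then combining two encodings by adding angles implements addition mod 10, the operation behind carrying:

```python
import math

def fourier_encode(d, base=10):
    """Toy Fourier feature: place digit d on the unit circle."""
    angle = 2 * math.pi * d / base
    return (math.cos(angle), math.sin(angle))

def rotate(u, v):
    """Complex multiplication: angles add, so digits add mod 10."""
    return (u[0] * v[0] - u[1] * v[1], u[0] * v[1] + u[1] * v[0])

# (7 + 5) mod 10 = 2, computed purely in the geometric encoding
enc = rotate(fourier_encode(7), fourier_encode(5))
expected = fourier_encode(2)
assert all(abs(a - b) < 1e-9 for a, b in zip(enc, expected))
```

Geometry does the modular arithmetic “for free,” which hints at why a model might converge on such representations on its own.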

A Simple Fix

If the SFT models’ struggle with long-range dependencies was about missing inductive biases, then providing the right training signal should fix it. To validate their understanding, the team introduced a simple solution: they added an “auxiliary loss” that trains lightweight linear probes to predict running sums at each step, capturing and carrying intermediate values and partial products.
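The idea of a linear probe can be sketched in a few lines of plain Python. This is a hypothetical stand-in, not the paper’s implementation: the 4-dimensional “hidden states” and the coefficients encoding the running sum are invented for illustration. The point is that a probe is just a linear map trained to read a target quantity out of a hidden state, and its error can be added to the training objective.

```python
import random

random.seed(0)

# Toy stand-in for hidden states: 4-dim vectors that happen to encode a
# "running sum" linearly (the coefficients 3 and -2 are arbitrary).
def make_example():
    h = [random.uniform(-1, 1) for _ in range(4)]
    running_sum = 3 * h[0] - 2 * h[2]
    return h, running_sum

def probe(w, b, h):
    # lightweight linear probe: w . h + b
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def aux_loss(w, b, batch):
    # mean squared error between probe output and target running sum
    return sum((probe(w, b, h) - y) ** 2 for h, y in batch) / len(batch)

batch = [make_example() for _ in range(64)]
w, b = [0.0] * 4, 0.0
for _ in range(500):  # plain gradient descent on the probe parameters
    gw, gb = [0.0] * 4, 0.0
    for h, y in batch:
        err = probe(w, b, h) - y
        for i in range(4):
            gw[i] += 2 * err * h[i] / len(batch)
        gb += 2 * err / len(batch)
    w = [wi - 0.1 * gi for wi, gi in zip(w, gw)]
    b -= 0.1 * gb

# A near-zero probe loss means the running sum is linearly decodable from
# the hidden state; during model training, this loss would be added to the
# main objective (with a small weight) to encourage exactly that.
```

In the real setup the hidden states come from the transformer and the probe loss shapes the model’s representations; here the states are fixed and only the probe learns, which is enough to show the mechanics.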

It turned out that making this one addition to the 2-layer model that completely failed under standard training did the trick. The result: 99% accuracy without explicit chain-of-thought supervision.

Inspecting this model’s attention patterns revealed that it learned mechanisms similar to ICoT’s, including the sparse binary tree structure for caching and retrieving partial products. However, it had also developed additional strategies, including an “attention head” that allows it to track all of the necessary digit pairs simultaneously.

Novel Intelligence and the Jagged Frontier

While multiplication might seem like a narrow task, the findings illuminate fundamental aspects of how transformers learn and “think.” The long-range dependency problem isn’t unique to arithmetic; it appears throughout language modeling and other sequential tasks, and it illustrates AI’s “jagged frontier”: the capacity to excel at complex reasoning yet stumble on seemingly simple tasks.

The team’s approach asks foundational questions about the distinctions between memorization and learning, and what architectural constraints help or hinder models’ performance.

“As AI is increasingly integrated into critical decision-making, it’s essential to understand its unique ways of learning and thinking,” said Tan. “Our research is trying to chart that terrain.”

This paper’s key contribution: architectural insights and training techniques can overcome obstacles that scaling alone cannot address. The right inductive biases, not just more parameters or data, are key to pushing AI capabilities forward. While the auxiliary loss solution is task-specific, the researchers anticipate future work will develop more general approaches to improve learning on tasks requiring long-range dependencies.

This article originated from the Data Science Institute.
