Blog / AI Limitations

Can You Trust Generative AI With Numbers?

How LLMs Really "Do" Math And What It Means for Finance, Audit, and Tax

Wed, Nov 19, 2025

Robot counting strawberries in Van Gogh style

If you’ve ever pasted a few numbers into ChatGPT and gotten a convincing answer back, it’s tempting to think:

"Nice, this can replace my calculator, my spreadsheet, and maybe my junior analyst."

My experiments below, together with the last few years of research on large language models (LLMs), say: not so fast.

In this explainer I’ll walk through how LLMs actually "calculate," what my own experiments show, why those failures happen, and what it all means for finance, audit, and tax.

1. How LLMs Really "Calculate" 2 + 2

Modern LLMs like GPT-4 are transformer models trained on huge corpora of text to do one thing: predict the next token (i.e., piece of text) in a sequence. There is no built-in calculator or explicit arithmetic module in the base model. When we ask:

"What is 2 + 2?"

The model is not running an internal version of Excel. Under the hood, a few things happen:

Tokenization

The text "2 + 2" is split into tokens like "2", "+", "2". Each token becomes a vector in a high-dimensional space.
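To make the tokenization step concrete, here is a toy sketch. This is a crude regex splitter of my own, not a real BPE tokenizer; actual models like GPT-4 learn their splits from data and may chunk a long number like "658251" into several multi-digit pieces rather than single digits.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Crude illustration only: split into digit runs, arithmetic
    # operators, and words. Real LLM tokenizers (byte-pair encoding)
    # learn merges statistically and split numbers unpredictably.
    return re.findall(r"\d+|[+\-*/=]|\w+", text)

print(toy_tokenize("2 + 2 ="))  # ['2', '+', '2', '=']
```

The key point survives the simplification: the model never sees "numbers," only token IDs that it must map into a vector space.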

Pattern matching over training data

During training, the base model has seen billions of snippets including "2 + 2 = 4", flashcard-style arithmetic, code examples, textbook problems, etc. The internal parameters have adjusted so that, when it sees the token sequence "2 + 2 =" (or similar), the token "4" has very high probability as the next output.
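The statistical idea can be caricatured in a few lines of Python. This is a deliberately naive frequency count over a made-up corpus, nothing like a real transformer, but it shows why "4" ends up dominating the distribution after the context "2 + 2 =":

```python
from collections import Counter

# Tiny stand-in "training corpus" of arithmetic snippets (made up).
corpus = ["2 + 2 = 4", "2 + 2 = 4", "2 + 2 = 4", "2 + 2 = 5", "3 + 3 = 6"]

# Count what follows the context "2 + 2 =" and normalize into
# next-token probabilities, the way a frequency model would.
context = "2 + 2 ="
counts = Counter(line.split("=")[1].strip()
                 for line in corpus if line.startswith(context))
total = sum(counts.values())
probs = {tok: n / total for tok, n in counts.items()}
print(probs)  # {'4': 0.75, '5': 0.25}
```

Note that the wrong answer "5" still carries real probability mass: the model learned it from noisy text, not from arithmetic rules.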

Soft algorithmic behavior (sometimes...)

For tiny problems like 17 + 58, the model can’t just rely on direct memorization of every possible pair. There are too many combinations. Instead, transformers can approximate algorithm-like behavior (such as carrying digits) by layering attention and non-linear transformations. But this behavior is approximate and often degrades sharply as numbers get longer (e.g., 6- or 7-digit arithmetic) or operations get more complex (multi-digit multiplication).
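For contrast, here is the exact schoolbook carrying algorithm that a deterministic program runs and a transformer can only approximate. Applied to the 6-digit sum from the error gallery below, it gets every carry right, every time:

```python
def add_with_carries(a: str, b: str) -> str:
    # Explicit schoolbook addition over digit strings: process digits
    # right to left, propagating the carry exactly.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_with_carries("658251", "858878"))  # 1517129
```

The algorithm has no notion of "plausible": it either follows the carry rule or it doesn't, which is precisely the property the attention-based approximation lacks.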

Chain-of-thought (CoT) as a "scratchpad"

When you prompt models to "show your working," they generate step-by-step reasoning before giving a final number. CoT improves performance on many math benchmarks but still doesn’t turn the LLM into a reliable symbolic math engine. It’s a structured way of doing the same next-token prediction. (see Lewkowycz et al. 2022)

Overall, LLMs are probabilistic pattern matchers (or "stochastic parrots", as some researchers have called them), not deterministic, rule-based theorem provers. They emulate arithmetic by learning statistical regularities in text, not by implementing the exact algorithms you’d write in code.

That distinction is the foundation for understanding both where these models shine and exactly where they fail in dangerous ways.

2. What My Experiments Show: From Perfect Sums to Broken Multiplication

I ran two sets of experiments using OpenAI’s API, focusing on a compact model (GPT-4o-mini) with temperature set to 0 (deterministic outputs).

Experiment 1: Small, Fixed Dataset (Everything Looks Perfect!)

First, I tested four very simple categories, each with 4 fixed problems.

I ran 50 iterations, with all 16 questions asked each time (800 API calls total).

Result: 100% accuracy in every category, every run!

Chart showing 100% accuracy on easy math tasks

Experiment 1: Perfect accuracy on simple arithmetic

If you only ever tested a model this way, you’d come away believing:

"This LLM is rock solid at basic arithmetic and simple word problems. Let’s use it for everything."

That’s exactly the kind of overconfidence many teams fall into.

Experiment 2: Larger, Randomized Dataset (The Cracks Appear)

Next, I scaled up and randomized:

For each category, I generated 40 random problems, then ran 10 iterations per category with re-randomization. That’s enough to average out quirks and get a sense of variability.
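A minimal sketch of the randomized design, under stated assumptions: my actual harness calls the OpenAI API, but the model call is omitted here so the sketch stays self-contained, and the helper names (`make_problem`, `score`) are my own, not part of any library.

```python
import random

def make_problem(digits: int, op: str) -> tuple[str, int]:
    # Generate one random problem in a category, mirroring the
    # 40-problems-per-category, re-randomized design described above.
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    answer = a + b if op == "+" else a * b
    return f"What is {a} {op} {b}?", answer

def score(model_answers: list[int], truths: list[int]) -> float:
    # Fraction of exact matches; arithmetic is graded pass/fail.
    return sum(m == t for m, t in zip(model_answers, truths)) / len(truths)

random.seed(0)
problems = [make_problem(4, "*") for _ in range(40)]
print(len(problems), "problems generated")
```

In the real harness, each question string goes to the API at temperature 0 and the parsed reply is scored against the exact answer.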

The results (from these ~2,400 API runs):

Chart showing LLM accuracy dropping sharply as arithmetic complexity increases

Accuracy across different arithmetic task categories

Same model, same temperature 0 settings, yet very different results once the problems got longer and more varied.

Visually, the chart these evaluations generated tells the story: accuracy stays near-perfect on simple addition but collapses as the digits multiply, down to roughly 2% on 4-digit multiplication.

You might wonder, "Doesn’t ChatGPT solve this using Python now?" Yes, consumer versions like ChatGPT Plus or Copilot can detect math problems and write hidden Python code to solve them perfectly. However, my experiments test the raw LLM API, which is how most enterprise financial tools and automated audit pipelines are currently built (e.g., categorizing thousands of expenses). If you integrate a model into your compliance workflow without explicitly configuring a code-execution sandbox, you are relying on the probabilistic "mental math" shown above, and exposing your firm to these exact error rates.
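One sketch of what the code-execution safeguard can look like in a raw-API pipeline: rather than accepting the model's number, treat its output as an arithmetic expression and evaluate it deterministically with Python's `ast` module. The `model_output` string below is hypothetical, and a production pipeline would also need to handle malformed model output.

```python
import ast
import operator

# Whitelist of permitted arithmetic operations.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def safe_eval(expr: str) -> int:
    # Parse the expression into an AST and evaluate only whitelisted
    # binary operations on integer literals; reject everything else.
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

model_output = "5408 * 3388"  # hypothetical: model proposes the expression
print(safe_eval(model_output))  # 18322304
```

The division of labor is the point: the LLM identifies what to compute; deterministic code computes it.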

Error gallery: what LLM math mistakes actually look like

The pattern here is worrying for regulated work: the model’s mistakes are rarely obvious nonsense. They are plausible numbers that are off by a few tens, hundreds, or an extra digit. In financial reporting, tax, or capital calculations, that’s exactly the kind of quiet error that slips into a spreadsheet and becomes a disclosure.

| Category | Problem | Model answer | Correct | Error type | Why this matters |
| --- | --- | --- | --- | --- | --- |
| 6-digit addition | 658,251 + 858,878 | 1,513,129 | 1,517,129 | Missed carry in thousands place | Off by 4,000 but still looks perfectly reasonable: exactly the kind of quiet understatement that could slip into a tax filing. |
| 2-digit multiplication | 31 × 87 | 2,707 | 2,697 | Small arithmetic slip | Only 10 off. A quick human skim might never notice, especially buried inside a longer GenAI-generated explanation. |
| 4-digit multiplication | 5,408 × 3,388 | 29,130,224 | 18,322,304 | Digit-level error, same order of magnitude | Off by nearly 11 million, yet the same number of digits, so it still "feels" plausible. If this feeds into a ratio or risk metric, the downstream impact is invisible. |
| 4-digit multiplication | 8,884 × 9,921 | 881,196,964 | 88,138,164 | Order-of-magnitude error, extra digit | The model adds an extra digit and mangles the partial products. Any context that treats this as a "calculator" is completely exposed. |
| Multi-step word problem | 9 boxes × 11 pencils − 86 | 3 | 13 | Wrong subtraction after correct multiplication | Structure is right (multiply then subtract), but the final subtraction is off. Looks like valid reasoning with a quiet numeric mistake. |
| Multi-step word problem | Sam walks 27km, 31km, 28km | 159 | 129 | Sum-of-terms error | All numbers are sensible, but the total is off by 30 km. In a tax or capital calculation, that's a non-obvious reconciliation difference. |

These patterns line up strikingly well with what we see in current research on LLM arithmetic.

So the experiments above are not a statistical fluke. They’re a clean, operational demonstration of a general phenomenon:

LLMs can be extremely reliable for some numeric tasks and disastrously unreliable for others that look superficially similar.

3. Why This Happens: Pattern Matching vs. Rule-Following

To see why 4-digit multiplication fails where 2-digit multiplication mostly works, it helps to distinguish two paradigms:

3.1 A Calculator (Rule-Based System)

A calculator or a Python script follows explicit rules: it executes a fixed algorithm step by step, produces the same output for the same input every time, and fails loudly rather than plausibly.

3.2 A Large Language Model (Statistical Sequence Model)

An LLM predicts the statistically most likely next token given its training data. There is no guaranteed algorithm underneath, and its wrong answers look just as fluent as its right ones.

Research on arithmetic in LLMs consistently finds that accuracy degrades as operands get longer and operations get more complex, even when the model handles short versions of the same task perfectly.

In other words: the model interpolates over patterns it has seen rather than executing the underlying algorithm, so reliability falls off exactly where memorized patterns run out.

The experiments above capture that transition almost perfectly.
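The two paradigms can be caricatured side by side. The `toy_llm` function below is purely illustrative (the candidate answers are borrowed from the 4-digit multiplication row of the error gallery), but it captures the key point: a wrong completion can be the most probable one, so even temperature 0 buys you determinism, not correctness.

```python
import random

def calculator(a: int, b: int) -> int:
    # Rule-based: executes an exact algorithm; always correct.
    return a * b

def toy_llm(prompt: str, temperature: float) -> str:
    # Illustrative stand-in for next-token sampling. Candidates echo
    # the error gallery: here the most probable completion is wrong.
    candidates = ["29130224", "18322304", "881196964"]
    weights = [0.5, 0.4, 0.1]
    if temperature == 0:
        return candidates[0]  # greedy decoding: deterministic, still wrong
    return random.choices(candidates, weights=weights)[0]

print(calculator(5408, 3388))        # 18322304
print(toy_llm("5408 * 3388 =", 0))   # 29130224
```

The calculator is right by construction; the sampler is only as right as its learned distribution happens to be.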

4. Why This Matters for Finance, Audit, and Tax

If you work in finance, audit, or tax, you live in a world where numerical errors are not just embarrassing; they can become regulatory events.

At the same time, regulators and industry bodies are actively exploring how AI and GenAI fit into financial services and compliance.

The very practical message of the experiments above: treat LLMs as extremely capable junior analysts who are surprisingly bad at some very specific kinds of arithmetic.

4.1 Concrete Risks for Compliance Teams

The chart above implies concrete failures: a quietly understated total in a tax filing, a corrupted figure feeding a downstream ratio or risk metric, a reconciliation difference nobody can trace, each produced with full confidence and plausible-looking numbers.

5. A Practical "Math Safety" Framework for Using GenAI in Finance, Audit, and Tax

Rather than banning GenAI outright, it’s more useful to draw a clean line between tasks where language is the deliverable and tasks where an exact number is the deliverable.

5.1 Green Zone: Recommended Uses

These are use cases where GenAI’s strengths align with our risk appetite: explaining standards and regulations in plain language, drafting and summarizing narrative text, and helping document and test controls, work where the deliverable is language, not a number.

5.2 Yellow Zone: Allowed with Strong Controls

These uses can be valuable but must be paired with deterministic tools: for example, letting the model draft a calculation approach or critique a formula while the actual figures are computed in a spreadsheet or code.

Control principle: LLMs can propose or critique the math, but the final numbers must come from a system that’s designed to do arithmetic.
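In code, the control principle can be as simple as a deterministic cross-check before any model-proposed figure leaves the pipeline. The `verify_total` helper below is a hypothetical sketch of mine, shown against the 6-digit addition error from the gallery above:

```python
def verify_total(claimed: int, components: list[float], tol: float = 0.005) -> bool:
    # Recompute the sum deterministically and compare it with the
    # figure the model claimed; reject anything outside tolerance.
    return abs(sum(components) - claimed) <= tol

# From the error gallery: the model claimed 1,513,129 for
# 658,251 + 858,878 (true total 1,517,129), a 4,000 understatement.
print(verify_total(1_513_129, [658_251, 858_878]))  # False
print(verify_total(1_517_129, [658_251, 858_878]))  # True
```

The model may draft the narrative around the number, but the number itself only passes if deterministic recomputation agrees.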

5.3 Red Zone: Avoid or Explicitly Prohibit

These are use cases your AI policy should flag as not allowed (or only allowed with a formal model-risk approval): any workflow where a number produced directly by the model, with no deterministic recomputation, ends up in a filing, a client deliverable, or a regulatory submission.

My earlier results (roughly 2% accuracy on 4-digit multiplication) are a powerful reminder: if the model is wrong 98% of the time on a basic numeric operation, we simply cannot treat it as a calculator for high-stakes workflows.

6. The Bottom Line

For non-technical colleagues, we can condense all this into three simple rules:

  1. Use GenAI to understand the math, not to be the math.
  2. If the number goes to a regulator, a client, or the tax authority, it must come from a real calculator, spreadsheet, or code. It cannot come from a chat box.
  3. Treat every numeric answer from GenAI as a suggestion, not a fact, unless you’ve independently verified it. Click, read, and confirm.

Large language models are already transforming how finance, audit, and tax professionals read, write, and reason about complex rules. But when it comes to doing arithmetic, they’re powerful mimics, not reliable calculators.

Used wisely, LLMs can free skilled people from rote work, help document and test controls, and deepen understanding of complex standards. Used as a black-box calculator, they’re a quiet source of model risk.

Let LLMs explain the math. Let code and spreadsheets do the math. That single separation will save you from the worst kind of model risk: numbers that look plausible but are simply wrong.

Back to blog