Blog | Kelvin Law

Pigeon in a corridor lined with crossed-out creature signs

AI Governance

Tue, May 5, 2026

When AI guardrails backfire

The things we add to AI systems to make them safer can make them worse. OpenAI’s goblin problem and Anthropic’s Claude Code postmortem show why hidden guardrails need versioning, regression testing, and monitoring.


Robot staring into a fish tank marked 80

AI Limitations

Fri, Mar 20, 2026

Your AI can’t count

I asked eight AI models to count subsidiaries in GE Aerospace’s SEC filing. Answers ranged from 14 to 228. The correct answer is 80. The failures reveal how retrieval, chunking, and attention windows silently corrupt financial data extraction.


AI Governance

Wed, Dec 10, 2025

Don’t tell AI what time it is

Adding timestamps for audit trails drops accuracy by 10%. The compliance mechanism undermines the output being audited. This counterintuitive finding reveals how irrelevant context—even seemingly harmless metadata—can degrade model performance. For regulated industries requiring audit logs, this creates a fundamental tension.


AI Evaluation

Wed, Nov 26, 2025

Is my LLM getting dumber, or is it just me?

Users complain that GPT models get worse over time. OpenAI denies it. I designed 43 tests and ran them repeatedly. The results show a clear U-shaped pattern: GPT-4 started at 86% accuracy, dropped to 46.5% with GPT-4-turbo, then recovered. Teams need continuous evaluation, not blind trust in a model label.


Robot counting strawberries

AI Limitations

Wed, Nov 19, 2025

Can you trust GenAI with numbers?

LLMs ace easy sums but fail at 4-digit multiplication. This isn’t a bug—it’s how tokenization works. For finance, audit, and tax teams, understanding these limitations is critical. The model that writes your report may silently miscalculate totals. I tested GPT-4 on arithmetic and the results reveal systematic patterns of failure.
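
One practical response to this limitation is to recompute any model-reported figure with exact integer arithmetic. The snippet below is a minimal sketch, not the test harness from the post: `check_product` is a hypothetical helper name, and it assumes the model's answer arrives as a string that may contain thousands separators.

```python
def check_product(a: int, b: int, model_answer: str) -> bool:
    """Return True if the model's stated answer equals the exact product a * b."""
    # Assumption: answers may be formatted like "7,006,652"; strip commas and spaces.
    cleaned = model_answer.replace(",", "").strip()
    try:
        return int(cleaned) == a * b
    except ValueError:
        # Non-numeric output (e.g. prose) counts as a failed check.
        return False


# Python integers are arbitrary precision, so the reference value is exact.
print(check_product(1234, 5678, "7,006,652"))
print(check_product(1234, 5678, "7006552"))
```

The check treats the model as untrusted: the reference value comes from ordinary integer multiplication, never from a second model call.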


AI Methods

Sun, Nov 9, 2025

What is Chain-of-Thought Prompting?

Chain-of-Thought prompting encourages LLMs to generate intermediate reasoning steps before producing a final answer. This technique, pioneered by Wei et al. (2022), allows large models to solve complex problems more accurately by breaking them into manageable steps. For finance and accounting researchers, CoT is particularly valuable for tasks requiring multi-step analysis.
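
As a rough illustration of the idea, the sketch below builds a prompt that asks for intermediate steps before the final answer, then pulls the final answer out of a response. The function names, the prompt wording, and the `Answer:` convention are all assumptions made for this example; no model API is called.

```python
from typing import Optional


def build_cot_prompt(question: str) -> str:
    """Wrap a question in an instruction that asks for reasoning steps first."""
    return (
        f"Question: {question}\n"
        "Work through the problem step by step, showing each intermediate calculation.\n"
        "Then give the result on a final line starting with 'Answer:'."
    )


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.startswith("Answer:"):
            return stripped[len("Answer:"):].strip()
    return None


# Hand-written example response, standing in for model output.
sample_response = (
    "Step 1: Revenue is 120 and costs are 85.\n"
    "Step 2: Profit = 120 - 85 = 35.\n"
    "Answer: 35"
)
print(build_cot_prompt("What is the profit if revenue is 120 and costs are 85?"))
print(extract_final_answer(sample_response))
```

Keeping the final answer on a marked line makes it possible to score the result automatically while leaving the reasoning steps available for review.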
