Blog
Beyond Next-Token Prediction: The Case for Immutable Verification
If the bottleneck between AI and mathematical reasoning is the output channel, the fix is a better channel. Formal verification — proof assistants like Lean — provides the lossless interface that transforms LLMs from pattern-matchers into trustworthy reasoning agents.
Here’s a thought experiment. You have an AI system that has internalized deep mathematical structure — residue classes, binary representations, algebraic relationships. You’ve seen the evidence that the knowledge is in there (if you haven’t, see my previous post). Now you need that system to produce a mathematical proof.
You have two options.
Option A: Ask it to write the proof in English. Review it yourself. Hope you catch any errors — knowing that, as Terence Tao has pointed out, if an LLM submits a flawed proof, it’s significantly harder to detect than a flawed human proof.
Option B: Ask it to write the proof in a formal language where every logical step is mechanically checked by a compiler that accepts no hand-waving, no shortcuts, and no “trust me.” Either the proof compiles, or it doesn’t. There is no gray area.
Option B exists. It’s called formal verification. And it doesn’t just make AI outputs more reliable — it fundamentally changes what AI systems can do.
What Formal Verification Actually Is
A proof assistant like Lean is, at its core, an incorruptible judge. You write mathematical statements and proofs in a precise formal language, and the system checks every step against the foundational rules of logic. There’s no ambiguity, no room for subtle errors, no possibility of a plausible-sounding argument that quietly skips a step.
The difference between informally “checking” a proof and formally verifying one is the same difference between testing code and proving code correct:
Testing checks that your code works on the inputs you tried. Verification proves it works on every possible input.
Formally verified code — and formally verified mathematics — is a source of truth. It’s an answer key. You can trust it with the highest stakes because the verification is mechanical, exhaustive, and deterministic. The computer doesn’t get tired, doesn’t get biased, doesn’t have a bad day.
This matters especially for AI-generated outputs. When a human submits a flawed proof, mathematicians can usually spot the gap — the reasoning style is familiar, the types of errors are predictable. But when an LLM submits a flawed proof, the failure modes are unfamiliar and harder to catch. The verification layer eliminates this problem entirely. It doesn’t matter who generated the proof — human or machine. Either it passes the formal check, or it doesn’t.
The Paradigm Shift
For most of the history of AI and mathematics, the workflow has looked like this:
An LLM generates text. A human eyeballs it for plausibility. If it looks right, it’s accepted. If something seems off, it’s rejected or revised. The quality of the output depends on the quality of the reviewer.
The new paradigm inverts the trust model. The LLM generates output in a formal language. A verification layer — the Lean compiler — mechanically checks every logical step. The output is either proven correct or rejected. There’s no “looks right.” There’s no middle ground.
This is what “correctness by construction” means. The AI is no longer rewarded for sounding right — only for being right. It’s a qualitative shift, not an incremental improvement over the old workflow.
AlphaGeometry: The Proof of Concept
If this sounds theoretical, it isn’t. DeepMind’s AlphaGeometry system demonstrated the paradigm working in practice — and the results were extraordinary.
AlphaGeometry encoded geometric objects — triangles, circles, lines, intersection points — into a symbolic language. It then combined two components:
- A neural network that contributed creative leaps: suggesting auxiliary constructions, proposing proof strategies, generating the kind of intuitive “what if we tried this?” moves that drive mathematical discovery.
- A symbolic deduction engine that ensured rigor: taking each suggested step and checking it against the formal rules of Euclidean geometry, rejecting anything that didn’t hold up.
The result: AlphaGeometry solved International Mathematical Olympiad geometry problems at a silver-medalist level. Not by memorizing solutions, but by reasoning — generating creative proof strategies and verifying each step against an uncompromising formal system.
The key insight isn’t just that it worked. It’s how it worked. The neural network was free to be creative, to explore unconventional strategies, to make the kind of speculative leaps that drive mathematical breakthroughs. The symbolic layer ensured that creativity was channeled into provably correct results. Neither component alone would have been sufficient. Together, they achieved something neither could do alone.
The Ecosystem: Mathlib, CSLib, and the Infrastructure Being Built Now
A formal verification layer is only as useful as the library of formal knowledge it can draw on. If an AI needs to invoke a theorem about prime numbers in the middle of a proof, that theorem needs to already exist in a formalized, machine-readable form.
This is where the ecosystem matters.
Mathlib is the mathematical library for Lean — a community-built collection of hundreds of thousands of formalized mathematical results spanning algebra, analysis, number theory, topology, and more. It’s the largest coherent body of formalized mathematics in existence, and it serves as both a foundation and a precedent: proof that a community can build critical mass in formal knowledge.
But Mathlib is a mathematics library. What about computer science?
CSLib is an emerging effort to extend the formalization paradigm into computer science — formalizing graphs, algorithms, complexity classes, and the core objects that software engineers and CS researchers work with daily. It’s still young, but the direction is significant.
Here’s why it matters: Mathlib proved that formalization works for pure mathematics. CSLib is bringing that same rigor to the domain that founders and engineers care about most. When you formalize CS objects, you create a foundation where AI can reason about algorithms, data structures, and computational complexity with the same provable correctness that Lean provides for mathematics. An AI system that can formally verify a new sorting algorithm or prove optimality of a graph traversal isn’t a research curiosity — it’s directly applicable to building better, more trustworthy software.
The ecosystem is still early — CSLib in particular is in its early chapters. But the trajectory is clear, and the infrastructure layer is being built now. The groups converging on this direction — from the Lean community to research labs to startups building AI reasoning systems — are creating the foundation for something much bigger than any single library.
What This Unlocks
The practical implications of the verification paradigm are already visible:
Trustworthy AI outputs. Outputs guaranteed correct within the formal system. Not “probably right” or “right 90% of the time” — proven correct. For any domain where errors are costly — medicine, finance, aviation, cryptography — this is transformative.
Autonomous exploration. An AI that can verify its own outputs can explore freely. It can generate candidate proofs, check them, discard the failures, and keep the successes — all without human supervision. This is a fundamentally different capability than generating plausible text and hoping a human catches the mistakes.
Closing the formalization bottleneck. Tao has noted that formalizing a proof currently takes roughly 10x the effort of writing it informally. AI systems that can draft formal proofs — even imperfect ones that need refinement — could collapse that ratio. Research into autoformalization (using LLMs to translate informal math into formal Lean proofs) is already making progress on this front, and projects like GPT-f demonstrated as early as 2020 that LLMs can generate formal proofs that Lean accepts.
For founders and technical leaders: the actionable takeaway here is twofold. First, rethink your AI stack — if you’re building systems that need to reason correctly (not just plausibly), a formal verification layer isn’t a luxury, it’s architecture. Second, pay attention to the investment landscape: many of the next wave of breakthroughs and applications will come from verification-layer and neuro-symbolic projects. The teams building at this intersection are building the infrastructure that the rest of the industry will eventually depend on.
But verification is only half the story. Making AI trustworthy is necessary — but the deeper question is what happens when a trustworthy AI system starts creating new knowledge. When it can explore, invent, and verify autonomously — not just check existing work, but generate results that never existed before.
That’s where synthetic universes come in.
This is the second in a series of three posts. Previously: Lost in Translation: What LLMs Actually Understand About Mathematics. Next: Synthetic Universes: How AI Creates Knowledge That Never Existed — on the paradigm of AI-driven mathematical discovery and the path to superintelligence.
References
- Trinh, T. H. et al. (2024). Solving Olympiad Geometry without Human Demonstrations (AlphaGeometry). DeepMind / Nature.
- Polu, S. & Sutskever, I. (2020). Generative Language Models and Automated Theorem Proving (GPT-f). OpenAI.
- Jiang, A. Q. et al. (2022). Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs.
- Wu, Y. et al. (2022). Autoformalization with Large Language Models. Google Research.
- The Mathlib Community. Mathlib: A Unified Library of Mathematics Formalized in Lean 4. leanprover-community.github.io
- CSLib. Emerging formalization library for computer science in Lean.