From Intuition to Axiom: The Story of an AI That Learned to Prove

Mathematics has always been the ultimate test of structured reasoning. While large language models (LLMs) like GPT-4 or DeepSeek can reason through complex text, they often stumble where human mathematicians shine — constructing airtight, verifiable proofs.

In 2025, researchers at ByteDance introduced Seed-Prover, a system that finally bridged this gap. It proved five out of six problems at the International Mathematical Olympiad (IMO 2025) and achieved state-of-the-art results on nearly every major formal mathematics benchmark.

At its heart lies a simple idea: combine the formal precision of Lean, a proof-verification language, with the broad reasoning abilities of modern LLMs. Where previous models relied on intuition-like text reasoning, Seed-Prover reasons formally — step by step, lemma by lemma — and lets the Lean compiler serve as the final judge.

Accompanied by its geometry counterpart Seed-Geometry, which compensates for Lean's long-standing lack of support for competition geometry, the system represents a new frontier: artificial intelligence that doesn't just approximate truth, it proves it.

Introduction

The evolution of AI reasoning has mirrored the progression of mathematics itself — from intuition to formality. Natural language reasoning gave LLMs a remarkable fluency in solving equations or explaining logic, but it lacked verification.

In mathematics, verification is non-negotiable. Every claim must be provably true. This is where formal languages like Lean come in. In Lean, proofs aren’t essays — they are programs that can be compiled, tested, and certified.
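
To make that concrete, here is a toy Lean snippet (ours, not the paper's): the statement is certified the moment the file compiles.

```lean
-- A proof is a program: the Lean compiler either accepts this file,
-- certifying the theorem, or rejects it with an error.
-- `Nat.add_comm` is a lemma from Lean's standard library.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```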

Previous theorem-proving AIs fell into two camps:

| Type | Approach | Limitation |
| --- | --- | --- |
| Step-level provers | Generate Lean proofs line by line | Too slow; often fail mid-proof |
| Whole-proof provers | Generate the entire proof in one pass | No feedback during proof generation |

Seed-Prover merges both. It generates entire proofs but refines them iteratively using Lean’s compiler feedback — essentially letting the AI “debug” its own proofs.
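
A minimal sketch of that debug loop, assuming hypothetical `model.generate` and `lean_compile` interfaces (the paper does not publish its internal APIs):

```python
# Sketch of the generate-compile-repair loop described above. `model` and
# `lean_compile` are hypothetical stand-ins for the LLM call and the Lean
# toolchain; only the compiler's verdict counts as success.

def prove_with_feedback(statement: str, model, lean_compile, max_rounds: int = 8):
    """Whole-proof generation refined by compiler feedback."""
    feedback = ""
    for _ in range(max_rounds):
        proof = model.generate(statement, feedback)  # full proof attempt
        result = lean_compile(statement, proof)      # run the Lean checker
        if result.ok:
            return proof                             # verified proof found
        feedback = result.errors                     # feed errors back in
    return None                                      # out of budget
```

The design point is that the only trusted signal is the compiler's verdict; the model's own confidence never counts.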

The system introduces three defining features:

  1. Lemma-Style Reasoning — the model builds smaller provable components first, then uses them to prove the main theorem.
  2. Iterative Refinement — continuous feedback loops between model and compiler.
  3. Test-Time Scaling — three reasoning tiers (light, medium, heavy) balancing speed and depth.

The result? A model that not only outperforms its predecessors — DeepSeek-Prover-V2, Kimina-Prover, and Goedel-Prover — but one that begins to mirror human mathematical strategy.

Approach

At its core, Seed-Prover rests on two complementary pieces: a verification-driven reinforcement learning loop for training, and a tiered inference strategy for test time.

The Reinforcement Loop

Training formal provers is tricky because there's no "partial credit": either a proof compiles or it doesn't. Seed-Prover uses a reinforcement learning loop based on VAPO, ByteDance's value-model-augmented variant of PPO, where the reward is binary:

\[R = \begin{cases} 1, & \text{if Lean verifies the proof;} \\ 0, & \text{otherwise.} \end{cases}\]

The model updates its parameters to maximize the probability of producing compilable proofs:

\[\mathcal{L} = -\sum_i R_i \log P_\theta(a_i | s_i)\]

It learns through millions of trials — generating, failing, fixing, and refining.
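
In code, one update step might look like the following REINFORCE-style sketch of that loss; the paper's actual optimizer (VAPO) adds a value model and PPO-style machinery that this sketch deliberately omits.

```python
import torch

# Simplified policy-gradient step implementing the loss above with a
# binary Lean reward. Not the paper's VAPO implementation.

def rl_step(optimizer, batch):
    """batch: list of (log_probs, reward) pairs, one per proof attempt.
    log_probs: the policy's per-token log-probabilities for that attempt
    (a tensor that requires grad); reward: 1.0 if Lean verified, else 0.0."""
    terms = [-reward * log_probs.sum() for log_probs, reward in batch if reward > 0]
    if not terms:
        return 0.0                       # no proof compiled: nothing to reinforce
    loss = torch.stack(terms).sum() / len(batch)
    optimizer.zero_grad()
    loss.backward()                      # gradients flow into the policy
    optimizer.step()
    return float(loss)
```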

Three Inference Modes

| Mode | Description | Typical duration |
| --- | --- | --- |
| Light | Quick iterative fixes using compiler feedback | ~1 hour |
| Medium | Nested lemma refinement | ~6–12 hours |
| Heavy | Thousands of conjectures explored in parallel | 2–3 days |

These tiers allow Seed-Prover to adjust its reasoning depth like a human — skimming for easy problems or diving deep into multi-day proofs.
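
One way to picture the tiers is as explicit compute budgets. The field names and numbers below are illustrative guesses, not configuration taken from the paper:

```python
from dataclasses import dataclass

# Illustrative encoding of the three tiers; the budgets mirror the table
# above, but the exact knobs are this sketch's assumption.

@dataclass
class InferenceMode:
    name: str
    max_refine_rounds: int      # compiler-feedback iterations per attempt
    use_lemma_nesting: bool     # medium/heavy: decompose into sub-lemmas
    parallel_conjectures: int   # heavy: breadth of the conjecture search

LIGHT  = InferenceMode("light",  max_refine_rounds=8,   use_lemma_nesting=False, parallel_conjectures=1)
MEDIUM = InferenceMode("medium", max_refine_rounds=64,  use_lemma_nesting=True,  parallel_conjectures=16)
HEAVY  = InferenceMode("heavy",  max_refine_rounds=256, use_lemma_nesting=True,  parallel_conjectures=4096)
```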

Seed-Geometry

Geometry has long been the Achilles’ heel of automated provers. Unlike algebraic manipulation, geometry demands spatial construction and inference.

Seed-Geometry addresses this through a dual-architecture approach: a neural proposal model that imagines possible constructions and a symbolic reasoning engine that verifies them.
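
Schematically, the interaction could look like the loop below; every name in it is a placeholder for components the paper does not expose:

```python
# Sketch of the propose-then-verify split: a neural model suggests auxiliary
# constructions, a symbolic engine does the actual deduction.

def solve_geometry(problem, proposer, engine, max_constructions: int = 32):
    state = engine.load(problem)                  # points, lines, known relations
    for _ in range(max_constructions):
        if engine.goal_reached(state):
            return engine.extract_proof(state)    # certified derivation
        construction = proposer.suggest(state)    # e.g. "midpoint of AB"
        state = engine.apply(state, construction)
        state = engine.saturate(state)            # forward-chain new facts
    return None
```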

Extended Domain-Specific Language

Instead of step-by-step ruler-and-compass moves, Seed-Geometry uses composite geometric actions such as:

\[\text{IsogonalConjugate}(P, \triangle ABC), \quad \text{ExsimilitudeCenter}(O_1, O_2)\]

These high-level abstractions drastically shorten proof chains and make reasoning interpretable.

Performance Highlights

| Benchmark | AlphaGeometry 2 | Seed-Geometry |
| --- | --- | --- |
| IMO-AG-50 | 42/50 | 43/50 |
| IMO Shortlist (2000–2022) | 19/39 | 22/39 |
| IMO 2025 P2 | n/a | Solved in 2 s |

By rewriting its backend in C++, the team achieved a 100× speedup over prior Python implementations. The engine performs forward-chaining until closure:

\[F_{t+1} = F_t \cup \{ r(f) \mid f \in F_t, r \in \mathcal{R} \}\]

This allows complete derivation of geometric relationships before final verification.
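
The closure computation is a classic saturation loop. Below is a direct, simplified Python reading of the equation above (real rules combine several premises, and the production engine is C++):

```python
# Minimal forward-chaining saturation: apply every rule to every known fact
# until no new facts appear. Facts are hashable terms; each rule maps one
# fact to an iterable of derived facts.

def saturate(initial_facts, rules):
    facts = set(initial_facts)
    frontier = set(initial_facts)
    while frontier:                       # stop at the fixpoint
        new_facts = set()
        for fact in frontier:
            for rule in rules:
                for derived in rule(fact):
                    if derived not in facts:
                        new_facts.add(derived)
        facts |= new_facts
        frontier = new_facts              # only re-process fresh facts
    return facts                          # the closure of the relation set
```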

Seed-Prover

While Seed-Geometry handles spatial problems, Seed-Prover dominates algebraic and combinatorial domains.

Its defining characteristic is lemma-style proving — building proofs hierarchically. The model generates intermediate lemmas, proves them independently, then uses them to complete the final theorem.

import Mathlib

lemma L1 (a : ℝ) : a ^ 2 ≥ 0 := sq_nonneg a
lemma L2 (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by nlinarith [sq_nonneg (a - b)]
theorem main (a b : ℝ) : (a - b) ^ 2 ≥ 0 := by nlinarith [L1 a, L2 a b]

Each lemma is stored in a **lemma pool** with metadata on difficulty, dependencies, and success rate. Failed lemmas are retried or reformulated.
Over time, this builds a reusable knowledge base — much like a mathematician’s notebook.
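
A plausible shape for such a pool, with metadata fields guessed from the paper's description rather than taken from an actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical lemma-pool structure: the paper mentions tracking difficulty,
# dependencies, and success rate; the concrete fields here are assumptions.

@dataclass
class LemmaRecord:
    statement: str                      # Lean statement of the lemma
    proof: str | None = None            # verified proof, once found
    attempts: int = 0
    successes: int = 0
    dependencies: list[str] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

class LemmaPool:
    def __init__(self):
        self.records: dict[str, LemmaRecord] = {}

    def record_attempt(self, name: str, statement: str, proof: str | None):
        rec = self.records.setdefault(name, LemmaRecord(statement))
        rec.attempts += 1
        if proof is not None:           # Lean accepted it
            rec.successes += 1
            rec.proof = proof

    def proven(self) -> list[LemmaRecord]:
        return [r for r in self.records.values() if r.proof is not None]
```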

Test-Time Scaling

| Setting | Focus | Example |
| --- | --- | --- |
| Light | Syntax correction | IMO 2022 P2 |
| Medium | Nested refinement | IMO 2025 P5 |
| Heavy | Massive conjecture expansion | IMO 2025 P3–P4 |

Under the heavy setting, Seed-Prover can maintain thousands of active conjectures, refining them iteratively over days.
The final output can span over 1000 lines of Lean code, each line verified by the compiler.

Evaluation

The system was tested across the world’s toughest mathematical benchmarks.

| Dataset | Metric | Previous SOTA | Seed-Prover |
| --- | --- | --- | --- |
| IMO 2025 | Problems solved | 3 (AlphaProof) | 5/6 |
| Past IMO | Success rate | n/a | 78.1% |
| MiniF2F | Valid/test pass rate | 92.2% | 99.6% |
| PutnamBench | Problems solved | 86 | 331/657 |
| CombiBench | Accuracy | 10% | 30% |
| MiniCTX-v2 | Accuracy | 44.3% | 81.8% |

Seed-Geometry complements these achievements with 43 / 50 on IMO-AG-50 and 22 / 39 on shortlist problems.

Together, the systems demonstrate both depth and generalization — solving Olympiad-level puzzles while handling abstract research-grade proofs.

The result is clear: reinforcement learning guided by formal verification is more effective than imitation learning or natural-language reasoning.
Seed-Prover doesn’t just guess; it knows.

Conclusion

Seed-Prover and Seed-Geometry together represent a paradigm shift — from statistical reasoning to formal cognition.
They mark the first time a single AI system has achieved near-complete coverage of Olympiad-level mathematics with provable correctness.

Key Takeaways

In proving 5 of 6 IMO 2025 problems, Seed-Prover didn’t just pass a test — it demonstrated a new kind of machine understanding.

If AlphaGo taught AI to master games, Seed-Prover teaches it to master logic.
The next challenge? Applying this framework to open conjectures, mathematical research, and even physics proofs.

Original Paper

📄 Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving (ByteDance Seed AI4Math, 2025)