Lean-STaR: Learning to Interleave Thinking and Proving

Haohan Lin^2, Zhiqing Sun^1, Yiming Yang^1, Sean Welleck^1
^1 Language Technologies Institute, Carnegie Mellon University
^2 Institute for Interdisciplinary Information Sciences, Tsinghua University
{haohanl,zhiqings,yiming,swelleck}@cs.cmu.edu
https://leanstar.github.io/
Work done during a visit to CMU.

Abstract

Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment, significantly outperforming base models ($43.4\% \rightarrow 46.3\%$, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.

1 Introduction

Theorem proving is a fundamental aspect of mathematics, and mathematical reasoning is an important part of artificial intelligence [28, 55]. Formalized mathematics in particular provides a challenging testbed for assessing mathematical reasoning capabilities.
Since theorems and proofs in this setting can be represented in the form of checkable source code, it is easy to evaluate proofs of arbitrary complexity [15]. Automated theorem proving, if successful, can also help discover unknown errors in previous proofs (for example, Terence Tao found a non-trivial error while using Lean to formalize a project [38]), and make it easier to guarantee that new proofs are correct. More broadly, formal mathematics coupled with powerful automation may unlock new forms of education and collaboration, mathematical insights, and applications to verifying critical software [4, 16, 10, 31].

Recently, language models have shown promising progress in formal theorem proving [33, 35, 47, 19, 25, 50, 26]. Existing approaches typically train a model solely on proofs in a formal language (code) such as Lean [15], Isabelle [29], or Coq [13]. Our key observation is that such approaches ignore a wealth of informal information that may be useful for learning to prove theorems [45, 46]. For instance, the underlying thought process prior to each step of a proof is not present in formal source code. Based on this insight, we propose to train a language model that can produce a natural language chain-of-thought ("thought") prior to each step ("tactic") of a formal proof.

We introduce Lean-STaR, a framework for learning to interleave informal thoughts with steps of formal proving. Building on the Self-Taught Reasoner (STaR) framework [52], we enable language models to interleave step-by-step rationales (i.e., thoughts) [30, 43] with formal proving in a two-stage process. In an initial phase, we prompt a sufficiently capable language model, such as GPT-4 [1], to generate retrospective thoughts based on a dataset of human-written proofs, such as Mathlib, the largest collection of human-written proofs in Lean [27]. We then fine-tune a thought-augmented tactic predictor [8, 7, 18, 14] that, given a Lean state, generates a thought and predicts the subsequent tactic. In a second phase, we optimize this thought-augmented tactic predictor with the expert iteration algorithm [3, 36], using the multi-step success rate in theorem proving as the reward.

Our work presents a new link between informal and formal mathematics, complementary to prior explorations that translate standalone mathematical statements [37, 42, 48] or translate informal proofs into formal proofs [2, 23, 5, 56, 22]. Lean-STaR generates natural language thoughts specifically for each proof step, improving formal proving capabilities by interleaving natural and formal languages. We instantiate Lean-STaR by generating roughly 50,000 thought-augmented examples from Lean's Mathlib [27], then synthesize an additional 50k examples through two iterations of expert iteration. To the best of our knowledge, this yields the first thought-augmented dataset for theorem proving. After fine-tuning an InternLM2-7b base model [51] on our thought-augmented data, our final Lean-STaR model can solve 34.8% (pass@32) or 36.1% (pass@64) of the problems on miniF2F-test [54]. Using the stronger base model InternLM2-7b-plus, Lean-STaR achieves 45.4% (pass@32), significantly surpassing the previous state-of-the-art result of 43.4% (pass@32).
In summary, Lean-STaR offers a framework for teaching language models to interleave informal thoughts with formal verification, advancing the capabilities of language models in automated theorem proving.

Figure 1: Illustration of tactic prediction in one proof step, with and without thought.

2 Related Work

Automatic Theorem Proving & Autoformalization. Previous work on learning-based theorem proving typically follows the GPT-f framework [33], which trains a language model on (proof state, next-tactic) pairs, then proves theorems by using the model within a best-first tree search. Subsequent work has explored several directions, including data augmentation [20], novel proof search methods [25, 41], further training through curriculum learning [34], retrieval augmentation [50], and practical tools [44]. Others use prompted models to generate tactics [6, 39], or fine-tune models to generate a full proof [17]. A second thread, autoformalization [48], incorporates informal mathematics into formal theorem proving. Draft-Sketch-Prove [23] shows that language models can use informal proofs to improve their formal proving abilities, by drafting an informal proof, translating it into a formal proof sketch, then completing the proof with tools like Sledgehammer [8]. Draft-Sketch-Prove and related methods [40, 53, 57] are limited to the Isabelle prover, since they rely on powerful automation tools like Sledgehammer. Lean lacks such tools, so generating an entire proof at once is less likely to succeed in Lean. We focus on Lean, and train language models to generate a thought and predict the subsequent tactic at each proof step. To the best of our knowledge, we are the first to introduce thought-augmented reasoning in automatic theorem proving.

Rationale-augmented Reasoning. Recently, many works have demonstrated that letting language models reason before producing an answer can improve their performance on tasks including math, science, and code [30, 43, 12]. Although the corresponding techniques (e.g., Scratchpad and Chain-of-Thought) have proven effective, they require either extensive annotated training examples or exposure to numerous similar examples during pre-training [9]. The scarcity of natural language reasoning in formal theorem proving, coupled with the impracticality of manually annotating rationales for formal mathematics, thus presents a challenge. We propose the Lean-STaR framework for synthesizing training examples by taking advantage of the correctness signal from the formal system.

Bootstrapping Language Model Reasoning. Several recent works suggest that language models can be taught to reason via synthetic data that they generate themselves, akin to a reinforcement learning method that improves a policy through self-play. Polu et al. [34] showed that a simple RL algorithm, expert iteration, paired with curriculum learning can improve a formal theorem proving model. Self-Taught Reasoner (STaR) [52] showed that iteratively fine-tuning a language model on the correct (reasoning, answer) pairs it generates itself gradually improves performance. Singh et al. [36] proposed ReST-EM, which filters data generated by the language model using a binary feedback signal, rather than using fully manually annotated data (similar to expert iteration in [34]). Our work builds on these ideas, providing the first study of bootstrapped thought-augmented proving.

Figure 2: An example of a Lean proof and thoughts generated by Lean-STaR.
Note that there is a calculation error in the thought (in red), but this does not affect the correctness of the proof, because the calculation task is actually completed by the interactive theorem prover (i.e., Lean's nlinarith) instead of the language model. This shows a benefit of combining neural and symbolic systems.

3 Our Method: Lean-STaR

We introduce Lean-STaR, a new method for combining informal thoughts with formal theorem proving. First, we recap interactive theorem proving (§3.1). Then we present Lean-STaR's data-generation (§3.2.1, §3.2.2) and reinforcement learning (§3.2.3) phases. Finally, we present our evaluation protocols (§3.3).

3.1 Preliminaries

Interactive Theorem Provers (ITPs) are typically used for step-by-step theorem proving in formal mathematics. At each step, we provide the ITP with a high-level "tactic" that reduces the current goal state (e.g., the initial theorem to be proven) into subgoals. These subgoals form new states, and proving all the subgoals results in a complete proof of the given theorem. We use Lean [15], a popular interactive theorem prover. An example formal proof in Lean and its explanation are shown in Appendix C.

### State
a b n : ℕ
⊢ ¬ (n ≠ 0 ∧ a ≠ 0) ↔ n = 0 ∨ a = 0
### Tactic
```lean4
simp only [not_and_or, ne_eq, not_not]
```

### Reasoning
To prove the equivalence between the negation of a conjunction and a disjunction, we utilize logical equivalences, specifically the negation of a conjunction (`¬(P ∧ Q)`) being equivalent to the disjunction of the negations (`P = 0 ∨ Q = 0`) and simplifications related to negation and inequality.

Figure 3: Examples of (input, output) pairs for retrospective rationale generation with GPT-4. The full prompt is given in Appendix F. See §3.2.2 for a detailed explanation.

3.2 Data Generation & Training

We describe the data generation and training of the direct tactic prediction model (SFT), the thought-augmented tactic prediction model trained with synthetic data (Lean-CoT), and the final model trained with expert iteration (Lean-STaR).

3.2.1 Direct Tactic Prediction

We define the theorem-proving problem as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P_a, R_a)$, in which proof states serve as states and tactics serve as actions. From this perspective, a proof is a trajectory $(s_1, a_1, r_1), (s_2, a_2, r_2), \cdots$ of states $s_i$, tactics $a_i$, and rewards $r_i \in \mathbb{R}$, where the ITP (e.g., Lean) provides each new state $s_{i+1}$.
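To make this interaction concrete, the following is a minimal sketch of one rollout in the theorem-proving MDP. The `LeanEnv` class and its `apply_tactic` method are hypothetical stand-ins for an actual Lean interaction layer such as LeanDojo, not our implementation.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    state: str   # pretty-printed proof state, or "" when no goals remain
    legal: bool  # whether Lean accepted the tactic
    done: bool   # True once all goals are closed

class LeanEnv:
    """Hypothetical wrapper around a Lean interaction layer (e.g., LeanDojo)."""
    def __init__(self, theorem_statement: str) -> None: ...
    def initial_state(self) -> str: ...
    def apply_tactic(self, state: str, tactic: str) -> StepResult: ...

def prove(env: LeanEnv, policy, max_steps: int = 50) -> bool:
    """Roll out one proof trajectory; `policy(state)` is any tactic generator,
    e.g., a fine-tuned LM. Reward is 1 iff all goals are eventually closed."""
    state = env.initial_state()
    for _ in range(max_steps):
        tactic = policy(state)                  # a_i sampled from pi_M(a | s_i)
        result = env.apply_tactic(state, tactic)
        if not result.legal:
            return False                        # a search procedure would backtrack here
        if result.done:
            return True                         # reward 1: proof complete
        state = result.state                    # Lean provides s_{i+1}
    return False
```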
In the typical setting [33], proving a theorem consists of providing a proof state $s$ to a language model $M$ and generating a tactic from it, i.e., sampling from $\pi_M(a \mid s)$. The language model can be fine-tuned for this task using a dataset of (proof state, next-tactic) pairs from successful proof trajectories, i.e., $D = \{(s^i, a^i) : i = 1, \cdots, M\}$, where final states have a reward of 1. We refer to a language model fine-tuned on such a dataset as a supervised fine-tuning (SFT) model.

3.2.2 Thought-augmented Tactic Prediction

Existing approaches typically train only on formal states and tactics [33]. We hypothesize that incorporating a latent thought can improve a model's ability to predict the next tactic. Formally, we introduce a hidden "thought" variable $t_i$ prior to each tactic, and extend the model to the form $\pi_M(a_i, t_i \mid s_i) = \pi_M(a_i \mid t_i, s_i)\,\pi_M(t_i \mid s_i)$. In thought-augmented tactic prediction, the distribution over the next tactic can then be expressed as:

$$\pi_M(a_i \mid s_i) = \sum_{t_i} \pi_M(a_i \mid t_i, s_i)\,\pi_M(t_i \mid s_i).$$
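In practice, this marginalization is approximated by sampling: in a single autoregressive pass, the model first generates a thought, then a tactic conditioned on that thought. Below is a minimal sketch of this decoding; the `generate` helper and the delimiter tags are illustrative assumptions, and our exact prompt format may differ.

```python
def generate(model, prompt: str, stop: str, temperature: float = 0.7) -> str:
    """Hypothetical helper: autoregressively decode from `model` until `stop`."""
    ...

def predict_thought_and_tactic(model, state: str) -> tuple[str, str]:
    # One sample from pi_M(t, a | s): thought first, then tactic given thought.
    prompt = f"### State\n{state}\n### Reasoning\n"
    thought = generate(model, prompt, stop="### Tactic")   # t_i ~ pi_M(t | s)
    prompt += thought + "\n### Tactic\n"
    tactic = generate(model, prompt, stop="\n")            # a_i ~ pi_M(a | t, s)
    return thought, tactic
```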
The key challenge is obtaining (state, thought, tactic) triples for training such a model. To this end, we introduce retrospective rationale generation. Our motivating observation is that the distribution of natural language thoughts in theorem proving, $\pi_M(t_i \mid s_i)$, is scarce in the pre-training corpora of large language models. Indeed, we find that even the most powerful GPT-4 model does not perform well at generating correct rationales through few-shot prompting [9]. To develop a language model capable of generating thoughts and tactics $a_i, t_i \mid s_i$, we need a new dataset $D_T = \{(s^i, t^i, a^i) : i = 1, \cdots, N\}$. In Lean, however, we only have a dataset $D_S = \{(s^i, a^i) : i = 1, \cdots, N\}$, where each $(s^i, a^i)$ is one step of a successful proof trajectory. Given a powerful large language model $G$, which we refer to as the oracle model (in our experiments we use the best available large language model, GPT-4), we give the oracle model the ground-truth tactic $a_i$ and let it produce the thought $t_i$ given the current state $s_i$ and the ground-truth tactic $a_i$. This improves the pass rate and produces thought-augmented data more efficiently. Our few-shot prompt is provided in Appendix F. The design principle of the prompt is to prevent the oracle model from generating hindsight-like thoughts. We randomly select $M$ pairs $(s^i, a^i) \in D_S$.
Then the oracle model is used to produce a thought $t^i$ for each pair $(s^i, a^i)$, creating a new dataset $D_T = \{(s^i, t^i, a^i) : i = 1, \cdots, M\}$. With this dataset $D_T$, retrospectively annotated by the oracle model, we obtain our first thought-augmented tactic prediction model, Lean-CoT, by fine-tuning from the SFT model.
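A sketch of this annotation loop is below. The `oracle_annotate` helper is a hypothetical stand-in for a GPT-4 request built from the few-shot prompt of Appendix F; the function names and data layout are illustrative.

```python
import random

def oracle_annotate(oracle, state: str, tactic: str) -> str:
    """Hypothetical: ask the oracle (e.g., GPT-4) for a thought that motivates
    `tactic` at `state`. The prompt is written so the thought reads as forward
    reasoning rather than hindsight."""
    ...

def build_thought_dataset(oracle, D_S, m: int, seed: int = 0):
    """Retrospective rationale generation: (s, a) pairs -> (s, t, a) triples."""
    rng = random.Random(seed)
    sampled = rng.sample(D_S, m)          # randomly select M pairs from D_S
    D_T = []
    for state, tactic in sampled:
        thought = oracle_annotate(oracle, state, tactic)   # t^i given s^i, a^i
        D_T.append((state, thought, tactic))
    return D_T
```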
3.2.3 Bootstrapping Thought-augmented Theorem Proving

We apply expert iteration to further improve the performance of Lean-CoT. Specifically, we start from the initial Lean-CoT model $M_0$ and the initial dataset $D = \{s^i : i = 1, \cdots, M\}$, which consists of the initial states $s^i$ of the theorems to be proved. In iteration 1, we use the model to sample $K$ times per problem. Each time, the model produces a proof trajectory $[(s_0, t_0, a_0), (s_1, t_1, a_1), \cdots, (s_n, t_n, a_n)]$. We then create a new dataset $D_1$ by filtering the generated trajectories to include only the successful ones, and de-duplicating the collected trajectories. We then further fine-tune the SFT model on the dataset $D_T \cup D_1$ to produce the Lean-STaR model $M_1$. In the next iteration, we use $M_1$ as the initial model to produce a dataset $D_2$ and fine-tune again to obtain model $M_2$. This method can be seen as an offline RL method [36] in the theorem-proving MDP. In this MDP, the cumulative reward $R\left((s_0, t_0, a_0), (s_1, t_1, a_1), \cdots, (s_n, t_n, a_n)\right) = 1$ if and only if the proof trajectory is successful. The total expected reward is

$$J(M, D) = \sum_i \mathbb{E}_{(s_0, t_0, a_0), \cdots, (s_n, t_n, a_n) \sim \pi_M(\cdot \mid s^i)}\, R\left((s_0, t_0, a_0), \cdots, (s_n, t_n, a_n)\right),$$

and Lean-STaR's expert iteration can be seen as optimizing this reward [36].
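The loop below sketches this bootstrapping procedure under the definitions above; `sample_trajectory`, `finetune`, and `dedup` are placeholder names for the sampling, training, and de-duplication steps just described, not a released API.

```python
def expert_iteration(sft_model, lean_cot_model, D_T, theorems,
                     sample_trajectory, finetune, dedup,
                     iterations: int = 2, K: int = 32):
    """Lean-STaR phase 2: bootstrap on self-generated, Lean-verified proofs."""
    model = lean_cot_model                      # M_0
    for it in range(iterations):
        successes = []
        for theorem in theorems:
            for _ in range(K):                  # K sampled attempts per problem
                traj, proved = sample_trajectory(model, theorem)
                if proved:                      # keep only trajectories with reward 1
                    successes.append(traj)
        D_i = dedup(successes)                  # de-duplicate collected trajectories
        # Fine-tune the SFT model on the union of D_T and D_i to obtain M_{i+1}.
        model = finetune(sft_model, D_T + D_i)
    return model
```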
Figure 4: The diagram of our pipeline. (1) Produce the CoT dataset with GPT-4. (2) Fine-tune the SFT model on the CoT dataset to obtain Lean-CoT. (3) Use expert iteration to generate the STaR dataset with the model from the previous iteration (Lean-CoT in the first iteration), then fine-tune Lean-CoT on the updated STaR dataset to obtain the model for the next iteration. We repeat this step until a stopping condition is met (e.g., a fixed number of iterations).

Figure 5: A visualization of best-first search ($K=1$) and sampling ($S=1$). The search method maintains a search tree and explores $S$ tactics at each expanded node. The sampling method explores $K$ tactic trajectories from the root and ignores illegal tactics along the trajectories.

3.3 Evaluation Setup.

We evaluate the model on formal theorem proving: given a theorem statement, produce a proof that is correct according to the formal system. This requires an algorithm for producing a full proof by interacting with Lean. Since thought-augmented proving is a new form of theorem-proving system, it is unclear what the best proof-search strategy is in the presence of informal thoughts. Our preliminary experiments indicate that best-first search with beam search does not work well with thoughts in natural-language format. We therefore describe the traditional strategy (best-first search) and our new approach based on sampling.

Best-First Search. The most popular method for evaluating the theorem-proving ability of a language model $M$ is best-first search, as in GPT-f [33, 50, 6, 44]. In best-first search, we keep all unexpanded states $s_i$. Each time, we expand the "best" state $s_i$ and use the language model to sample $S$ next tactics $a_{i,1\cdots S}$ for the current state $s_i$. For each legal tactic $a_{i,j}$, a new state is obtained by applying tactic $a_{i,j}$ to state $s_i$. Following standard practice [33, 50, 44], we take the state with maximum log-probability as the "best": specifically, we select the state $s_i$ with maximum $\sum_{j=0}^{i-1} \log p(a_j \mid s_j)$, where $(s_0, a_0), \cdots, (s_{i-1}, a_{i-1})$ is the proof trajectory leading to state $s_i$ and $\log p(a_j \mid s_j)$ is the average log-probability of the generated tokens. We expand up to $N$ states, and a proof search succeeds when we reach any proof state with no remaining goals. We can attempt the search $K$ times to obtain a $pass@K$ rate. However, we found that best-first search performs poorly with the Lean-CoT and Lean-STaR models. We attribute this to the use of average log-probabilities, which may not be a reliable quality indicator when the thought sequence $t_j$ is also generated.
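A compact sketch of this search follows, using a priority queue keyed by the cumulative average log-probability. The environment interface reuses the hypothetical `LeanEnv` above, and `model.sample_tactics` is an assumed method returning candidate tactics with their average per-token log-probabilities.

```python
import heapq

def best_first_search(env, model, S: int = 32, max_expansions: int = 50):
    """Expand the highest-scoring state first; score = sum of average
    per-token log-probs of the tactics along the trajectory."""
    frontier = [(0.0, env.initial_state())]        # (negated score, state)
    for _ in range(max_expansions):
        if not frontier:
            return False
        neg_score, state = heapq.heappop(frontier)  # the "best" unexpanded state
        # Assumed: sample S candidate tactics with their avg log-probs.
        for tactic, avg_logp in model.sample_tactics(state, n=S):
            result = env.apply_tactic(state, tactic)
            if not result.legal:
                continue                            # discard illegal tactics
            if result.done:
                return True                         # no goals remain: proof found
            heapq.heappush(frontier, (neg_score - avg_logp, result.state))
    return False
```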
Sampling. Motivated by these issues with applying best-first search to thought-augmented proving, we develop a new method based on sampling trajectories in parallel. Our method samples $K$ trajectories in parallel for each problem, each generating at most $N$ tactics, and illegal sampled tactics are ignored during sampling. Specifically, within one sample, suppose the current state is $s_i$, the proof trajectory before $s_i$ is $(s_0, a_0), \cdots, (s_{i-1}, a_{i-1})$, and the sampled tactic is $a_i$. If $a_i$ is a legal tactic, $(s_i, a_i)$ is added to the proof trajectory and we reach the new state obtained by applying tactic $a_i$ to state $s_i$. Otherwise, we discard $a_i$ and use the language model $M$ to sample a new tactic for state $s_i$. We limit the number of tactics generated by the language model $M$ to a total of $N$ within each of the $K$ sampling attempts. The sampling method is thus roughly equivalent to search with $S=1$, except that sampling ignores illegal tactics; accordingly, we take $S=1$ for the sampling method. In this setting, evaluating our sampling method and best-first search with equal $S \times K$ took approximately the same amount of GPU time. This sampling method easily accommodates the hidden "thought" variables $t_j$. Figure 5 compares best-first search and our sampling method.
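A sketch of one sampling attempt under this budget, reusing the hypothetical `predict_thought_and_tactic` helper sketched in §3.2.2:

```python
def sample_attempt(env, model, N: int = 50):
    """One of the K parallel attempts: roll forward from the root, spending a
    budget of N tactic generations and skipping illegal tactics in place."""
    state = env.initial_state()
    for _ in range(N):                      # at most N generations per attempt
        thought, tactic = predict_thought_and_tactic(model, state)
        result = env.apply_tactic(state, tactic)
        if not result.legal:
            continue                        # ignore illegal tactic, resample at s_i
        if result.done:
            return True                     # proof complete
        state = result.state
    return False

def solved_within_k(env_factory, model, K: int = 32) -> bool:
    # A problem counts as solved if any of the K attempts succeeds.
    return any(sample_attempt(env_factory(), model) for _ in range(K))
```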
4 Experiments

We instantiate Lean-STaR using the best available open language model pre-trained on the Lean corpus (InternLM2-Math-base-7b [51]), and follow standard practice in using Lean's Mathlib as the underlying training set (via the LeanDojo dataset [50]). We generate an initial set of thoughts for Mathlib using GPT-4, perform two rounds of expert iteration, then evaluate the model on miniF2F [54], the de-facto standard benchmark for evaluating language-model-based theorem provers. Our experimental results show that both retrospective rationale generation and expert iteration significantly improve the theorem-proving capabilities of language models in this setting. We describe our setup and findings in detail below.

Table 1: Pass rates on the miniF2F-test dataset with Lean, for previous work and ours. $S$ is the number of tactics attempted at each expanded node (assumed to be 1 in sampling) and $K$ is the total number of search or sampling attempts per problem. In sampling we use temperature 0.7; in search we use beam search when generating the next tactic. Note that we sample 32 examples twice when $K=64$ in sampling.

Approach | Decoding | N | K | S | Pass rate
GPT-3.5 [1] (Few-Shot) | Sampling | 50 | 1 | 1 | 2.8%
GPT-4 [1] (Few-Shot) | Sampling | 50 | 1 | 1 | 11.9%
Transformer [34] (w/o RL) | Search | 512 | 1 | 8 | 24.6%
Llemma-34b [6] (Few-Shot) | Search | 50 | 1 | 32 | 25.8%
Llemma-7b [6] (Few-Shot) | Search | 50 | 1 | 32 | 26.2%
ReProver [50] | Search | 50 | 1 | 64 | 26.5%
Transformer [34] (w/ RL) | Search | 512 | 1 | 8 | 29.6%
InternLM2-34b [51] (Few-Shot) | Search | 50 | 1 | 32 | 29.5%
COPRA (with GPT-4) [39] | Customized | - | 60 | 1 | 29.9%
COPRA (with GPT-4) [39] | Customized | - | 100 | 1 | 30.7%
InternLM2-7b [51] (Few-Shot) | Sampling | 50 | 32 | 1 | 28.7%
InternLM2-7b [51] (Few-Shot) | Search | 50 | 1 | 32 | 30.3%
SFT (InternLM2-7b) | Sampling | 50 | 32 | 1 | 29.5%
SFT (InternLM2-7b) | Search | 50 | 1 | 32 | 30.7%
Lean-CoT (InternLM2-7b) | Sampling | 50 | 32 | 1 | 32.8%
Lean-STaR (Iter-1) (InternLM2-7b) | Sampling | 50 | 32 | 1 | 34.0%
Lean-STaR (Iter-2) (InternLM2-7b) | Sampling | 50 | 32 | 1 | 34.8%
Lean-STaR (Iter-2) (InternLM2-7b) | Sampling | 50 | 64 | 1 | 36.1%

4.1 Experimental Setup

We use LeanDojo Benchmark 4 v9 as the supervised fine-tuning (SFT) dataset, containing 231,240 examples. We fine-tune for 1 epoch to obtain the SFT model. For the learning rate, we warm up over the first 20% of steps from 0 to $2 \times 10^{-5}$, followed by a cosine schedule decaying to zero. We randomly select 17,256 distinct successful proof trajectories from the LeanDojo Benchmark 4 dataset [50], and use GPT-4-0125 [32] to annotate 52,438 thoughts from those trajectories. Before annotating, we filter out all proof steps $(s^i, a^i)$ for which $a^i$ contains the newline symbol "\n". We perform two iterations of expert iteration; details are given in Appendix A.1 due to space constraints.

We evaluate our method on the MiniF2F benchmark [54]. We use an evaluation setting similar to previous work [50, 44, 51], but use our sampling method instead of best-first search to evaluate our thought-augmented theorem-proving models, as discussed in §3.3. For each problem, Lean-STaR samples $K=32$ times in parallel with temperature $T=0.7$, limits the number of generated tactics to a total of $N=50$, and limits sampling to 10 minutes per problem. We choose these settings to resemble the inference budget used in our baselines, which follow previous work [44, 6, 51]. Namely, for best-first search baselines we use beam search to generate the next tactic with $S=32, K=1$ [44, 6, 51].
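For reference, the warmup-plus-cosine learning-rate schedule described at the start of this subsection can be written as follows; this is a minimal sketch of the stated schedule with illustrative names, not our training code.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.2) -> float:
    """Warm up linearly from 0 to peak_lr over the first 20% of steps,
    then decay to zero on a cosine schedule."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```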
We do not compare with methods designed for other formal languages, such as Jiang et al. [23] and Xin et al. [49], since language differences greatly influence the pass rate due to differing tactics and automation. We also do not compare with Lample et al. [25], since they only report $S=32, K=64$ with best-first search, which is approximately equivalent to $S=1, K=512$ for the sampling method and is too computationally expensive for us.

Table 2: Counts of problems successfully proved in the miniF2F-test benchmark, split by type and difficulty. The methods use sampling with $N=50, K=32$. Thought-augmented methods improve performance on all categories, while Lean-STaR significantly improves Number Theory performance.

Category | Test Set Size | InternLM2-7b | SFT | Lean-CoT | Lean-STaR (Iter-2)
IMO | 20 | 0 | 0 | 0 | 0
AIME | 15 | 2 | 1 | 2 | 3
AMC | 45 | 3 | 3 | 7 | 5
MATH Algebra Level 5 | 14 | 1 | 2 | 3 | 3
MATH Algebra Level 4 | 14 | 7 | 7 | 7 | 7
MATH Algebra Level 3 | 14 | 9 | 9 | 11 | 11
MATH Algebra Level 2 | 14 | 10 | 10 | 9 | 11
MATH Algebra Level 1 | 14 | 9 | 10 | 10 | 10
MATH Number Theory Level 5 | 16 | 6 | 6 | 6 | 7
MATH Number Theory Level 4 | 11 | 5 | 5 | 4 | 5
MATH Number Theory Level 3 | 11 | 4 | 5 | 5 | 6
MATH Number Theory Level 2 | 11 | 6 | 5 | 5 | 6
MATH Number Theory Level 1 | 11 | 8 | 8 | 9 | 9
CUSTOM Algebra | 18 | 0 | 1 | 1 | 1
CUSTOM Number Theory | 8 | 0 | 0 | 0 | 0
CUSTOM Induction | 8 | 0 | 0 | 1 | 1

4.2 Main Results

Our main results are reported in Table 1. Lean-STaR gives a significant improvement over the previous state of the art in Lean. For instance, with a similar inference budget, Lean-STaR achieves 34.8%, versus 30.3% for InternLM2 [51] using best-first search and 30.7% for COPRA [39] using GPT-4. With a larger compute budget, Lean-STaR's performance improves further to 36.1%.

Table 3: Pass rates of InternLM2-plus-7b models on the miniF2F-test dataset with Lean, for previous work and ours. $S$ is the number of tactics attempted at each expanded node (assumed to be 1 in sampling) and $K$ is the total number of search or sampling attempts per problem. In sampling we use temperature 0.7; in search we use beam search when generating the next tactic. Note that we sample 32 examples twice when $K=64$ in sampling. In search we follow the setup in [51] and limit each search to 20 minutes.

Approach | Decoding | N | K | S | Pass rate
InternLM2-plus-7b [51] (Few-Shot) (from paper) | Search | 1000 | 1 | 32 | 43.4%
InternLM2-plus-7b [51] (Few-Shot) (reproduced) | Search | 1000 | 1 | 32 | 42.6%
InternLM2-plus-7b [51] (Few-Shot) | Sampling | 50 | 32 | 1 | 40.9%
SFT (InternLM2-plus-7b) [51] | Sampling | 50 | 32 | 1 | 41.3%
Lean-CoT (InternLM2-plus-7b) | Sampling | 50 | 32 | 1 | 43.4%
Lean-STaR (Iter-1) (InternLM2-plus-7b) | Sampling | 50 | 32 | 1 | 45.4%
InternLM2-plus-7b [51] (Few-Shot) | Sampling | 50 | 64 | 1 | 42.2%
SFT (InternLM2-plus-7b) [51] | Sampling | 50 | 64 | 1 | 43.4%
Lean-CoT (InternLM2-plus-7b) | Sampling | 50 | 64 | 1 | 45.5%
Lean-STaR (Iter-1) (InternLM2-plus-7b) | Sampling | 50 | 64 | 1 | 46.3%

Thought augmentation improves theorem proving. The first phase of Lean-STaR trains a model to interleave thoughts and tactics, by fine-tuning on a synthesized dataset of thought-augmented examples. The fine-tuned model from this phase, denoted Lean-CoT in Table 1, achieves a pass rate of 32.8%, higher than the model prior to this phase, denoted SFT (29.5%).
We conclude that the first phase of Lean-STaR can improve the theorem-proving ability of a language model, even one already specialized for generating tactics in Lean, such as the SFT model.

Bootstrapping improves thought-augmented theorem proving. The second phase of Lean-STaR consists of generating new thoughts and tactics with the current language model, saving those that result in correct proofs, and training on the union of the initial thought-augmented dataset and the saved examples (i.e., expert iteration [34, 52, 36]); see Appendix A.1 for details. We perform two iterations of expert iteration and present the results in Table 1, denoted Lean-STaR. Each iteration improves the model's theorem-proving performance, from 32.8% (the initial model) to 34.0% (Lean-STaR after iteration 1) to 34.8% (Lean-STaR after iteration 2). Furthermore, we find that the model is amenable to further improvement via additional sampling, achieving 36.1% when the sampling budget is doubled. We conclude that Lean-STaR's second phase can further improve a model's ability to generate thoughts and tactics that lead to correct proofs. We include three qualitative examples in the Appendix, which show the model interleaving thoughts and proof steps.

4.3 Performance Analysis by Types and Difficulties

Tasks in miniF2F-test are manually formalized from Olympiad-type problems, drawn from multiple sources including AIME, AMC, and IMO problems, as well as problems from the MATH dataset [21]. These problems span different types and difficulty levels. Table 2 reports the number of problems successfully proved, partitioned by type and difficulty. We see that Lean-CoT improves performance on difficult problems in all categories, especially those from mathematics competitions. On top of these improvements, Lean-STaR's gains come mainly in Number Theory.

Table 4: Performance of SFT-Direct and our Lean-STaR at different search sizes or sampling budgets $S \times K$. We fix $N=50$. We use beam search in search and temperature $T=0.7$ in sampling when generating the next tactic. We have $K=1$ in search and $S=1$ in sampling. Note that we sample 32 examples twice when $K=64$ in sampling.
S×K | SFT-Direct (Search) | SFT-Direct (Sampling) | Lean-STaR (Iter-2) (Sampling)
1 | 13.5% | 20.9% | 21.7%
2 | 18.0% (+4.5%) | 22.5% (+1.6%) | 24.6% (+2.9%)
4 | 23.3% (+5.3%) | 25.0% (+2.5%) | 27.5% (+2.9%)
8 | 27.5% (+4.2%) | 27.0% (+2.0%) | 30.7% (+3.2%)
16 | 29.9% (+2.4%) | 28.3% (+1.3%) | 33.6% (+2.9%)
32 | 30.7% (+0.8%) | 29.5% (+1.2%) | 34.8% (+1.2%)
64 | 30.7% (+0.0%) | 30.3% (+0.8%) | 36.1% (+1.3%)

4.4 Search and sampling budget

Table 4 reports the pass rate as a function of the search size or sampling budget $S \times K$. We find that Lean-STaR benefits more as $K$ increases, especially when $K$ is relatively large. This suggests that additional sampling with thoughts improves performance, while additional sampling without thoughts saturates. We believe this is because thoughts increase the diversity of outputs and contribute to exploration of the theorem-proving space. Lean-STaR is therefore more scalable in terms of inference-time compute, and may be amenable to further improvements with additional iterations of expert iteration.
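Budget curves like Table 4 can be produced from recorded attempts without rerunning inference at every budget, e.g., with the standard unbiased pass@k estimator. The sketch below is an illustration of that computation, assuming per-problem attempt and success counts; it is not necessarily the exact evaluation code used for the table.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n recorded attempts with c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def budget_curve(results, ks=(1, 2, 4, 8, 16, 32)):
    """results: list of (n_attempts, n_successes) per problem.
    Returns the mean pass@k over problems for each budget k."""
    return {k: sum(pass_at_k(n, c, k) for n, c in results) / len(results)
            for k in ks}
```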
4.5 Experiments with stronger base model and more data

We also instantiate Lean-STaR using a stronger language model (InternLM2-Math-plus-7b [51]), released after the experiments above, following a similar setup. In this experiment, we used 140,000 thoughts annotated by GPT-4o [32] to fine-tune a model ("Lean-CoT"). We then performed one iteration of expert iteration, collecting about 60,000 (proof state, thoughts, next-tactic) pairs, named the "STaR dataset" $D_1$, and further fine-tuned the Lean-CoT model on $D_1$ to obtain the Lean-STaR model. The new results are reported in Table 3. Lean-STaR still gives a significant improvement over the baseline: for instance, Lean-STaR achieves 45.4%, versus 39.8% for InternLM-plus using sampling with a similar inference budget and 43.4% using best-first search with a larger inference budget as reported in [51]. These results show that both retrospective rationale generation and expert iteration improve theorem-proving capabilities on a stronger base model.

4.6 Experiments on expert iteration without CoT

Table 5 shows the results of expert iteration without CoT (i.e., using (state, tactic) pairs only), alongside Lean-CoT and Lean-STaR. Expert iteration alone achieves 43.0%, less than Lean-STaR (45.4%). This shows that Lean-STaR's performance gains do not come from expert iteration alone.

Table 5: Results for InternLM2-plus-7b and our Lean-CoT, Lean-STaR, and expert iteration without CoT. We use sampling with $N=50$, $K=32$, and $T=0.7$.

Approach | # (Continual) Training Data | Pass@32 | Δ
InternLM2-plus-7b (Few-Shot) | - | 40.9% | -
SFT | 231,240 | 41.3% | +0.4%
Lean-CoT | 140,000 | 43.4% | +2.1%
Lean-STaR | 60,000 | 45.5% | +2.1%
Expert iteration (SFT) | 60,000 | 43.0% | +1.7%

5 Conclusion & Limitations

In this paper, we presented Lean-STaR, a novel approach that significantly enhances the theorem-proving capabilities of language models in formal mathematics by integrating Chain-of-Thought (CoT) rationales into each proof step. Our method begins by generating synthetic rationales retrospectively from ground-truth tactics, then fine-tunes the language model to generate these rationales and predict subsequent tactics, resulting in the Lean-CoT model. We further improve this model using expert iteration, fine-tuning it on correct proofs that it samples and verifies using the Lean solver. Our contributions include the introduction of the first thought-augmented theorem-proving dataset, a demonstration that expert iteration can further improve performance, and new state-of-the-art results on the miniF2F-test benchmark, increasing the pass rate from 30.3% to 36.1%. These advances not only improve the accuracy of automated theorem proving, but also offer a scalable and efficient framework for advancing human understanding of mathematics, with potential impacts in education, scientific discovery, and program verification [11, 24, 37, 4, 16, 31].

The primary limitation of our method is that its performance may be constrained by computational scalability. Both Lean-CoT and Lean-STaR were fine-tuned on datasets that are not very large. Additionally, using GPT-4 to generate synthetic data may incur significant cost and possibly introduce biases. Expert iteration may also face a bottleneck due to CPU and IO limitations, since the Lean ITP is slow.

References

Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Agrawal et al. [2022] Ayush Agrawal, Siddhartha Gadgil, Navin Goyal, Ashvni Narayanan, and Anand Tadipatri. Towards a mathematics formalisation assistant using large language models. arXiv preprint arXiv:2211.07524, 2022.
Anthony et al. [2017] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in Neural Information Processing Systems, 30, 2017.

Avigad [2023] Jeremy Avigad. Mathematics and the formal turn, 2023.

Azerbayev et al. [2023a] Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W Ayers, Dragomir Radev, and Jeremy Avigad. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433, 2023a.

Azerbayev et al. [2023b] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023b.

Blanchette et al. [2016] Jasmin Christian Blanchette, Cezary Kaliszyk, Lawrence C Paulson, and Josef Urban. Hammering towards QED. Journal of Formalized Reasoning, 9(1):101–148, 2016.

Bohme and Nipkow [2010] Sascha Bohme and Tobias Nipkow. Sledgehammer: judgement day. In Automated Reasoning: 5th International Joint Conference, IJCAR 2010, Edinburgh, UK, July 16-19, 2010, Proceedings 5, pages 107–121. Springer, 2010.

Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Buzzard [2024] Kevin Buzzard. Lean in 2024. https://xenaproject.wordpress.com/2024/01/20/lean-in-2024/, 2024.

Carter and Monks [2013] Nathan C Carter and Kenneth G Monks. Lurch: a word processor that can grade students' proofs. In CICM Workshops, 2013.

Chen et al. [2022] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.

Coq [1996] Projet Coq. The Coq proof assistant reference manual. INRIA Rocquencourt and ENS Lyon, version 5, 1996.

Czajka and Kaliszyk [2018] Lukasz Czajka and Cezary Kaliszyk. Hammer for Coq: Automation for dependent type theory. Journal of Automated Reasoning, 61:423–453, 2018.

De Moura et al. [2015] Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The Lean theorem prover (system description). In Automated Deduction - CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25, pages 378–388. Springer, 2015.

First [2023] Emily First. Automating the Formal Verification of Software. PhD thesis, 2023. URL https://scholarworks.umass.edu/dissertations_2/2812.

First et al. [2023] Emily First, Markus Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pages 1229–1241, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400703270. doi: 10.1145/3611643.3616243. URL https://doi.org/10.1145/3611643.3616243.

Gloeckle et al. [2023] Fabian Gloeckle, Baptiste Roziere, Amaury Hayat, and Gabriel Synnaeve. Temperature-scaled large language models for Lean proofstep prediction. In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS'23, 2023.
Han et al. [2021] Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. arXiv preprint arXiv:2102.06203, 2021.

Han et al. [2022] Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=rpxJc9j04U.

Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

Huang et al. [2024] Yinya Huang, Xiaohan Lin, Zhengying Liu, Qingxing Cao, Huajian Xin, Haiming Wang, Zhenguo Li, Linqi Song, and Xiaodan Liang. MUSTARD: Mastering uniform synthesis of theorem and proof data. arXiv preprint arXiv:2402.08957, 2024.

Jiang et al. [2022] Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothee Lacroix, Yuhuai Wu, and Guillaume Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022.

Kang et al. [2020] Dongyeop Kang, Andrew Head, Risham Sidhu, Kyle Lo, Daniel S Weld, and Marti A Hearst. Document-level definition detection in scholarly documents: Existing models, error analyses, and future directions. arXiv preprint arXiv:2010.05129, 2020.

Lample et al. [2022] Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. HyperTree proof search for neural theorem proving. Advances in Neural Information Processing Systems, 35:26337–26349, 2022.

Li et al. [2024] Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. A survey on deep learning for theorem proving, 2024.

mathlib Community [2020] The mathlib Community. The Lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, pages 367–381, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450370974. doi: 10.1145/3372885.3373824. URL https://doi.org/10.1145/3372885.3373824.

Newell and Simon [1956] Allen Newell and Herbert Simon. The logic theory machine: A complex information processing system. IRE Transactions on Information Theory, 2(3):61–79, 1956.

Nipkow et al. [2002] Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson. Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer, 2002.

Nye et al. [2021] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.

National Academies of Sciences [2023] National Academies of Sciences. Artificial intelligence to assist mathematical reasoning: Proceedings of a workshop, 2023.

OpenAI [2023] OpenAI. OpenAI: GPT-4, 2023. URL https://openai.com/research/gpt-4.

Polu and Sutskever [2020] Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.

Polu et al. [2022] Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344, 2022.
Rabe et al. [2020] Markus N. Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020.
Singh et al. [2023] Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.
Szegedy [2020] Christian Szegedy. A promising path towards autoformalization and general artificial intelligence. In Intelligent Computer Mathematics: 13th International Conference, CICM 2020, Bertinoro, Italy, July 26–31, 2020, Proceedings, pages 3–20. Springer, 2020.
Tao [2023] Terence Tao. https://mathstodon.xyz/@tao/111287749336059662, 2023.
Thakur et al. [2023] Amitayush Thakur, Yeming Wen, and Swarat Chaudhuri. A language-agent approach to formal theorem-proving. arXiv preprint arXiv:2310.04353, 2023.
Wang et al. [2023a] Haiming Wang, Huajian Xin, Chuanyang Zheng, Lin Li, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, Heng Liao, and Xiaodan Liang. LEGO-Prover: Neural theorem proving with growing libraries, 2023a.
Wang et al. [2023b] Haiming Wang, Ye Yuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, et al. DT-Solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12632–12646, 2023b.
Wang et al. [2020] Qingxiang Wang, Chad Brown, Cezary Kaliszyk, and Josef Urban. Exploration of neural machine translation in autoformalization of mathematics in Mizar. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, pages 85–98, 2020.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Welleck and Saha [2023] Sean Welleck and Rahul Saha. llmstep: LLM proofstep suggestions in Lean. arXiv preprint arXiv:2310.18457, 2023.
Welleck et al. [2021] Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. NaturalProofs: Mathematical theorem proving in natural language. arXiv preprint arXiv:2104.01112, 2021.
Welleck et al. [2022] Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. NaturalProver: Grounded mathematical proof generation with language models. Advances in Neural Information Processing Systems, 35:4913–4927, 2022.
Wu et al. [2021] Yuhuai Wu, Markus N. Rabe, Wenda Li, Jimmy Ba, Roger B. Grosse, and Christian Szegedy. LIME: Learning inductive bias for primitives of mathematical reasoning. In International Conference on Machine Learning, pages 11251–11262. PMLR, 2021.
Wu et al. [2022] Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. Advances in Neural Information Processing Systems, 35:32353–32368, 2022.
Xin et al. [2023] Huajian Xin, Haiming Wang, Chuanyang Zheng, Lin Li, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, et al. LEGO-Prover: Neural theorem proving with growing libraries. arXiv preprint arXiv:2310.00656, 2023.
Yang et al. [2023] Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. In Neural Information Processing Systems (NeurIPS), 2023.
Ying et al. [2024] Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin. InternLM-Math: Open math large language models toward verifiable reasoning, 2024.
Zelikman et al. [2022] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
Zhao et al. [2024] Xueliang Zhao, Wenda Li, and Lingpeng Kong. Decomposing the enigma: Subgoal-based demonstration learning for formal theorem proving, 2024. URL https://openreview.net/forum?id=xLoxMvO695.
Zheng et al. [2021] Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: A cross-system benchmark for formal Olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021.
Zhou [2023] Denny Zhou. Teach language models to reason. https://dennyzhou.github.io/LLMs-Reason-Taiwan-2023.pdf, 2023. Accessed: 2024-05-21.
Zhou et al. [2024a] Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, and Yuhuai Wu. Don't trust: Verify. Grounding LLM quantitative reasoning with autoformalization. arXiv preprint arXiv:2403.18120, 2024a.
Zhou et al. [2024b] Jin Peng Zhou, Charles E. Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, and Yuhuai Wu. Don't trust: Verify. Grounding LLM quantitative reasoning with autoformalization. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=V5tdi14ple.

Appendix A Additional Experiment Setup

A.1 Lean-STaR Expert Iteration

The second phase of Lean-STaR consists of generating new thoughts and tactics with the current language model, saving those that result in correct proofs, and training on the union of the initial thought-augmented dataset and the saved examples (i.e., expert iteration [34, 52, 36]). We perform two iterations of expert iteration, and provide details on our specific experimental setup below.

In each iteration we use sampling on the LeanDojo Benchmark 4 dataset, and save the (state, thought, tactic) examples that are part of successful proofs. For each problem, we sample K = 32 times in parallel with temperature T = 1.0, and limit the number of times a tactic can be generated to a total of N = 5 per problem. Sampling is also limited to 1 minute per problem; in this setup, each problem needs on average about 0.5 A100-minutes. We collect the successfully sampled trajectories to produce a "STaR dataset" D_1, keeping up to 3 proof trajectories per problem. In total, we collected 32,231 distinct (proof state, thought, next-tactic) pairs from successful proof trajectories during expert iteration, which takes about 4 days on 8 A100 GPUs.
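In outline, the collection step of each iteration looks like the following. This is a minimal sketch rather than the authors' code: `sample_step` and `apply_tactic` are hypothetical stand-ins for the language-model sampler and the Lean (LeanDojo-style) proof environment, the parallelism and the 1-minute wall-clock limit are omitted, and the N = 5 budget is interpreted here as a per-attempt cap on generated tactics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces standing in for the sampler and the Lean checker.
SampleFn = Callable[[str], Tuple[str, str]]              # state -> (thought, tactic)
ApplyFn = Callable[[str, str], Optional["StepResult"]]   # (state, tactic) -> result, or None on tactic error

@dataclass
class StepResult:
    state: str    # pretty-printed next proof state
    solved: bool  # True if no goals remain

def collect_star_dataset(
    initial_states: List[str],
    sample_step: SampleFn,
    apply_tactic: ApplyFn,
    k: int = 32,          # K = 32 sampled attempts per problem (parallel in the paper)
    max_steps: int = 5,   # N = 5 tactic generations (read here as a per-attempt step cap)
    max_proofs: int = 3,  # keep at most 3 successful trajectories per problem
) -> List[Tuple[str, str, str]]:
    """Collect (proof state, thought, tactic) triples from successful proofs only."""
    dataset: List[Tuple[str, str, str]] = []
    for init in initial_states:
        proofs_found = 0
        for _ in range(k):
            if proofs_found >= max_proofs:
                break
            state, trajectory, solved = init, [], False
            for _ in range(max_steps):
                thought, tactic = sample_step(state)   # sampled at temperature T = 1.0
                result = apply_tactic(state, tactic)   # verified by the Lean checker
                if result is None:                     # tactic failed: discard this attempt
                    break
                trajectory.append((state, thought, tactic))
                if result.solved:
                    solved = True
                    break
                state = result.state
            if solved:                                 # only verified proofs enter D_1
                dataset.extend(trajectory)
                proofs_found += 1
    return dataset
```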
Then, we further fine-tune the SFT model for 1 epoch on the combination of the GPT-4-annotated reasoning data and the expert-iteration data, D_T ∪ D_1, to obtain the Lean-STaR model. We use the same learning-rate setup that was used for the SFT model. In the second iteration, we generate a dataset D_2 in a similar fashion, and then further fine-tune the model from iteration 1, M_1, on the generated dataset D_2 (roughly 19k pairs). The setup of the InternLM2-plus experiments is slightly different; the details are given in Section 4.5 and Appendix E.

Appendix B Statistics for our methods as well as the baselines

Table 6: Statistics for the baselines and our Lean-CoT and Lean-STaR models on the MiniF2F dataset. We use the sampling method with hyperparameters N = 50, K = 32, T = 0.7.

Approach | # (Continual) Training Data | Pass@32 | Improvement
InternLM2-Math-7b (Few-Shot) | - | 28.7% | -
SFT | 231,240 | 29.5% | +0.8%
Lean-CoT | 52,438 | 32.8% | +3.3%
Lean-STaR (Iter-1) | 32,231 | 34.0% | +1.2%
Lean-STaR (Iter-2) | 19,324 | 34.8% | +0.8%

Appendix C An Example and Explanation of A Formal Proof in Lean

An example of a formal proof in Lean, together with its visualization, is shown in Figure 6, taken from [25]. In the proof, the tactic induction k is applied to the initial state (n ≤ m ⇒ n + k ≤ m + k), and the ITP converts the current state into two subgoals, case 0 and case ih: n ≤ m ∧ n + k ≤ m + k ⇒ n + (k + 1) ≤ m + (k + 1). The case 0 goal, n ≤ m, is exactly our hypothesis h0, so it can be closed by the tactic exact h0. Then, we rewrite the case ih goal with Nat.succ_le_succ_iff, a theorem in the Lean library stating that n ≤ m ⇔ n + 1 ≤ m + 1. After the tactics exact h0 (case 0) and rw [Nat.succ_le_succ_iff] (case ih), the goal state becomes n + k ≤ m + k, which is the induction hypothesis introduced by induction. Therefore, we can complete the proof with the tactic exact ih.

```lean4
theorem add_le_add_right (m n k : ℕ) (h0 : n ≤ m) : n + k ≤ m + k := by
  induction k with
  | zero => exact h0
  | succ k ih =>
    rw [Nat.succ_le_succ_iff]
    exact ih
```

Figure 6: An example proof, with its visualization, of n ≤ m ⇒ n + k ≤ m + k in Lean, taken from [25].

The induction tactic reduces the initial statement to two subgoals. The tactics exact h0 (case 0), then rw [Nat.succ_le_succ_iff] followed by exact ih (case ih), can be applied in turn to complete the proof.
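To make these state transitions explicit, the same proof is reproduced below with the intermediate goals from the explanation above written as comments; the tactic script itself is unchanged from Figure 6.

```lean4
theorem add_le_add_right (m n k : ℕ) (h0 : n ≤ m) : n + k ≤ m + k := by
  -- Initial goal: n + k ≤ m + k; we induct on k.
  induction k with
  | zero =>
    -- Goal: n + 0 ≤ m + 0, i.e. n ≤ m, which is exactly the hypothesis h0.
    exact h0
  | succ k ih =>
    -- Induction hypothesis ih : n + k ≤ m + k
    -- Goal: n + (k + 1) ≤ m + (k + 1), i.e. Nat.succ (n + k) ≤ Nat.succ (m + k).
    rw [Nat.succ_le_succ_iff]
    -- Goal after the rewrite: n + k ≤ m + k, which is exactly ih.
    exact ih
```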
Appendix D Performance Analysis by Types and Difficulties using InternLM2-plus-7b

Table 7 reports the number of problems successfully proved, partitioned by type and difficulty, using InternLM2-plus. We see that Lean-CoT improves performance mainly in Number Theory, while Lean-STaR improves performance on difficult problems across all categories, which is the opposite of the pattern observed with InternLM2-base.

Table 7: Counts of problems successfully proved in the miniF2F-test benchmark using InternLM2-plus-7b, split by type and difficulty. The methods use sampling with N = 50, K = 32.

Category | Test Set Size | InternLM2-plus-7b | Lean-CoT | Lean-STaR (Iter-1)
IMO | 20 | 0 | 0 | 0
AIME | 15 | 3 | 3 | 4
AMC | 45 | 9 | 9 | 10
MATH Algebra, Level 5 | 14 | 6 | 6 | 6
MATH Algebra, Level 4 | 14 | 9 | 9 | 9
MATH Algebra, Level 3 | 14 | 11 | 13 | 13
MATH Algebra, Level 2 | 14 | 11 | 11 | 11
MATH Algebra, Level 1 | 14 | 10 | 10 | 10
MATH Number Theory, Level 5 | 16 | 7 | 7 | 7
MATH Number Theory, Level 4 | 11 | 6 | 8 | 8
MATH Number Theory, Level 3 | 11 | 6 | 7 | 9
MATH Number Theory, Level 2 | 11 | 7 | 9 | 9
MATH Number Theory, Level 1 | 11 | 10 | 10 | 10
CUSTOM Algebra | 18 | 4 | 3 | 4
CUSTOM Number Theory | 8 | 0 | 0 | 0
CUSTOM Induction | 8 | 1 | 1 | 1

Appendix E Performance difference of joint training and continue training

As shown in Table 8, joint training performs better with InternLM2-base, but continued training performs much better with InternLM2-plus. Neither method is consistently better; the difference likely depends on the quantity of training data and on the base model. (We use much more data with InternLM2-plus, so the amount of "STaR data" is relatively small in comparison.)

Table 8: Performance difference of joint training and continued training on Lean-STaR. We use the sampling method with hyperparameters N = 50, K = 32, T = 0.7. In continued training, we further fine-tune the Lean-CoT model on the "STaR data" to obtain the Lean-STaR model; in joint training, we fine-tune the SFT model on the combination of the GPT-4-annotated reasoning data and the "STaR data".

Approach | InternLM2-base-7b | InternLM2-plus-7b
Lean-CoT | 32.8% | 43.4%
Lean-STaR (Iter-1, joint training) | 34.0% | 43.9%
Lean-STaR (Iter-1, continued training) | 33.2% | 45.5%
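Concretely, the two regimes compared in Table 8 differ only in which checkpoint sees which data. The sketch below makes this explicit, assuming a generic fine_tune(model, dataset, epochs) helper (a hypothetical name; both regimes are ordinary supervised fine-tuning):

```python
# Minimal sketch of the two regimes compared in Table 8. `fine_tune` is a
# hypothetical helper (model, dataset, epochs) -> fine-tuned checkpoint;
# gpt4_thought_data is D_T and star_data is D_1 in the paper's notation.

def continued_training(sft_model, gpt4_thought_data, star_data, fine_tune):
    """SFT model -> Lean-CoT on D_T, then Lean-CoT -> Lean-STaR on D_1."""
    lean_cot = fine_tune(sft_model, gpt4_thought_data, epochs=1)
    return fine_tune(lean_cot, star_data, epochs=1)

def joint_training(sft_model, gpt4_thought_data, star_data, fine_tune):
    """SFT model fine-tuned once on the union D_T ∪ D_1."""
    return fine_tune(sft_model, gpt4_thought_data + star_data, epochs=1)
```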
Appendix F Retrospective Rationale Generation Prompt of GPT-4

Please act as a professional mathematician. Your goal is to accurately prove a math theorem in Lean4. You are given the first tactic that should be taken to prove the Given Theorem.

# Example:
Given Theorem:
```lean4
X : Type u_1
Y : Type u_2
inst✝¹ : TopologicalSpace X
inst✝ : TopologicalSpace Y
f : X →. Y
h : PContinuous f
⊢ IsOpen (PFun.preimage f Set.univ)
```
Given Tactic:
```lean4
exact h _ isOpen_univ
```
Assistant Output:
### PLANNING OF REASONING
Given the above goal and the given the target tactic `exact h _ isOpen_univ`, I should produce a reasoning that only based on the goal, and conclude to the `exact h _ isOpen_univ`, but not in the hindsight format.
### REASONING
By directly using the hypothesis `h : PContinuous f`, we state that for any open set, its preimage via `f` is also open, and this applies to `Set.univ` as well.
### TACTIC
```lean4
exact h _ isOpen_univ
```

To achieve the goal, you have three jobs.
# Write down the planning of reasoning.
# Write down the reasoning that can result in the selection of the Given Tactic.
# Write down the Given Tactic.

You have four principles to do this.
# Craft reasoning that is comprehensive yet concise, contained ideally within two sentences.
# Begin reasoning with the objective of the Given Theorem, ensuring logical progression, rather than constructing it retrospectively from the Given Tactic.
# Avoid mentioning the Given Tactic and specific statement names from the Given Tactic throughout your reasoning.
# Position the conclusion of the tactic after presenting all reasons, ensuring that details regarding the Given Tactic are not introduced at the start of your reasoning.

Your output should be strictly in the following format and should not contain extra content:
### PLANNING OF REASONING
Given the above goal and the given the target tactic , I should produce a reasoning that only based on the goal, and conclude to the exact , but not in the hindsight format.
### REASONING
### TACTIC

# Given Theorem:
```lean4
{theorem["state_before"]}
```
# Given Tactic:
```lean4
{theorem["tactic"]}
```
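As an illustration, this prompt could be driven by a short annotation script such as the sketch below. The OpenAI chat-completions client usage is real, but PROMPT_TEMPLATE, the {state_before}/{tactic} slot names, and the output parsing are assumptions based on the prompt's displayed format, not the authors' released code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical template: the Appendix F prompt above, with format slots for
# the proof state and the ground-truth tactic (slot names are assumptions).
PROMPT_TEMPLATE = "...Appendix F prompt with {state_before} and {tactic} slots..."

def annotate_with_thought(state_before: str, tactic: str, model: str = "gpt-4") -> str:
    """Ask the model for a retrospective rationale for one (state, tactic) pair."""
    prompt = PROMPT_TEMPLATE.format(state_before=state_before, tactic=tactic)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content
    # Keep only the text between '### REASONING' and '### TACTIC' as the thought;
    # this assumes the model followed the output format required by the prompt.
    return output.split("### REASONING")[1].split("### TACTIC")[0].strip()
```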
Appendix G Examples of generated Lean proofs with thoughts

```lean4
theorem aime_1990_p4 (x : ℝ) (h0 : 0 < x) (h1 : x ^ 2 - 10 * x - 29 ≠ 0)
    (h2 : x ^ 2 - 10 * x - 45 ≠ 0) (h3 : x ^ 2 - 10 * x - 69 ≠ 0)
    (h4 : 1 / (x ^ 2 - 10 * x - 29) + 1 / (x ^ 2 - 10 * x - 45) - 2 / (x ^ 2 - 10 * x - 69) = 0) :
    x = 13 := by
  norm_num at h1 h2 h3
  -- Reasoning: To prove `x = 13`, we examine the conditions under which the given equation holds, specifically focusing on the non-equality of the expressions involving `x ^ 2 - 10 * x - 29`, `x ^ 2 - 10 * x - 45`, and `x ^ 2 - 10 * x - 69`. Simplifying these expressions using properties of real numbers and algebra allows us to directly compare them with their respective values, ultimately identifying `x = 13` as the unique solution that satisfies all constraints.
  field_simp at *
  -- Reasoning: To prove \(x = 13\), we leverage the given equalities involving rational fractions of \(x^2 - 10x - c\) for various constants \(c\), where \(0 < x\) and each \(x^2 - 10x - c\) is non-zero. By algebraically manipulating these expressions, we aim to derive a relationship among \(x\) and these constants, specifically focusing on \(13\), under the given non-zero conditions for the quadratic expressions.
  ring_nf at h4 ⊢
  -- Reasoning: To establish the equality `x = 13` under the given conditions, we need to manipulate the algebraic expression given by `h4` to reveal its dependence on `x`, recognizing that the presence of `x` in the expressions may affect the equation's truth value. The application of algebraic transformations on both sides of the equation will facilitate the identification of `x`'s value, taking into consideration the inequality conditions provided (`h1`, `h2`, `h3`).
  nlinarith
  -- Reasoning: To prove \(x = 13\), we must manipulate the equation given by \(h4\) to isolate \(x\), using the algebraic properties of equality and the assumption \(h0\) which states \(0 < x\). This requires careful rearrangement and application of algebraic identities to simplify and rearrange terms, facilitating the direct application of the properties of real numbers to reach the desired conclusion.
```

```lean4
theorem amc12a_2002_p6 (n : ℕ) (h0 : 0 < n) : ∃ m, m > n ∧ ∃ p, m * p ≤ m + p := by
  use n + 1, by linarith only [h0]
  -- Reasoning: To demonstrate the existence of such an `m` and `p` that satisfy the given inequalities, we leverage the property of natural numbers that ensures an increase in `n` from `0` leads to the possibility of finding such `m` and `p`. This approach is supported by the fundamental property of natural numbers and their arithmetic operations, suggesting that for a given `n` greater than `0`, there exists a successor `m` and a number `p` that satisfy the conditions outlined.
  exact ⟨0, by simp⟩
  -- Reasoning: To demonstrate that there exists an \(n \in \mathbb{N}\) such that \((n + 1) * p \leq n + 1 + p\) holds for some \(p \in \mathbb{N}\), we recognize that setting \(p = 0\) satisfies the inequality by the properties of zero in arithmetic. This is because \(n + 1 > 0\) by assumption, ensuring the left-hand side is non-negative and the right-hand side simplifies to \(n + 1\), making the inequality trivially true.
```

```lean4
theorem aime_1983_p2 (x p : ℝ) (f : ℝ → ℝ) (h0 : 0 < p ∧ p < 15) (h1 : p ≤ x ∧ x ≤ 15)
    (h2 : f x = abs (x - p) + abs (x - 15) + abs (x - p - 15)) : 15 ≤ f x := by
  cases' h1 with h1 h2
  -- Reasoning: Given a function \(f\) defined over the real numbers, where \(f(x)\) equals the sum of the absolute values of the differences \(x - p\), \(x - 15\), and \(x - p - 15\), along with conditions on \(p\) and \(x\) that ensure \(x\) lies between \(p\) and \(15\) and \(p\) is positive, we aim to show that \(f(x) \geq 15\). To demonstrate this, we consider all possible cases for the position of \(x\) relative to \(p\) and \(15\), specifically at \(p\) itself, between \(p\) and \(15\), and at \(15\). By evaluating \(f(x)\) for each case, we can establish a lower bound of \(f(x)\) that is greater than or equal to \(15\).
  simp [h0, abs_of_nonneg (sub_nonneg.2 h1), abs_of_nonneg (sub_nonneg.2 h2), *]
  -- Reasoning: The objective is to show that \(f(x)\) is at least 15 for \(x\) within the bounds \(p \le x \le 15\), where \(p\) is positive but less than 15. The provided expression for \(f(x)\) suggests using properties of absolute values and inequalities to simplify and analyze \(f(x)\). By applying simplification rules for absolute values and inequalities, along with the given bounds, we aim to demonstrate the desired inequality \(f(x) \ge 15\).
  simp only [abs_of_nonneg] at *
  -- Reasoning: The objective is to establish an inequality involving an expression with absolute values and a sum, where all terms are functions of \(x\). By recognizing that the absolute value functions are always non-negative, we can simplify the given expression by removing the absolute value symbols, thus focusing on the arithmetic manipulation of the terms to demonstrate the inequality.
  simp only [abs_of_nonneg, sub_add_eq_add_sub] at *
  -- Reasoning: The objective is to demonstrate that the expression `15 ≤ x - p + |x - 15| + |x - p - 15|` holds given the conditions on `p`, `x`, and the function `f`. Since the expression in question involves absolute values, which inherently involve non-negativity, manipulating these expressions to highlight their non-negativity or equality is a logical initial step. This involves recognizing that the absolute value of a difference can be rewritten as the difference of absolute values, which simplifies the manipulation of the expression to prove the inequality.
  congr 1
  -- Reasoning: To establish the inequality `15 ≤ x + |x - 15| + |x - p - 15| - p`, we recognize that the left-hand side of the inequality involves a mix of additions and subtractions of real numbers. By introducing a congruence transformation that simplifies operations involving absolute values, we aim to align both sides of the inequality in a form that directly reflects the properties of real numbers under these operations.
  simp only [abs_of_nonpos (sub_nonpos_of_le h2), abs_of_nonneg (sub_nonneg_of_le h1)]
  -- Reasoning: The objective is to prove an inequality involving absolute values and subtractions that aligns the given conditions and constraints. Simplification using the properties of absolute values and the provided bounds can directly lead to the desired conclusion by expressing the absolute values in terms of their underlying expressions and applying the given constraints.
  simp [abs_of_pos, *]
  -- Reasoning: To prove the given inequality involving absolute values, simplification using algebraic properties and given hypotheses will directly lead us to the desired inequality by recognizing the algebraic manipulation and application of the known bounds `h1` and `h2`, as well as the non-negativity of `p`.
  rw [abs_of_nonpos]
  -- Reasoning: The objective is to demonstrate that `15 ≤ 15 + |x - p - 15| - p` under the given constraints on `p` and `x`. Given `p < 15` and `p ≤ x ≤ 15`, it is apparent that `p - 15` is non-positive. This observation allows us to apply the property that the absolute value of a non-positive number is equal to its negative value, thereby simplifying the expression on the right-hand side of the inequality.
  all_goals linarith
  -- Reasoning: The goal involves demonstrating a relationship between expressions involving inequalities and subtraction. The simplification of these expressions into a form that directly compares their numerical values can lead to a straightforward application of known inequalities and algebraic properties, demonstrating the inequality's validity under the given conditions.
```
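Outputs in the interleaved format above can be split back into (thought, tactic) pairs before re-checking the tactics in Lean. The sketch below assumes the `-- Reasoning:` comment convention shown in these examples, with each thought kept on a single comment line immediately after its tactic.

```python
def split_thoughts_and_tactics(proof_script: str) -> list[tuple[str, str]]:
    """Split an interleaved proof (as displayed above) into (thought, tactic) pairs.

    Assumes each tactic line is immediately followed by a single-line
    '-- Reasoning: ...' comment; theorem-statement lines are ignored because
    they are never directly followed by a Reasoning comment."""
    pairs: list[tuple[str, str]] = []
    pending_tactic = None
    for raw in proof_script.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-- Reasoning:"):
            if pending_tactic is not None:
                thought = line[len("-- Reasoning:"):].strip()
                pairs.append((thought, pending_tactic))
                pending_tactic = None
        elif not line.startswith("--"):
            pending_tactic = line  # most recent non-comment line
    return pairs
```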