Formal theorem proving, in languages such as Lean or Isabelle, represents a frontier for both mathematics and large language models (LLMs). Its significance is twofold: first, rigorously verified proofs add to the bedrock of mathematical certainty and can open new avenues of mathematical discovery; second, the task sits at the cutting edge of reasoning for modern AI methods. From an application perspective, the ability of language models to carry out this kind of complex reasoning has implications far beyond pure mathematics: the skills honed in theorem proving, such as decomposing a problem and sustaining a long chain of logically sound steps, carry over to AI agents making critical decisions and to systems analyzing intricate scenarios, pushing the boundaries of what LLMs can achieve in logical reasoning and problem-solving.
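To make the task concrete, here is a toy machine-checked proof in Lean 4 (the Isabelle analogue is similar). It is purely illustrative, far simpler than the competition problems discussed below, and not drawn from SubgoalXL's outputs; but the guarantee is the same: every step is verified by the proof checker.

```lean
-- Toy machine-checked proof: the sum of two even numbers is even.
theorem even_add_even (m n : Nat)
    (hm : ∃ a, m = 2 * a) (hn : ∃ b, n = 2 * b) :
    ∃ c, m + n = 2 * c :=
  match hm, hn with
  -- Exhibit the witness c := a + b; `omega` discharges the arithmetic,
  -- and the kernel certifies the whole proof.
  | ⟨a, ha⟩, ⟨b, hb⟩ => ⟨a + b, by omega⟩
```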
The task is hard enough to have drawn the attention of renowned mathematicians such as Terence Tao. The widely used miniF2F benchmark (Zheng et al., 2022), which includes problems from the International Mathematical Olympiad (IMO), features reasoning so complex and multifaceted that even human experts find it challenging. Attacking this task with LLMs presents two significant hurdles. First, as noted in previous research (Wu et al., 2022), there is a critical shortage of high-quality, specialized data for mathematics and theorem proving: LLMs typically learn complex tasks from large amounts of in-domain demonstrations (via fine-tuning), yet human-written Lean/Isabelle proofs of hard mathematical problems are extremely scarce. Second, the task demands deep, multi-step reasoning that pushes the limits of current LLMs. Whether they can perform this level of reasoning consistently remains an open question, making formal theorem proving not just a mathematical challenge but a benchmark for advanced reasoning in artificial intelligence.
In a collaborative effort between the University of Hong Kong and SambaNova Systems, we introduce SubgoalXL, a novel approach to AI-powered theorem proving. SubgoalXL combines subgoal-based proofs (Figure 1a) with an expert learning framework (Figure 1b) that iteratively improves three components: a formal statement generator, a formal proof generator, and a subgoal generator. Each component is trained to sample from its estimated optimal distribution, improving performance on theorem-proving tasks. This design makes markedly better use of scarce data, enabling LLMs to achieve superior results with less data and compute.
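The following Python sketch conveys the shape of such an expert learning loop. It is a minimal, hypothetical illustration of the idea described above, not SubgoalXL's actual code: the names (`Generator`, `verify_with_isabelle`, `fine_tune`, `expert_iteration`) are stand-ins we invented for this post, and the stubs take the place of real LLM calls and the Isabelle proof checker.

```python
from dataclasses import dataclass, field

@dataclass
class Generator:
    """Stand-in for one LLM component: statement, subgoal, or proof generator."""
    name: str
    data: list = field(default_factory=list)

    def generate(self, *context: str) -> str:
        # A real system would sample from the LLM conditioned on the context.
        return f"{self.name}({', '.join(context)})"

def verify_with_isabelle(statement: str, proof: str) -> bool:
    """Stand-in for checking a candidate proof with the Isabelle checker."""
    return True  # placeholder; in practice only checker-verified proofs pass

def fine_tune(model: Generator, samples: list) -> Generator:
    """Stand-in for fine-tuning a component on newly verified samples."""
    model.data.extend(samples)
    return model

def expert_iteration(stmt_gen, subgoal_gen, proof_gen, problems, rounds=3):
    """One expert-learning loop: generate, verify, retrain each component."""
    for _ in range(rounds):
        verified = []
        for informal in problems:
            statement = stmt_gen.generate(informal)               # informal -> formal statement
            subgoals = subgoal_gen.generate(informal, statement)  # decompose into subgoals
            proof = proof_gen.generate(statement, subgoals)       # draft a formal proof
            if verify_with_isabelle(statement, proof):
                verified.append((informal, statement, subgoals, proof))
        # Retraining on verified outputs nudges each component toward its
        # estimated optimal distribution before the next round.
        stmt_gen = fine_tune(stmt_gen, verified)
        subgoal_gen = fine_tune(subgoal_gen, verified)
        proof_gen = fine_tune(proof_gen, verified)
    return stmt_gen, subgoal_gen, proof_gen
```

The essential design choice this sketch captures is that the proof checker acts as an exact filter: only verified proofs feed back into training, so each round of the loop can improve the components without any additional human-written data.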
SubgoalXL achieves a new state-of-the-art accuracy of 56.1% in Isabelle on the standard miniF2F dataset, an absolute improvement of 4.9% over the previous best (Table 1). It solves 41 AMC12, 9 AIME, and 3 IMO problems from miniF2F. These results highlight the value of maximizing the utility of limited data and of effective guidance for complex reasoning, complementing efforts that scale up data.
For the complete code and further technical details, you can explore our project on GitHub and try out SubgoalXL on the Hugging Face Hub. A detailed explanation of our approach and results is available in our arXiv paper.
A crucial factor in our success was SambaNova Systems' inference technology, which recently surpassed 1,000 tokens per second for large language model inference. This speed let us process complex formal statements and generate proofs far faster than before, and it proved especially valuable for the large-scale theorem-proving workloads SubgoalXL requires. By sharply reducing the time needed for model iterations and experimentation, we were able to train and optimize SubgoalXL more rapidly and effectively, which was key to reaching our leading performance on miniF2F. The collaboration illustrates how cutting-edge AI hardware and innovative algorithms can together push the frontiers of AI-assisted mathematics and theorem proving.
SubgoalXL represents a significant step forward in AI-powered theorem proving. By combining novel methodology with state-of-the-art hardware, we have demonstrated that AI can tackle increasingly complex mathematical reasoning tasks with greater efficiency and accuracy. As we continue to refine and extend this approach, we are excited about its implications for both mathematical advancement and the broader landscape of AI reasoning.