📜 Līla

A Unified Benchmark for Mathematical Reasoning

(EMNLP 2022)
Paper | Blog post | Dataset | Model | Leaderboard (coming soon)

About

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Existing math reasoning datasets are too narrow in scope to holistically evaluate models' general math reasoning abilities. Towards evaluating and improving AI systems in this domain, we propose Līla, a unified mathematical reasoning benchmark consisting of over 140K natural langauge questions from 23 diverse tasks. We also introduce Bhāskara, a foundation model for mathematical reasoning trained on Līla.

An example of Līla annotations

Benchmark

Līla is a comprehensive benchmark for mathematical reasoning with over 140K natural language questions annotated with Python programs and natural language instructions. The data set comes with multiple splits: Līla-IID (train, dev, test), Līla-OOD (train, dev, test), and Līla-Robust. To help measure progress in mathematical reasoning, we host a public leaderboard for Līla.

We gladly accept contributions to Līla. If you have a mathematical reasoning dataset, please open a issue and send us an email and we will work with you to incorporate it!

Question n-gram distribution in Līla

Our benchmark is named after Līlavati, a 12th century mathematical treatise on arithmetic that covers topics like arithmetic and geometric progressions, indeterminate equations and combinations. It is also widely known for the extensive number of math word problems. The author, Bhāskara II is known for fundamental and original contributions to calculus, physics, number theory, algebra, and astronomy. Read more about Bhāskara in our blog post.

Model

Bhāskara is a multi-task mathematical reasoning model. Bhāskara is a fine-tuned version of EleutherAI's GPT-Neo-2.7B on the Līla-IID-train dataset. In our paper, we show that further fine-tuning Bhāskara on new mathematical tasks outperforms fine-tuning similarly sized models. Because our model is available on HuggingFace Hub, using it is as easy as:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("allenai/bhaskara")
model.generate("Question: What is the square root of 15?\nAnswer:")

Authors

Swaroop Mishra*
Arizona State Universty
Matthew Finlayson*
The Allen Institute for AI

Pan Lu
UC Los Angeles
Leonard Tang
Harvard Universty
Sean Welleck
The Allen Institute for AI
Chitta Baral
Arizona State Universty
Tanmay Rajpurohit
Georgia Institute of Technology
Oyvind Tafjord
The Allen Institute for AI
Ashish Sabharwal
The Allen Institute for AI
Peter Clark
The Allen Institute for AI
Ashwin Kalyan
The Allen Institute for AI
*Equal first authors

Contact

Questions? Want to get in touch? Contact Matthew Finlayson (matthewf@allenai.org) or open an issue on GitHub.

Cite us

@INPROCEEDINGS{Mishra2022Lila,
  author = {
    Swaroop Mishra 
      and Matthew Finlayson 
      and Pan Lu 
      and Leonard Tang 
      and Sean Welleck 
      and Chitta Baral 
      and Tanmay Rajpurohit 
      and Oyvind Tafjord 
      and Ashish Sabharwal 
      and Peter Clark 
      and Ashwin Kalyan},
  title = {Lila: A Unified Benchmark for Mathematical Reasoning},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2022}
}