MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
Findings of the Association for Computational Linguistics: ACL 2024 (2024)
Abstract
Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short in providing a holistic assessment of the LLMs' math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model's mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs' mathematical abilities, providing a nuanced view of their knowledge understanding levels and problem-solving skills in a bilingual context. The project is released at https://github.com/open-compass/MathBench .
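The stage-by-question-type breakdown described in the abstract can be pictured with a small scoring sketch. This is only an illustration under stated assumptions: the `Result` record, its field names, the 1 to 5 stage numbering, the language tags, and the `accuracy_by_stage_and_kind` helper are all hypothetical and are not taken from the MathBench repository or its actual scoring protocol.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded model answer (hypothetical schema, not MathBench's format)."""
    stage: int      # assumed numbering: 1 = basic arithmetic ... 5 = college mathematics
    kind: str       # "theory" or "application"
    language: str   # illustrative language tag, e.g. "en" or "zh"
    correct: bool


def accuracy_by_stage_and_kind(results):
    """Return {(stage, kind): accuracy} over a list of Result records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r.stage, r.kind)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy, made-up records used only to exercise the function.
demo = [
    Result(1, "theory", "en", True),
    Result(1, "application", "en", True),
    Result(1, "application", "zh", False),
    Result(5, "theory", "zh", False),
]
scores = accuracy_by_stage_and_kind(demo)
```

Grouping by `(stage, kind)` keeps theory and application accuracy separate at each level, which is the kind of two-dimensional view the abstract contrasts with single-score benchmarks. Adding `language` to the key would give a per-language breakdown in the same way.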