2605.09063
2026-05-20
cs.CL
Version update
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Guijin Son, Seungone Kim, Catherine Arnett, Hyunwoo Ko, Hyein Lee, Hyeonah Kang, Jiang Longxi, Jin Yun, JungYup Lee, Kyungmin Lee, Sam Yoosuk Kim, Sang Park, Seunghyeok Hong, SeungJae Lee, Seungyeop Yi, Shinae Shin, SunHye Bok, Sunyoung Shin, Yonghoon Ji, Youngtaek Kim, Hanearl Jung, Akari Asai, Graham Neubig, Sean Welleck, Youngjae Yu, Akshelin R, Alexander B. Ivanov, Boboev Muhammadjon, Chae Young Han, Christian Stump, Cooper R. Anderson, Dmitrii Karp, Dohyun Kwon, Dongryung Yi, DoYong Kwon, Duk-Soon Oh, Eunho Choi, Giovanni Resta, Greta Panova, Huiyun Noh, Hyungryul Baik, Hyungsun Bae, Inomov Mashrafdzhon, Jeewon Kim, Jeong-Rae Kim, Ji Eun Lee, Jiaqi Liu, Jieui Kang, Jimin Kim, Jon-Lark Kim, Joonyeong Won, Junseo Yoon, Junwoo Jo, Kibeom Kim, Kiwoon Kwon, Mario Kummer, Max Mercer, Min Hoon Kim, Minjun Kim, Nahyun Lee, Ng Ze-An, Nicolas Libedinsky, Rafał Marcin Łochowski, Raphaël Lachièze-Rey, Robert Auffarth, Ruichen Zhang, Sejin Park, Seonguk Seo, Shin Jaehoon, Sunatullo, Taewoong Eom, Yeachan Park, Yongseok Jang, Youchan Oh, Zhaoyang Wang, Zoltán Kovács
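AI Summary
This paper introduces the Soohak benchmark, a set of 439 problems written independently by 64 mathematicians, to evaluate how well large language models handle research-level mathematics. It also includes a refusal subset that tests whether models can recognize ill-posed problems.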