arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Large Language Models and Foundation Models (68 papers)

2606.14943 2026-06-16 cs.CL cs.LG New submission

Simplifying the Modeling of Arbitrary Conditionals in Natural Language

Yinhan Lu, Eric Elmoznino, Léo Gagnon, Sarthak Mittal, Tejas Kasetty, Guillaume Lajoie

Affiliations: Mila - Quebec AI Institute; McGill University; Université de Montréal

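AI summary: Proposes AC-GPT, a simple modification to standard causal Transformers that evaluates and samples arbitrary conditionals (past, future, and mixed contexts) in a single forward pass. It keeps the left-to-right ordering and the next-token prediction objective, and does not degrade standard left-to-right performance.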

Abstract

Causal Transformers model sequences through an autoregressive factorization of the joint distribution, which enables efficient left-to-right decoding and conditional likelihood computation. However, they cannot tractably sample from or evaluate arbitrary conditionals -- e.g., a block of text conditioned on past and future tokens. Recent work aims to solve this problem through novel architectures, but they often lead to sub-optimal modeling of such conditionals and degraded generations. We propose Arbitrary Conditionals GPT (AC-GPT) which introduces a simple modification to standard causal Transformers to enable evaluating and sampling from arbitrary conditionals -- including past, future, and mixed contexts -- within a single forward pass. Unlike prior approaches, our method preserves the standard left-to-right ordering and next-token prediction objective essential for both strong performance and efficient training on natural language. Crucially, this compatibility allows existing LLMs to be fine-tuned for arbitrary conditioning. Our empirical results indicate that our method outperforms baselines on modeling arbitrary conditionals, without degrading standard left-to-right performance.
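
To make the tractability problem concrete, the sketch below uses a toy bigram model (the vocabulary, the probabilities, and the function names are all hypothetical, not taken from the paper). A left-to-right model gives the joint probability cheaply through the chain rule, but a conditional on both past and future tokens has to be obtained by scoring every candidate infill and renormalizing, which costs |V|^k joint evaluations for a block of k tokens. This is the brute-force baseline that a single-pass method would avoid, not AC-GPT itself.

```python
from itertools import product

VOCAB = ["a", "b"]
# Hypothetical bigram model: P(next | prev); "<s>" marks the sequence start.
BIGRAM = {
    "<s>": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.2, "b": 0.8},
}


def joint(seq):
    """Left-to-right chain rule: p(x) = prod_t p(x_t | x_{t-1})."""
    prob, prev = 1.0, "<s>"
    for tok in seq:
        prob *= BIGRAM[prev][tok]
        prev = tok
    return prob


def infill(prefix, suffix, block_len):
    """p(block | prefix, suffix), computed by enumerating every candidate block."""
    scores = {
        block: joint(list(prefix) + list(block) + list(suffix))
        for block in product(VOCAB, repeat=block_len)
    }
    total = sum(scores.values())  # marginal p(prefix, suffix) over all blocks
    return {block: score / total for block, score in scores.items()}


# Distribution of the middle token given one past and one future token.
posterior = infill(prefix=["a"], suffix=["b"], block_len=1)
```

With two tokens and a one-token gap the enumeration is trivial; with a realistic vocabulary and a multi-token block it is not, which is the gap the abstract describes.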

2606.14961 2026-06-16 cs.CL New submission

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

Juming Xiong, Weixin Liu, Kevin Guo, Congning Ni, Junchao Zhu, Chongyu Qu, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin

Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research

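AI summary: Proposes a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer confidence, and rationale support, reducing confidence-rationale alignment error and improving reasoning reliability.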

Abstract

Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence–rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence–rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.
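
The abstract names three reward terms but not how they are combined, so the following is only a sketch under stated assumptions: a weighted sum with made-up weights, followed by the usual GRPO step of normalizing rewards within the group of rollouts sampled for one prompt. The function names and the example numbers are hypothetical.

```python
from statistics import mean, pstdev


def joint_reward(correct, answer_prob, rubric_score,
                 w_correct=1.0, w_conf=0.5, w_rubric=0.5):
    """Assumed weighted sum of the three terms; the weights are illustrative."""
    return w_correct * float(correct) + w_conf * answer_prob + w_rubric * rubric_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: centre and scale rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# (correct?, committed-answer probability, rubric score in [0, 1])
rollouts = [
    (True, 0.9, 0.8),   # correct, confident, well-supported rationale
    (True, 0.9, 0.2),   # correct and confident, weakly supported rationale
    (False, 0.9, 0.3),  # confidently wrong
]
rewards = [joint_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumed weights, two equally correct and equally confident rollouts are separated only by how well the rationale supports the answer, which is the behaviour the rubric term is meant to encourage.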

2606.15007 2026-06-16 cs.CL cs.AI cs.LG New submission

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA, :, Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif, Aditya Vavre, Adriana Flores Miranda, Ahmad Bilal, Aileen Zaman, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Alex Gronskiy, Alex Kondratenko, Alex Steiner, Alex Ye, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alice Gatti, Alisa Liu, Alok Kumar, Amar Phanishayee, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrea Santilli, Andrew Fulks, Andrew McHarg, Andrew Tao, Andrii Skliar, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Anna Shors, Anna Warno, Antoni-Joan Solergibert I Llaquet, Arham Mehta, Arkadiusz Nowaczynski, Arti Jain, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Avinash Vem, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bo Deng, Bob Schatz, Boris Ginsburg, Boxin Wang, Brad Nemire, Brandon Norick, Brian Dang, Brian Westphal, Brian Yu, Brucek Khailany, Bryan Catanzaro, Carlo del Mundo, Caryln Aarish, Chankyu Lee, Chantal Hwang, Charbel Sakr, Charles Wang, Charlie Truong, Chen Cui, Cheng Cheng, Cheng-Ping Hsieh, Chenghao Zhang, Chenhui Deng, Chintan Patel, Chris Alexiuk, Christian Cosgrove, Christian Munley, Christine Harvey, Christopher Parisien, Chunyang Shen, Coco Li, Collin Neale, Cynthia Gao, Cyril Meurillon, Dan Gil, Dan Su, Dan Zhao, Dane Corneil, Daniel Afrimi, Daniel Egert, Daniel Korzekwa, Daniel Lo, Daniel Machlab, Daniel Serebrenik, Daniil Sorokin, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, David Yu, Davit Karamyan, Deena Donia, Deep Debroy, Deepak Narayanan, Devin O'Kelly, Dheeraj Peri, Dhruv Nathawani, Di, Wu, Dima Rekesh, Divyanshu Kakwani, Donald Plummer, Dong Anh, Dongfeng Yu, Dongfu Jiang, Donnie 
Kim, Dorrin Poorkay, Duncan Riach, Dusan Stosic, Dustin VanStee, Eavan Meng, Edgar Minasyan, Edward Lin, Eileen Margaret Peters Long, Elad Sarafin, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Pham-Hung, Eric Tramel, Eric Yang, Erick Galinkin, Erik Pounds, Erika Goncalves Goncalves, Evan Briones, Evan Wu, Evelina Bakhturina, Evgeny Tsykunov, Ewa Dobrowolska, Faisal Ladhak, Farzan Memarian, Fay Wang, Fei Jia, Felipe Soares, Felipe Vieira Frujeri, Feng Chen, Fengguang Lin, Ferenc Galko, Frank Sun, Frankie Siino, Frida Hou, Gal Hubara Agam, Gal Kaplun, Gantavya Bhatt, Gargi Prasad, Garvit Kulshreshtha, George Armstrong, Gerald Shen, Giulio Borghesi, Gordana Neskovic, Gorkem Batmaz, Grace Lam, Greg Mason, Greg Pauloski, Grigor Nalbandyan, Grzegorz Chlebus, Grzegorz Karch, Guan-Ting Liu, Guoming Zhang, Guyue Huang, Haggai Maron, Haifeng Qian, Haim Elisha, Haoxing Ren, Haran Kumar Shiv Kumar, Haribhau Hud, Harris Nover, Harrison Saturley Hall, Hayate Iso, Helen Ngo, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hovhannes Tamoyan, Hua Li, Huanhuan Chen, Hui Li, Hui Wang, Huy Nguyen, Ian Chiles, Ido Galil, Ido Shahaf, Igor Gitman, Igor Shovkun, Ilya Loshchilov, Ingo Guehring, Itamar Schen, Itay Levy, Itay Neeman, Ivan Moshkov, Izik Golan, Izzy Putterman, Jaemin Choi, Jakub Slowikowski, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jiacheng Xu, Jiafan Zhu, Jialin Song, Jian Zhang, Jiantao Jiao, Jiaqi Zeng, Jie Lou, Jim King, Jimmy Zhang, Jingquan Wang, Jinhang Choi, Jinju Chu, Joey Conway, Joey Guman, Johan Jatko, Johannes Rausch, John Kamalu, John Roberts, Johnny Greco, Johnny Mensel, Jonah Alben, Jonas Yang, Jonathan Cohen, Jonathan Raiman, Joseph Jennings, Joshua Mabry, Joshua Pierce, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kajal Jain, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keith Willowhawk, Keith Wyss, Keshav Santhanam, Kevin Shih, 
Kezhi Kong, Khanh Nguyen, Khushi Bhardwaj, Kirthi Shankar Sivamani, Konstantinos Krommydas, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Kyle Keprios, Kylie Day, Lawrence McAfee, Leo Du, Leon Derczynski, Li Ding, Linda Liu, Lingjie Wu, Lior Kadoch, Lizzie Wei, Luis Vega, Luke Robison, Lun Su, Maarten Van Segbroeck, Maciej Jakub Mikulski, Maer Rodrigues de Melo, Magda Sypula, Mahan Fathi, Makesh Narsimhan Sreedhar, Makesh Tarun Chandran, Manoj Kilaru, Maor Ashkenazi, Marc Cuevas, Marc Romeijn, Marcin Chochowski, Mark Cai, Mark Mozolewski, Markus Kliegl, Marta Stepniewska-Dziubinska, Martyna Patelka, Mattei Machczynski, Matvei Novikov, Mauricio Ferrato, Maximilian Golub, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Mengxi Wu, Meredith Price, Meriem Boubdir, Micah Schaffer, Michael Andersch, Michael Boone, Michael Gschwind, Michael Lightstone, Michael Loh, Michal Bien, Michal Zawalski, Michelle Gill, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Mike Houston, Mingyuan Ma, Minseok Lee, Mohamed Fawzy, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Najeeb Nabwani, Namit Dhameja, Narimane Hennouni, Natalie Hereth, Nathaniel Pinckney, Nave Algarici, Nave Assaf, Netanel Haber, Nicholas Knight, Nick Reamaroon, Nickson Quak, Nidhi Bhatia, Nikhil Desai, Nikolai Ludwig, Nima Tajbakhsh, Ning Xu, Nir Ailon, Nirmal Juluru, Nitin Nitin, Ofri Masad, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivia Viessmann, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Puny, Oren Tropp, Pablo Ribalta, Pallab Bhattacharya, Panos Lampropoulos, Parth Mannan, Pasha Shamis, Patrick Legresley, Paul Gibbons, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pierre-Yves Aquilanti, Pinky Xu, Piotr Januszewski, Piotr Laskiewicz, Pooya Jannaty, Prakash Gurumurthy, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Puhui Meng, Qiyu Wan, Rabeeh Karimi Mahabadi, Rachel Oberman, Rachit Garg, Radha Sri-Tharan, Rahul Kandu, Rakshit 
Sanadhya, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Ray Macalisang, Rayen Tian, Reka Kovacs, Renjie Pi, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Rishi Puri, Rita Fernandes Neves, Ritchie Zhao, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Clark, Robert Hesse, Robert Kirby, Roger Waleffe, Rohit Watve, Roi Koren, Ron Banner, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Stewart, Ryota Egashira, Sadegh Mahdavi, Saee Paliwal, Sagar Singh, Sahil Modi, Salika Dave, Samantha Shinagawa, Samuel Kriman, Sandip Bhaskar, Sangkug Lym, Sanjay Kariyappa, Sanjeev Satheesh, Saran Vikas Murari, Satish Pasumarthi, Saurabh Mishra, Saurav Muralidharan, Scott Hara, Sean Narentharen, Selvaraj Anandaraj, Seonjin Na, Seonmeyong Bak, Seonmyeong Bak, Sepehr Sameni, Seph Mard, Serge Panev, Seth Henneman, Seth Poulos, Shahar Mor, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Mendelson, Shaun Kotek, Shawn Wang, Shay Aharon, Shaya Gharghabi, Sheng-Chieh Lin, Shi Chen, Shiqing Fan, Shirish Baskaran, Shreya Gopa, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Shwetha Krishnamurthy, Siddharth Singh, Simeng Sun, Sirshak Das, Sivakumar Arayandi Thottakara, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Sri Harsha Singudasu, Sridhar Bhuvanapalli, Srimukh Veccham, Stas Sergienko, Stefania Alborghetti, Stephen Ge, Su Rong, Sugam Dipak Devare, Sukrit Rao, Sumeet Kumar Barua, Sungsoo Ha, Sunny Gai, Suriya Gunasekar, Suseella Panguluri, Suyog Gupta, Sviataslau Hinzburh, Sweta Priyadarshi, Syeda Nahida Akter, Talor Abramovich, Tan Bui, Tanay Varshney, Tatevik Ter-Hovhannisyan, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tianhe Zhang, Tiffany Moore, Tijmen Blankevoort, Tim Moon, Tiyasa Mitra, Tom Balough, Tomasz Grzegorzek, Tomasz Hliwiak, Tomer Asida, Tomer Bar Natan, Tomer Keren, Tomer Ronen, Tony Salim, Tony Wang, Traian Rebedea, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Venkat Srinivasan, Venmugil Elango, Vibhor 
Agrawal, Victor Cui, Vijay Korthikanti, Vikas Mehta, Vinay Rao, Virginia Wu, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Vu Pham, Wanli Jiang, Wasi Uddin Ahmad, Wataru Ishihara, Wei Du, Wei Ping, Weiheng Chai, Wenliang Dai, Wesley Helmholz, Will Jennings, Will Zhu, Wojciech Prazuch, Xiaowei Ren, Xiwen Yu, Yan Breek, Yang Chen, Yang Yu, Yangyi Chen, Yaniv Galron, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Youngeun Kwon, Yu Yao, Yugi Guvvla, Yuki Huang, Yunsheng Liu, Zach Moshe, Zachary Newell, Zhilin Wang, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zihan Liu, Zijie Yan, Zsolt-Alon Wertheimer

Affiliations: NVIDIA

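AI summary: Presents Nemotron 3 Ultra, a Mixture-of-Experts hybrid Mamba-Attention language model with 550B total and 55B active parameters. After pre-training on 20T tokens, context extension to 1M tokens, and post-training, it reaches up to about 6x higher inference throughput while matching the accuracy of leading open models.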

Abstract

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

2606.15070 2026-06-16 cs.CL New submission

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

Jiakai Li, Ke Qin, Rongzheng Wang, Yizhuo Ma, Qizhi Chen, Muquan Li, Shuang Liang

Affiliations: University of Electronic Science and Technology of China; Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province

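AI summary: Addresses overthinking in large reasoning models, which produces redundant output and lowers accuracy, with ASAG, a training-free attention-state adaptive generation method that infers the model's reasoning state and adjusts the generation strategy accordingly. On Qwen3-8B it improves average accuracy by 3.2% while cutting generated tokens by nearly 40%.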

Comments ICML 2026 Spotlight

Abstract

By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model's reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.

2606.15079 2026-06-16 cs.CL cs.AI New submission

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Ang Li, Ben Liu, Bin Han, Bin Hu, Bin Jing, Binbin Hu, Bing Li, Cai Chen, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Liang, Chen Qian, Chengfu Tang, Chengyao Wen, Chilin Fu, Chunwei Wu, Cong Zhang, Cunyin Peng, Daixin Wang, Dalong Zhang, Deng Zhao, Dingnan Jin, Dingyuan Zhu, Donghao Zhang, Fan Yuan, Fangzheng Zhao, Fanzhuang Meng, Feifan Wu, Feng Xu, Fengbin Fang, Gangshan Wang, Guodong Yang, Hailin Zhao, Haitao Wang, Haitao Zhang, Hanxiao Zhang, Hanzi Wang, Hao Dai, Hao Liu, Hao Qian, Hao Wu, Haoxiong Liu, Haoyu Xu, Heng Zhang, Hong Liu, Hongliang Zhang, Hongrui Liu, Hongxun Li, Hongzhi Ruan, Huaidong Xiong, Huihuang Zheng, Huikang Tang, Jia Guo, Jia Li, Jia Liu, Jiameng Wang, Jiaming Liu, Jiannan Shi, Jianping Wei, Jiaolong Yang, Jiapeng Wang, Jie Gao, Jie Wang, Jiewei Wu, Jin Yang, Jinjin Li, Jinjing Huang, Jinquan Sun, Jinyao Chen, Juanhui Tu, Jun Liu, Jun Mei, Jun Xu, Jun Zhou, Junjie Ou, Junnan Sipan, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kuan Xu, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Chen, Lei Liang, Lei Xu, Li Tang, Liang Jiang, Liangcheng Fu, Lihui Zhang, Linfeng Shi, Lintao Ma, Liyuan Liu, Longfei Li, Longfei Zheng, Lu Liu, Lu Yu, Man Li, Meiqi Zhu, Meng Li, Mengjie Gao, Mengshu Sun, Mingming Yin, Mingyang Zhang, Mingyuan Fan, Nuo Xu, Pan Tang, Peijie Jiang, Peilong Zhao, Peng Lin, Pingping Liu, Qi Zuo, Qian Zhao, Qiang Cheng, Qianggang Cao, Qiaoben Bao, Qing Cui, Qingyuan Yang, Qitao Shi, Qiyin Huang, Qizheng Zhou, Quan Wan, Runyuan Zhao, Shaomian Zheng, Shaowei Wei, Shengnan Zhang, Shuaicheng Li, Shujie Li, Shuo Zhang, Sikang Bian, Tianchu Yao, Tiange Xu, Tianshu Wang, Ting Guo, Tinghao Wang, Tingwei Huang, Tong Zhao, Tongkai Yang, Wang Hong, Wanli Gu, Wei Lu, Weichang Wu, Weiguang Han, Weiquan Li, Wenbo Shen, Wenjing Fang, Wenzhi Tang, Xiang Shu, Xiao Shi, Xiaodong Yan, Xiaolu Zhang, Xiaopei Wan, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xinxing Yang, Xinyao Tang, Xinyu Kong, Xinyu Liu, Xiong Xu, Xuan Sun, Xudong Han, Xudong Wang, 
Xujie Shen, Yalin Zhang, Yangyang Hou, Yankun Ren, Yao Zhao, Ye Chen, Yeyang Chen, Yibo Cao, Yifan Zuo, Yijie Chen, Ying Li, Yingjie Song, Yingxue Li, Yiqi Wang, Yixuan Sun, Yizhu Xiao, Yongfei Xu, Yu Liu, Yuchen Fang, Yue Gao, Yue Yu, Yue Zhang, Yuqi Zhang, Yuxiao He, Yuxiao Lu, Yuxin Tian, Yuxuan Li, Yuzhuo Fu, Zhankai Xu, Zhaoxin Huan, Zhenduo Zhang, Zhengke Gui, Zhengyu Huang, Zhenjun Ma, Zhenxuan Pan, Zheping Qu, Zhibo Zhu, Zhidong Fan, Zhigang Huangfu, Zhihao Wang, Zhiqiang Zhang, Zhizhen Liu, Zhuyan Zhou, Zibin Lin, Zihang Zeng, Zihao Wang, Zilong Wang, Ziqi Liu, Zitao Xuan, Zixuan Cheng, Zujie Wen, Zuoli Tang

Affiliations: Ling Team; Inclusion AI

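AI summary: Presents the Ling-2.6 and Ring-2.6 model families, which combine architectural migration pre-training, a hybrid linear attention design, and the KPop reinforcement learning framework to deliver low latency, strong reasoning, and efficient deployment. All checkpoints are open-sourced.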

Abstract

Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

2606.15080 2026-06-16 cs.CL cs.AI New submission

AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

Dayeon Ki, Kevin Duh, Marine Carpuat

Affiliations: University of Maryland; Johns Hopkins University

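AI summary: Targets language collapse in multilingual reasoning with AdaMame, a two-stage training recipe that first builds multilingual capability through SFT and then aligns the reasoning language with the query language through an adaptive GRPO variant, reaching Pareto-optimal accuracy, language fidelity, and token efficiency.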

Comments 20 pages, 5 figures

Abstract

While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur trade-off in accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages.
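
The explore-then-exploit idea behind a progressively growing alignment factor can be sketched as a reward whose language term is switched on gradually. Everything specific here is an assumption: the abstract does not give the schedule or the reward form, so the linear ramp, the additive combination, and the `fidelity` input (taken as the fraction of the reasoning trace written in the query language) are illustrative stand-ins, not AdaMame-GRPO itself.

```python
def alignment_factor(step, total_steps, max_weight=1.0):
    """Hypothetical linear ramp from 0 to max_weight over the course of training."""
    return max_weight * min(1.0, step / total_steps)


def shaped_reward(correct, fidelity, step, total_steps):
    """Accuracy reward plus a language-fidelity term weighted by the ramp.

    fidelity is assumed to lie in [0, 1].
    """
    return float(correct) + alignment_factor(step, total_steps) * fidelity


# Reward gap between a trace in the query language and one that is not,
# for equally correct answers, at the start and at the end of training.
early_gap = shaped_reward(True, 1.0, 0, 100) - shaped_reward(True, 0.0, 0, 100)
late_gap = shaped_reward(True, 1.0, 100, 100) - shaped_reward(True, 0.0, 100, 100)
```

Early in training the two traces earn the same reward, so the policy is free to reason in any language; by the end, only the trace in the query language earns the full reward.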

2606.15216 2026-06-16 cs.CL cs.AI New submission

Spokes: Optimizing for Diverse Pretraining Data Selection

Clarence Lee, Yejin Choi, Luke Zettlemoyer, Pang Wei Koh, Hai Leong Chieu

Affiliations: DSO National Laboratories; Stanford University; University of Washington

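AI summary: Proposes a probabilistic diversification framework based on the G-Vendi score that optimizes data diversity directly through exponentiated gradient descent. With quality and diversity optimized jointly, it improves average downstream performance by 1.5 points on DCLM and 1.4 points on FineWeb.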

Comments 9 pages, 4 figures

Abstract

Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, SPOKES (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: SPOKES achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.
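
To illustrate what optimizing a set-level diversity objective over sampling probabilities can look like, the sketch below runs exponentiated gradient descent on the probability simplex. It does not implement the G-Vendi score: as a stand-in it minimizes an order-2 purity term, tr(rho^2) with rho = diag(sqrt p) K diag(sqrt p), whose inverse behaves like an effective number of distinct items. The similarity matrix, step size, and function names are hypothetical.

```python
import math


def purity(p, K):
    """tr(rho^2) = sum_ij p_i p_j K_ij^2 for symmetric K; lower means more diverse."""
    n = len(p)
    return sum(p[i] * p[j] * K[i][j] ** 2 for i in range(n) for j in range(n))


def eg_step(p, K, lr=0.5):
    """One exponentiated-gradient descent step; the result stays on the simplex."""
    n = len(p)
    grad = [2.0 * sum(K[i][j] ** 2 * p[j] for j in range(n)) for i in range(n)]
    weights = [p[i] * math.exp(-lr * grad[i]) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# Items 0 and 1 are duplicates of each other; item 2 is unrelated to both.
K = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
p = [1 / 3, 1 / 3, 1 / 3]
for _ in range(100):
    p = eg_step(p, K)
effective_items = 1.0 / purity(p, K)
```

The multiplicative update keeps every probability positive and normalized without a projection step. In this toy case it moves mass away from the duplicated pair until the pair and the unrelated item each hold half of the total, which is the kind of redundancy reduction the abstract attributes to diversity-aware selection.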

2606.15333 2026-06-16 cs.CL cs.LG New submission

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

Zirui Pang, Chenlong Zhang, Haosheng Tan, Zhuoran Jin, Jiaheng Wei, Zixin Zhong

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Institute of Automation, Chinese Academy of Sciences; University of Glasgow

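AI summary: Addresses the under-use of hard samples by on-policy optimization in LLM unlearning with ReRULE, which stores low-reward hard-case rollouts in an off-policy replay buffer and reuses them, improving unlearning efficiency while preserving general utility.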

Abstract

LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as learning a refusal behavior, but their on-policy optimization repeatedly samples from the same forget and retain/boundary prompts throughout training. We identify a critical inefficiency in this process: easy cases quickly converge and provide little useful gradient signal, while hard cases near the forget/retain boundary continue to produce low-reward rollouts that are discarded after a single use. To address this issue, we propose ReRULE, an off-policy replay enhancement for reinforcement unlearning. ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training and reuses them in later stages through importance-sampled off-policy updates, redirecting computation toward boundary cases that still require learning. Theoretically, we show that ReRULE yields a tighter hard-case convergence bound than pure on-policy RULE. Empirically, ReRULE improves MUSE-Books Retain Quality from 46.3 to 56.2 while adding only 5–11% training time across benchmarks. Its limited improvement on the simpler TOFU setting further supports the intended conditional behavior: replay is most beneficial when the hard/easy disparity is pronounced.
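
The store-then-reuse mechanism can be sketched as a bounded buffer of low-reward rollout groups plus an importance ratio that corrects for the policy having changed since the rollouts were sampled. The class, the mean-reward threshold rule, and the clipping range are hypothetical choices for illustration; the abstract does not specify how hard cases are selected or how the ratio is bounded.

```python
import math
from collections import deque


class HardCaseReplayBuffer:
    """Keeps the most recent low-reward rollout groups (oldest evicted first)."""

    def __init__(self, capacity, reward_threshold):
        self.groups = deque(maxlen=capacity)
        self.reward_threshold = reward_threshold

    def maybe_store(self, prompt, rollouts):
        """rollouts: list of (text, reward, behaviour_logprob) tuples."""
        mean_reward = sum(reward for _, reward, _ in rollouts) / len(rollouts)
        if mean_reward < self.reward_threshold:  # assumed definition of "hard"
            self.groups.append((prompt, rollouts))
            return True
        return False


def importance_weight(current_logprob, behaviour_logprob, clip=0.2):
    """pi_current / pi_behaviour for a stored rollout, clipped to [1-clip, 1+clip]."""
    ratio = math.exp(current_logprob - behaviour_logprob)
    return max(1.0 - clip, min(1.0 + clip, ratio))


buffer = HardCaseReplayBuffer(capacity=2, reward_threshold=0.5)
stored_hard = buffer.maybe_store("boundary prompt", [("r1", 0.1, -2.0), ("r2", 0.3, -1.5)])
stored_easy = buffer.maybe_store("easy prompt", [("r1", 0.9, -0.5), ("r2", 1.0, -0.4)])
weight = importance_weight(current_logprob=-1.0, behaviour_logprob=-2.0)
```

Storing the behaviour log-probability alongside each rollout is what makes the later off-policy update possible, since the ratio needs the probability under the policy that originally produced the sample.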

2606.15378 2026-06-16 cs.CL cs.LG New submission

Rethinking the Role of Efficient Attention in Hybrid Architectures

Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu

Affiliations: Tsinghua University; OpenBMB

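AI summary: Systematically analyzes how efficient attention modules in hybrid architectures, such as sliding-window attention and recurrent sequence mixers, affect model capability. It finds that they mainly govern how quickly long-context capability emerges, identifies the "Large-Window Laziness" phenomenon, and shows that removing positional encoding from only the full-attention layers improves long-context performance.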

Comments 23 pages, 13 figures

Abstract

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
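
The difference between the two layer types in such a hybrid comes down to which earlier positions a query may attend to. The sketch below builds both boolean masks in plain Python; the function names and the window convention (a window of w covers the current token and the w - 1 tokens before it) are assumptions made for illustration.

```python
def causal_mask(seq_len):
    """Full causal attention: query q may attend to every key k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Sliding-window attention: query q sees only the last `window` positions."""
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]


full = causal_mask(6)
swa = sliding_window_mask(6, window=2)

# Number of attended (query, key) pairs: quadratic growth vs. linear growth.
full_pairs = sum(map(sum, full))
swa_pairs = sum(map(sum, swa))
```

In the sliding-window mask the last query cannot see the first token directly, so any retrieval over that distance has to go through a full-attention layer, which is consistent with the abstract's finding that long-range retrieval is mainly carried by full attention.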

2606.15419 2026-06-16 cs.CL cs.AI New submission

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Zaifu Zhan, Shuang Zhou, Rui Zhang

Affiliations: University of Minnesota

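AI summary: Proposes a multi-agent peer-reviewed reasoning method in which multiple LLMs independently generate chain-of-thought reasoning, evaluate one another's chains, and answer from the highest-rated chain. It outperforms single-model and majority-voting baselines on three medical question answering datasets.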

Comments Accepted by the Journal of the American Medical Informatics Association

Abstract

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.
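
The selection step in the Method can be sketched as follows, given ratings that have already been collected. The function name, the agent labels, and the scores are made up, and two details are assumptions rather than facts from the abstract: each agent's rating of its own chain is ignored, and peer ratings are aggregated by a simple mean.

```python
def select_by_peer_review(scores):
    """scores[reviewer][author] is the reviewer's rating of the author's chain.

    Returns the author whose chain has the highest mean peer rating,
    together with the mean rating of every author.
    """
    agents = list(scores)
    mean_rating = {}
    for author in agents:
        peer_ratings = [scores[reviewer][author]
                        for reviewer in agents if reviewer != author]
        mean_rating[author] = sum(peer_ratings) / len(peer_ratings)
    best = max(mean_rating, key=mean_rating.get)
    return best, mean_rating


# Hypothetical ratings on a 1-10 scale; each agent rates its own chain 10.
scores = {
    "A": {"A": 10, "B": 6, "C": 8},
    "B": {"A": 5, "B": 10, "C": 9},
    "C": {"A": 4, "B": 7, "C": 10},
}
best_agent, mean_rating = select_by_peer_review(scores)
```

Unlike majority voting, this picks a single reasoning chain rather than the most common answer, so a minority answer can win if its rationale is rated highest by the other agents.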

2606.15521 2026-06-16 cs.CL cs.LG 新提交

Emergent retokenization symmetry in large language models: phenomenology and applications

大型语言模型中涌现的重分词对称性:现象学与应用

Kanishk Jain, Matthew Day, Tankut Can

发表机构 * Department of Physics, Emory University(埃默里大学物理系)

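AI summary The study finds that retokenization symmetry partially emerges in large language models during training, uses retokenization experiments to probe sensitivity and robustness to semantically equivalent input representations, and proposes a new inference-time sampling strategy.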

AI中文摘要

分词引入了表示冗余:在固定词表下,每个字节串存在多种有效的分词编码(或切分方式),它们解码后得到相同的表面字符串。然而,给定提示词时,大多数语言模型的分词器通过返回规范切分打破了这种表示对称性。仅基于规范切分进行训练应会影响推理行为,且几乎没有理由期望模型在下游任务中尊重切分对称性。我们发现这种对称性在训练过程中部分涌现。本文通过实验探测这种涌现对称性,测试了分词组合理解、表示多样性和任务导向的基准性能。我们主要使用重分词,即在保持字节完全不变的情况下,将提示词的规范分词替换为另一种切分。相对于其他提示扰动,重分词异常干净,因为它隔离了切分效果而不改变语法、语义或表面形式。我们利用重分词研究预训练和后训练中对语义等价输入表示的敏感性和鲁棒性。此外,这种部分重分词对称性暗示了一个不同的推理时采样轴。温度采样通过模型的下一个词概率分布生成多样输出,而重分词通过语义等价的输入表示从模型内部计算生成多样性。我们发现,虽然这种重分词采样策略在简单问题上可能损害性能,但它也能恢复传统采样无法找到的解决方案。总体而言,我们的工作将重分词呈现为一种简单而强大的大型语言模型探测工具,揭示了组合理解和提示敏感性,并提供了一种新颖的采样策略。

英文摘要

Tokenization introduces representational redundancy: under a fixed token vocabulary, every byte string admits many valid token encodings, or segmentations, that decode to the same surface string. However, given a prompt, most language model tokenizers break this representational symmetry by returning a canonical segmentation. Training only on canonical segmentations should influence inference behavior, and there is little reason to expect models to respect segmentation symmetry on downstream tasks. We find that this symmetry partially emerges during training. Here, we probe this emergent symmetry through experiments testing token compositional understanding, representation diversity, and task-focused benchmark performance. We primarily use retokenization -- replacing a prompt's canonical tokenization with an alternative segmentation while preserving its bytes exactly. Relative to other prompt perturbations, retokenization is unusually clean because it isolates segmentation effects without changing syntax, semantics, or surface form. We use retokenization to study sensitivity and robustness to semantically identical input representations across pretraining and post-training. Moreover, this partial retokenization symmetry suggests a distinct inference-time sampling axis. While temperature sampling generates diverse outputs from the model using its next-token probability distribution, retokenization generates diversity from the model's internal computations through semantically equivalent input representations. We find that while this retokenization sampling strategy can hurt performance on easy problems, it can also recover solutions that conventional sampling does not find. Overall, our work presents retokenization as a simple yet powerful probe of large language models, shedding light on compositional understanding and prompt sensitivity, and offering a novel sampling strategy.
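
To make the redundancy concrete, the sketch below enumerates every segmentation of a short string under a toy vocabulary. The vocabulary and function name are hypothetical, and the example works on characters rather than bytes; a real tokenizer would additionally single out one of these segmentations as canonical.

```python
def segmentations(text, vocab):
    """Yield every way to split `text` into pieces that all belong to `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [piece] + rest

# Toy vocabulary (an assumption for illustration only).
vocab = {"un", "do", "undo", "u", "n", "d", "o"}
all_segs = list(segmentations("undo", vocab))
for seg in all_segs:
    print(seg)
```

Every listed segmentation concatenates back to the same surface string, which is the property retokenization relies on.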

2606.15733 2026-06-16 cs.CL cs.AI 新提交

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

Vernier: 探测因果推理中词汇间隙背后的表征错位

Zhenyu Yu

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院)

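AI summary Using paired-view weight updates and activation patching, the paper finds that answer changes caused by variable-name replacement in causal reasoning stem from representational misalignment rather than information loss, and verifies the realigning effect of counterfactual augmentation on Qwen and Llama models.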

AI中文摘要

指令微调的语言模型在将其英文变量名替换为类型保留的占位符后,可能会对相同的因果推理问题给出不同的答案,尽管结构因果模型和正确答案未变。我们探究这种词汇间隙是否反映了占位符视图中的信息丢失,或是从仍携带答案相关内容的表征中读取时的错位。Vernier 使用配对视图权重更新作为工具,然后检查间隙闭合后留下的机制。在工作状态下,证据支持表征错位。变量名探针在占位符视图上变得更准确,对 Qwen-7B、Qwen-14B 和 Llama-3.1-8B 的激活修补表明,决策令牌表征可以在视图间传递答案身份。重新对齐视图的更新是对原始提示和占位符提示的反事实增强,而答案子空间 KL 主要增强了中间答案信念的一致性。成功受限于模型家族、规模和任务。CRASS 转移在 Qwen 规模和 Llama 上可靠,e-CARE 仍然较弱,初步的非因果重命名任务显示出类似的定性模式。

英文摘要

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.
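
The two prompt views compared above can be illustrated with a small renaming helper. This is a sketch under assumptions: the `X1`, `X2`, ... placeholder scheme and the function name are hypothetical, and the paper's type-preserving placeholders may be constructed differently.

```python
import re

def to_placeholder_view(prompt, variables):
    """Replace each variable name with a placeholder, leaving the rest of the prompt intact."""
    mapping = {name: f"X{i}" for i, name in enumerate(variables, start=1)}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, variables)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], prompt), mapping

original = "smoking raises tar, and tar raises cancer risk"
view, mapping = to_placeholder_view(original, ["smoking", "tar", "cancer"])
print(view)
```

The causal structure of the sentence is unchanged; only the lexical labels differ between the two views.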

2606.15734 2026-06-16 cs.CL cs.AI cs.IR cs.LG 新提交

Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

可检索梯度:无累积权重漂移的持续后训练

Weihang Su, Jiacheng Kang, Jingyan Xu, Qingyao Ai, Jianming Long, Hanwen Zhang, Bangde Du, Xinyuan Cao, Min Zhang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

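AI summary Proposes the ReGrad paradigm, which treats gradients as retrievable units of knowledge and uses meta-learning to reshape document gradients into generalizable adaptation signals, enabling scalable parametric knowledge injection without weight drift.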

AI中文摘要

持续后训练使模型在部署后能够吸收新知识,但重复更新共享参数会累积权重漂移,可能导致灾难性遗忘并降低通用能力。检索增强生成避免了这种参数漂移,但往往缺乏参数化知识整合的深度。在本文中,我们提出ReGrad(可检索梯度),一种将梯度视为可检索知识单元的新范式。ReGrad离线预计算文档特定梯度,存储在索引化的梯度库中,并在推理时仅检索与查询相关的梯度以进行临时权重调整。然而,原始语言建模梯度针对词级文档重建而非查询驱动的知识使用进行优化。因此,我们引入双层元学习目标,将文档派生梯度重塑为下游任务的通用适应信号。在通用和特定领域设置上的实验表明,ReGrad优于CPT和RAG基线,实现了可扩展且可逆的参数知识注入,且不累积权重漂移。

英文摘要

Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.
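
The retrieve-then-adapt idea can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the bank layout, dot-product retrieval, the single gradient-descent-style step, and the learning rate. The abstract does not specify these details, and the meta-learning objective is not modeled.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adapt(weights, bank, query_key, k=1, lr=0.1):
    """Return temporarily adapted weights; the base `weights` list is never mutated."""
    ranked = sorted(bank, key=lambda e: dot(e["key"], query_key), reverse=True)
    adapted = list(weights)
    for entry in ranked[:k]:
        adapted = [w - lr * g for w, g in zip(adapted, entry["grad"])]
    return adapted

base = [1.0, 2.0]
# Hypothetical gradient bank: one retrieval key and one stored gradient per document.
bank = [
    {"key": [1.0, 0.0], "grad": [0.5, 0.0]},
    {"key": [0.0, 1.0], "grad": [0.0, -1.0]},
]
adapted = adapt(base, bank, query_key=[0.9, 0.1])
```

Because the base weights are left untouched, the adaptation is discarded after the query, which is what prevents drift from accumulating across queries.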

2606.15778 2026-06-16 cs.CL cs.AI cs.LG cs.SI 新提交

DYNA: Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

DYNA:用于在持续学习中通过时间知识图谱增强大语言模型的动态情景记忆网络

Ali Sarabadani, Mahtab Tajvidiyan

发表机构 * Department of Computer Engineering and Information Technology, University of Qom(卡姆大学计算机工程与信息科技系)

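AI summary Proposes DYNA, a framework that augments a frozen LLM with a temporal knowledge graph serving as external, updatable memory; on three temporal recall tasks it reduces catastrophic forgetting by about 7% compared to fine-tuning and improves temporal ordering by about 5% over standard RAG.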

AI中文摘要

大语言模型(LLMs)难以在不遗忘或昂贵重训练的情况下融入新知识。我们提出DYNA,一个轻量级框架,通过时间知识图谱增强冻结的LLM,其中事件作为节点,时间关系作为有向、带时间戳的边。该图谱作为外部可更新记忆。在查询时,DYNA通过随机游走和中心性度量检索相关节点,然后增强LLM的响应。在三个时间召回任务上评估,DYNA相比微调减少了约7%的灾难性遗忘,相比标准RAG提升了约5%的时间排序能力。更高的图谱聚类系数与更好的检索相关,表明图谱结构的重要性。贡献:(1)将情景记忆作为时间知识图谱,(2)无需重训练的LLM增强,(3)图谱属性作为检索性能的预测因子。

英文摘要

Large Language Models (LLMs) struggle to incorporate new knowledge without forgetting or costly retraining. We propose DYNA, a lightweight framework that augments a frozen LLM with a temporal knowledge graph where events are nodes and temporal relations are directed, timestamped edges. The graph serves as an external, updatable memory. At query time, DYNA retrieves relevant nodes via random walks and centrality measures, then augments the LLM's response. Evaluated on three temporal recall tasks, DYNA reduces catastrophic forgetting by ~7% compared to fine-tuning and improves temporal ordering by ~5% over standard RAG. Higher graph clustering coefficients correlate with better retrieval, showing that graph structure matters. Contributions: (1) episodic memory as temporal KG, (2) retraining-free LLM augmentation, (3) graph properties as predictors of retrieval performance.
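
A minimal sketch of random-walk retrieval over a temporal knowledge graph follows. The edge format, the event names, and the rule that walks only move forward in time are assumptions for illustration; the centrality measures mentioned in the abstract are not modeled.

```python
import random
from collections import Counter

def random_walk_scores(edges, start, steps, walks, seed=0):
    """edges: (src, dst, timestamp) triples. Counts how often each node is reached.
    Walks only follow edges whose timestamp is not earlier than the last one taken."""
    out = {}
    for src, dst, ts in edges:
        out.setdefault(src, []).append((dst, ts))
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(walks):
        node, now = start, float("-inf")
        for _ in range(steps):
            options = [(d, t) for d, t in out.get(node, []) if t >= now]
            if not options:
                break
            node, now = rng.choice(options)
            visits[node] += 1
    return visits

# Hypothetical events with timestamped, directed relations.
edges = [("launch", "outage", 1), ("outage", "patch", 2), ("launch", "review", 3)]
visits = random_walk_scores(edges, "launch", steps=2, walks=10)
```

Nodes with higher visit counts would be passed to the frozen LLM as retrieved context.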

2606.15821 2026-06-16 cs.CL cs.AI cs.LG 新提交

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

真相留在家族中:通过模型谱系中继承的真相头增强上下文基础

Miso Choi, Seonga Choi, Mincheol Kwon, Woosung Joung, Jinkyu Kim, Jungbeom Lee

发表机构 * University of Science and Technology of China(中国科学技术大学)

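AI summary Finds that context-truthfulness scores are strongly inherited between base LLMs and their downstream variants, and proposes TruthProbe, a soft-gating strategy that amplifies context-truthful heads to improve contextual truthfulness and reduce multimodal hallucination.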

Comments Accepted at ICML 2026

AI中文摘要

大型语言模型(LLM)的最新进展产生了许多共享基础LLM的专业多模态LLM(MLLM),形成了不同的模型谱系。基础LLM与下游变体之间是否存在基本的行为联系尚不清楚。我们通过量化头部级别的上下文真相分数来研究这个问题。在包括基于Vicuna、Qwen2.5、LLaMA2和Mistral的模型在内的多种LLM和MLLM谱系中,我们发现真相分数在模型家族内被强烈保留,即使在指令调优或多模态适应后也是如此。我们进一步表明,这种继承与注意力头权重保留一致,并且上下文真相头关注查询相关的证据。基于这一发现,我们提出了TruthProbe,一种软门控策略,在保留其他头部贡献的同时放大上下文真相头。TruthProbe在HaluEval上提高了上下文真实性,并在POPE和CHAIR上减少了多模态幻觉,基础LLM的真相分数有效转移到其微调的LLM和MLLM后代。代码可在https://github.com/miso-choi/TruthProbe获取。

英文摘要

Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adaptation. We further show that this inheritance is consistent with attention-head weight preservation, and that context-truthful heads attend to query-relevant evidence. Building on this finding, we propose TruthProbe, a soft-gating strategy that amplifies context-truthful heads while preserving other head contributions. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with base-LLM Truth Scores transferring effectively to their fine-tuned LLM and MLLM descendants. Code is available at https://github.com/miso-choi/TruthProbe.
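
One way to read "soft gating" is as a per-head scale factor that never drops below 1, so truthful heads are amplified while the others pass through unchanged. The gate form `1 + beta * score` and the parameter `beta` are assumptions of this sketch; the released TruthProbe code may define the gate differently.

```python
def soft_gate(head_outputs, truth_scores, beta=0.5):
    """Scale each head's output by 1 + beta * score; a score of 0 leaves the head as is."""
    return [
        [(1.0 + beta * score) * x for x in output]
        for output, score in zip(head_outputs, truth_scores)
    ]

# Two hypothetical attention heads: the first is scored as context-truthful, the second is not.
head_outputs = [[1.0, 2.0], [4.0, 4.0]]
gated = soft_gate(head_outputs, truth_scores=[1.0, 0.0])
```

A hard gate would instead zero out low-scoring heads, discarding whatever else they contribute.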

2606.15872 2026-06-16 cs.CL 新提交

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

SciOrch: 学习编排专家大语言模型以解决前沿多模态科学推理任务

Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

发表机构 * Imperial College London(伦敦帝国学院) The Chinese University of Hong Kong(香港中文大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Oxford(牛津大学) Shenzhen Loop Area Institute(深圳环湖研究所)

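AI summary Proposes SciOrch, a framework that trains a lightweight 8B model to orchestrate multiple frontier LLMs, optimized with MCTS and GRPO-style training, and outperforms the strongest single-model and multi-agent baselines on scientific reasoning tasks.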

AI中文摘要

前沿科学推理仍然是大语言模型(LLMs)面临的主要挑战,即使是最强大的商业系统也达不到专家级性能。对模型行为的深入分析揭示了单模型评估所隐藏的显著互补性:不同的前沿模型在不同类型的问题上表现出色,没有一个模型能全面覆盖。我们提出了SciOrch,一个训练轻量级8B模型来编排前沿LLMs进行科学推理的框架。编排器分解每个问题,通过API调用将子问题委托给选定的商业模型,并综合最终答案。训练这样的编排器比传统的智能体强化学习更难:每个动作都会触发一次API调用,这在金钱成本和延迟上都代价高昂,使得标准的在线回滚不可行。我们通过基于MCTS的方法解决了这个问题,生成了多样化的编排轨迹,提取了每个节点的单轮样本,并使用GRPO风格的训练优化编排器。在包含SGI-Reasoning和Scientists' First Exam的240个问题测试集上,SciOrch达到了56.66%的平均准确率,比最强的单个商业模型高出3.74%,比最强的多智能体基线高出3.33%。它还在SGI和SFE上都取得了最佳准确率,而API成本不到典型多智能体方法的一半。

英文摘要

Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.
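
GRPO-style training scores each sample relative to the other samples in its group. The sketch below shows that normalization only, assuming a group is the set of candidate actions sampled at one tree node and using the population standard deviation; both choices are assumptions, not details from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Advantage of each sample = (reward - group mean) / group standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Hypothetical rewards for four orchestration actions sampled at the same node.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Working from stored single-turn samples like these avoids issuing a fresh API call for every policy update.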

2606.15877 2026-06-16 cs.CL cs.AI 新提交

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

自由能启发式:作为不确定精度下主动推理的快速节俭认知

Alex Bogdan

发表机构 * Evolutionairy AI Toronto, Canada(进化人工智能(多伦多,加拿大))

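AI summary The paper argues that meta-uncertainty determines whether chain-of-thought (CoT) helps: when a model is highly unsure about the reliability of its own evidence, more reasoning lowers accuracy. It proves that a free-energy-minimizing policy under a heavy-tailed precision prior stops integrating cues after a finite number of high-validity ones and, under a Descending Dominance condition, coincides with the take-the-best heuristic. Experiments show that longer CoT lowers accuracy by 17.3 points on high-meta-uncertainty items.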

Comments 64 pages, 6 figures

AI中文摘要

链式思维(CoT)提升了大型语言模型在数学和符号推理中的表现。但在规划、有争议的伦理问题以及模型无法自我检查的任务中,更多推理反而使情况更糟。这两种效应均有文献记载;但一直缺少一个原则性的解释来说明哪种属性决定了结果。我们认为这是元不确定性:模型对其自身证据可靠性的不确定程度。当这种不确定性很高时,额外的推理不再增加信号,而是开始制造虚假的置信度。我们证明,在不确定精度下最小化期望自由能的策略,在精度先验为重尾分布时(定理2.6.1),会在有限数量的高有效性线索后停止整合线索,并且在递减优势条件下,该策略在样本层面上与“取最优”策略相同(定理2.7.4)。因此,快速节俭启发式和主动推理是同一计算的两种描述。预测是,在高元不确定性项目上,更长的CoT会降低准确率。我们按项目对区间进行评分(模拟-恢复rho > 0.96),构建了FEH-79基准(包含匹配对照的奈特框架),并在七个模型(五个开放权重3B-32B,两个前沿模型)、五种CoT长度和7,875个响应上进行了预注册研究。门槛(在数据前固定)要求负交互的后验概率高于0.95,准确率下降超过6个百分点。结果成立。高区间下降为17.3个百分点(95% CI [7.7, 25.5]);具有明确答案的匹配项目没有显示成本。该效应依赖于区间:在能力较强的中大型模型中显著,在两个前沿系统中具有方向性,在最弱的模型中缺失甚至反转。该框架回答了CoT何时有帮助,并统一了贝叶斯和快速节俭传统:少即是多的效应是关于元不确定性区间的证据,而非反对贝叶斯认知。

英文摘要

Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects are documented; what has been missing is a principled account of which property decides the outcome. We argue it is meta-uncertainty: how unsure the model is about the reliability of its own evidence. When that uncertainty is high, extra reasoning stops adding signal and starts manufacturing false confidence. We prove that the policy minimizing expected free energy under uncertain precision stops integrating cues after a finite number of high-validity ones when the precision prior is heavy-tailed (Theorem 2.6.1), and under a Descending Dominance condition, is sample-wise identical to take-the-best (Theorem 2.7.4). Fast-and-frugal heuristics and active inference are, then, two descriptions of the same computation. The prediction is that on high-meta-uncertainty items, longer CoT should degrade accuracy. We score the regime per item (simulate-and-recover rho > 0.96), build FEH-79, a benchmark of Knightian frames with matched controls, and run a pre-registered study across seven models (five open-weight 3B-32B, two frontier), five CoT lengths, and 7,875 responses. The gate, fixed before any data, required a negative interaction with posterior probability above 0.95 and an accuracy drop of more than 6 points. It held. The high-regime drop is 17.3 points (95% CI [7.7, 25.5]); matched items with definite answers show no cost. The effect is regime-dependent: decisive in capable mid-to-large models, directional in the two frontier systems, absent-to-reversed in the weakest. The framework answers when CoT helps and unifies the Bayesian and fast-and-frugal traditions: less-is-more effects are evidence about the meta-uncertainty regime, not against Bayesian cognition.
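
For readers unfamiliar with take-the-best, the classic heuristic inspects cues in descending order of validity and decides on the first cue that discriminates between the options. The sketch below shows only that heuristic with hypothetical cue values; it does not reproduce the paper's free-energy derivation.

```python
def take_the_best(cues_a, cues_b, validities):
    """Compare options A and B on binary cues, most valid cue first.
    Returns (choice, number_of_cues_inspected); choice is None if no cue discriminates."""
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for inspected, i in enumerate(order, start=1):
        if cues_a[i] != cues_b[i]:
            return ("A" if cues_a[i] > cues_b[i] else "B"), inspected
    return None, len(order)

# The most valid cue (0.9) ties, so the second most valid cue (0.8) decides.
choice, used = take_the_best([1, 0, 1], [1, 1, 0], validities=[0.9, 0.6, 0.8])
```

The remaining cue is never consulted, which is the "stop integrating early" behavior that the abstract connects to heavy-tailed precision priors.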

2606.15884 2026-06-16 cs.CL 新提交

Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

法律领域推理中大语言模型的神经元级分析

Eri Onami, Youmi Ma, Shuhei Kurita, Naoaki Okazaki

发表机构 * Institute of Science Tokyo(东京科学大学) NII(国立信息学研究所) AIST(产业技术综合研究所)

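AI summary Identifies and suppresses influential neurons via neuron attribution scores, finding both task-specific neurons and neurons shared across tasks; within the legal domain, neuron overlap is relatively high, and the distribution of influential neurons depends on input format and content.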

AI中文摘要

我们对LLM在法律领域推理中的神经元级分析进行了研究,并将其与七个开放权重模型中的其他应用领域任务进行了比较。通过使用神经元归因分数对影响神经元进行排序和抑制,我们证实抑制识别出的神经元会显著降低目标任务的准确率,而抑制相同数量的随机神经元则不会。我们进一步发现了一小部分对所有七个任务都有影响的神经元;一旦这些神经元被移除,抑制剩余神经元只会降低它们被识别出的任务的表现,从而揭示了每个研究模型中真正任务特异性的神经元。在法律领域内,三个基准测试表现出相对较高的神经元重叠,并且往往共同受到影响,这表明存在跨司法管辖区的法律组件神经元。我们实验中识别出的神经元分布表明,关于影响神经元集中在中间MLP层的假设可能取决于输入格式和内容,而非普遍现象。

英文摘要

We present a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we confirmed that suppressing the identified neurons collapses accuracy on the target task, whereas suppressing the same number of random neurons does not. We further found a small subset of neurons influential across all seven tasks; once these are removed, suppressing the remaining neurons degrades only the task they were identified from, revealing genuinely task-specific neurons in every model studied. Within the legal domain, the three benchmarks exhibit relatively high neuron overlap and tend to be affected jointly, suggesting the existence of legal-component neurons that span jurisdictions. The distribution of identified neurons in our experiments suggests that the hypothesis that influential neurons are concentrated in middle MLP layers may depend on the input format and content, rather than being a universal phenomenon.
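
The targeted-versus-random comparison can be sketched as below. The attribution values and activations are made up, and zeroing an activation is only one possible form of suppression; the paper's exact intervention is not stated in the abstract.

```python
import random

def top_k(attribution, k):
    """Indices of the k neurons with the highest attribution scores."""
    return sorted(range(len(attribution)), key=lambda i: attribution[i], reverse=True)[:k]

def suppress(activations, indices):
    """Return a copy of `activations` with the chosen neurons set to zero."""
    chosen = set(indices)
    return [0.0 if i in chosen else a for i, a in enumerate(activations)]

attribution = [0.1, 0.9, 0.05, 0.7]   # hypothetical attribution scores
activations = [1.0, 2.0, 3.0, 4.0]    # hypothetical neuron activations

targeted = suppress(activations, top_k(attribution, 2))
rng = random.Random(0)
control = suppress(activations, rng.sample(range(len(activations)), 2))
```

Matching the number of suppressed neurons in the control condition is what lets an accuracy drop be attributed to which neurons were removed rather than how many.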

2606.15893 2026-06-16 cs.CL 新提交

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

BALTO: 用于幻觉缓解的平衡令牌级策略优化

Ning Li, Zixuan Guo, Yan Xu, Wenbo Fei, Yifan Niu, Chang Luo, Yasheng Wang, Weiwen Liu, Yong Yu, Weinan Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Tencent(腾讯) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

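AI summary To mitigate LLM hallucination, proposes BALTO, which extracts checkable factual claims, projects claim-level judgments to token-level labels, and introduces a balanced credit assignment mechanism; it achieves the highest faithfulness in all six model-benchmark settings and outperforms existing post-training baselines.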

AI中文摘要

幻觉仍然是阻碍大语言模型在知识密集型环境中部署的主要障碍,在这些环境中,生成的响应必须忠实地基于所提供的证据。强化学习是缓解幻觉的一个有前景的方向,但响应级忠实度奖励存在粒度不匹配问题:局部幻觉可能导致受支持的内容受到虚假惩罚。尽管最近的工作引入了细粒度反馈,如声明级验证和令牌级奖励,但不平衡的信用分配仍可能引发长度、冗长或优化噪声偏差。我们提出了BALTO,一种用于幻觉缓解的平衡令牌级策略优化框架。BALTO提取可核查的事实声明,根据参考上下文对其进行验证,并将声明级判断投影到令牌级标签。该框架引入了一种平衡的令牌级信用分配机制。这种设计将概率质量从未受支持的内容重新分配到忠实内容,而不是抑制整个响应。我们从理论角度系统分析了响应级奖励的局限性,并证明了BALTO在幻觉缓解的训练稳定性和优化效率方面的优势。在ConFiQA、RAGTruth和FinLLM-Eval上的实验表明,BALTO在所有六个模型-基准设置中实现了最高的忠实度,并且在Q-Score上持续优于现有的后训练基线,展示了更强的忠实度-信息量权衡。

英文摘要

Hallucinations remain a major obstacle to deploying large language models (LLMs) in knowledge-intensive settings, where generated responses must be faithfully grounded in provided evidence. Reinforcement learning (RL) is a promising direction for hallucination mitigation, but response-level faithfulness rewards suffer from a granularity mismatch: localized hallucinations can cause supported content to receive spurious penalties. Although recent work introduces fine-grained feedback such as claim-level verification and token-level rewards, unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. We propose BALTO, a Balanced Token-level Policy Optimization framework for hallucination mitigation. BALTO extracts checkable factual claims, verifies them against the reference context, and projects claim-level judgments to token-level labels. A balanced token-level credit assignment mechanism is introduced into the framework. This design redistributes probability mass from unsupported content toward faithful content, rather than suppressing the entire response. We systematically analyze the limitations of response-level rewards from a theoretical standpoint, and prove BALTO's advantages in training stability and optimization efficiency for hallucination mitigation. Experiments on ConFiQA, RAGTruth, and FinLLM-Eval show that BALTO achieves the highest faithfulness across all six model--benchmark settings and consistently outperforms existing post-training baselines in Q-Score, demonstrating a stronger faithfulness--informativeness trade-off.
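
The projection from claim-level judgments to token-level labels can be sketched with character offsets. The label scheme (+1, -1, 0) and the rule that an unsupported claim wins on overlap are assumptions; the balanced credit assignment mechanism itself is not described in enough detail in the abstract to reproduce here.

```python
def project_claims_to_tokens(token_spans, claims):
    """token_spans: (start, end) character offsets, one per token.
    claims: (start, end, supported) for each verified claim.
    Returns one label per token: +1 supported, -1 unsupported, 0 outside any claim."""
    labels = []
    for t_start, t_end in token_spans:
        label = 0
        for c_start, c_end, supported in claims:
            overlaps = t_start < c_end and c_start < t_end
            if overlaps and not supported:
                label = -1
                break
            if overlaps:
                label = 1
        labels.append(label)
    return labels

# Hypothetical response with five tokens and two verified claims.
token_spans = [(0, 5), (6, 9), (10, 15), (16, 20), (21, 22)]
claims = [(0, 9, True), (10, 20, False)]
labels = project_claims_to_tokens(token_spans, claims)
```

With labels at this granularity, a penalty can be confined to the tokens of the unsupported claim instead of the whole response.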

2606.15972 2026-06-16 cs.CL cs.AI cs.LG 新提交

Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

一次形式化,其余编辑:基于Lean的高效数学推理答案选择

Ji Feng, Zhouxing Shi

发表机构 * University of California, Riverside(加州大学河滨分校)

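AI summary Proposes BASE, a pipeline that formalizes one candidate answer and edits it to obtain the rest, cutting autoformalizer calls by about 5x at K=8 while improving selection accuracy.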

Comments 15 pages, 1 figure. Code available at https://github.com/ucr-rai/base-and-edit

AI中文摘要

随着大型语言模型(LLMs)越来越多地应用于数学推理,形式化证明助手(如Lean)可用于以机器可检查的严谨性验证推理输出,从而支持在测试时扩展中从K个采样候选答案中进行答案选择等用例。然而,使用Lean要求LLM的输出(最初为自然语言)首先被形式化。现有的基于Lean的答案选择工作使用自动形式化模型为每个候选答案独立生成一个Lean形式化语句,这带来了显著的计算成本。我们提出BASE,一个基础-编辑流水线,它为每个问题形式化一个基础候选答案,并通过就地编辑答案表达式来推导出其余K-1个语句。为此,我们训练了一个重写器模型LEANSCRIBE,用于定位基础形式化中的答案,并为其他K-1个候选答案生成可重用的编辑函数。BASE同时提高了选择准确性并降低了形式化成本——这是一个帕累托改进,在四个基准测试和三个求解器上的所有12个(数据集,求解器)配置中均成立,在K=8时自动形式化器调用减少约5倍,且随着K增长,减少幅度预计会更大。代码可在https://github.com/ucr-rai/base-and-edit获取。

英文摘要

With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer selection in test-time scaling with K sampled candidate answers. However, employing Lean requires that LLM outputs, originally in natural language, first be formalized. Existing Lean-based answer-selection work uses an autoformalization model to generate a formal statement in Lean for each candidate answer independently, incurring a significant computational cost. We propose BASE, a base-and-edit pipeline that formalizes a single base candidate per problem and derives the remaining K-1 statements by editing the answer expression in place. To facilitate this, we train a rewriter model LEANSCRIBE to localize the answer in the base formalization and generate a reusable edit function for the other K-1 candidates. BASE simultaneously improves selection accuracy and reduces formalization cost - a Pareto improvement that holds on all 12 (dataset, solver) configurations across four benchmarks and three solvers, cutting autoformalizer calls by about 5x at K=8, with the reduction expected to become larger as K grows. Code is available at https://github.com/ucr-rai/base-and-edit.
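
The base-and-edit idea can be illustrated with plain string manipulation: formalize one candidate, locate its answer expression, and reuse the surrounding statement for the other candidates. In the paper a trained rewriter (LEANSCRIBE) does the localization; the exact-substring match, the function name, and the toy Lean statement below are stand-in assumptions.

```python
def make_edit_function(base_statement, base_answer):
    """Locate the answer once, then return a function that swaps it in place."""
    if base_statement.count(base_answer) != 1:
        raise ValueError("the answer must occur exactly once in the base statement")
    prefix, suffix = base_statement.split(base_answer)
    return lambda candidate: prefix + candidate + suffix

# One autoformalized base statement (hypothetical); the proof is left open with `sorry`.
base = "theorem candidate : 12 * 12 = 144 := by sorry"
edit = make_edit_function(base, "144")

# The remaining candidates reuse the same formalization with a new answer expression.
statements = [edit(c) for c in ["124", "142"]]
```

Each edited statement would then be checked in Lean without another autoformalization call.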

2606.16011 2026-06-16 cs.CL 新提交

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

谁在翻盘?自我与跨模型反论证揭示LLM中的答案不稳定性

Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) LMU Munich(慕尼黑大学)

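AI summary Proposes a protocol for evaluating LLM answer stability that challenges correct answers with counterarguments; flip rates range from 17.5% to 97.3%, and self-attribution and cross-model arguments markedly affect stability.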

Comments Accepted to the non-archival workshops AI4Good and AIWILD at ICML 2026

AI中文摘要

标准准确率基准测试旨在测试大语言模型(LLM)接近正确答案的程度,但不适合测试当答案受到合理反论证挑战时,LLM是否坚持正确回答。我们引入了一种评估答案稳定性的受控协议:在模型正确回答多项选择题后,我们用针对错误选项的连贯论证挑战模型的答案,并测量模型是否翻转。该设置a)将论证内容与公开的社会压力隔离,b)改变论证长度、自我归因和跨模型来源。在七个前沿模型和57个MMLU主题上,翻转率从17.5%到97.3%不等,揭示了仅靠准确率指标无法捕捉到的稳定性巨大差异。我们发现自我归因始终增加翻转率(平均+7.1个百分点,最高+18.7个百分点)。此外,跨模型汇集错误答案论证,并为每个问题选择最有效的论证,比依赖任何单一源模型产生更强的对抗性挑战。我们进一步构建了MaxFlip,一个精心策划的挑战集,相比标准自我生成挑战,翻转率提升高达+23.6个百分点。我们发布了协议、挑战记录和MaxFlip,以支持在标准准确率基准测试之外进行稳定性评估。材料可在https://github.com/nafisenik/WhoFlips和https://hf.co/datasets/nafisehNik/WhoFlips获取。

英文摘要

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
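
The flip rate is computed only over questions the model first answered correctly. A minimal sketch, with hypothetical record fields that may not match the released challenge records:

```python
def flip_rate(records):
    """Fraction of initially correct answers that change to a wrong option after the challenge."""
    eligible = [r for r in records if r["initial"] == r["gold"]]
    if not eligible:
        return 0.0
    flipped = sum(r["after_challenge"] != r["gold"] for r in eligible)
    return flipped / len(eligible)

records = [
    {"gold": "A", "initial": "A", "after_challenge": "A"},
    {"gold": "B", "initial": "B", "after_challenge": "C"},  # flipped
    {"gold": "C", "initial": "C", "after_challenge": "C"},
    {"gold": "D", "initial": "D", "after_challenge": "D"},
    {"gold": "A", "initial": "B", "after_challenge": "B"},  # initially wrong, not eligible
]
rate = flip_rate(records)
```

Two models with the same accuracy can therefore have very different flip rates, which is the gap the protocol is designed to expose.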

2606.16093 2026-06-16 cs.CL cs.AI 新提交

Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

基于可学习混合的GSS-Transformer混合架构的长上下文建模

Kuzey Torlak, Hüseyin Arda Arslan, Anıl Dervişoğlu, Beyza Nur Deniz, Onur Boyar

发表机构 * Kadıköy Anadolu High School(卡德柯伊安纳多卢高中) Politecnico di Torino(都灵理工大学) Istanbul Technical University(伊斯坦布尔理工大学) Boğaziçi University(博阿齐奇大学) IBM Research - Tokyo(IBM 东京研究院)

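AI summary Proposes the Parallel Hybrid Architecture (PHA), which fuses GSS, GQA, and FFN branches through a learnable mixing mechanism, achieving Transformer-level perplexity with higher efficiency in long-context modeling.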

Comments 16 pages, 9 tables, 4 figures

AI中文摘要

建模长距离依赖仍然是自然语言处理中的核心挑战。Transformer架构通过自注意力实现强性能,但计算复杂度随序列长度呈二次方增长($O(N^2)$),而状态空间模型(SSM)线性扩展($O(N)$)但存在选择性召回瓶颈,难以从压缩状态中检索精确信息。这导致了效率与困惑度之间的基本权衡。为应对这些挑战,我们提出了并行混合架构(PHA),它将门控状态空间(GSS)、分组查询注意力(GQA)和前馈网络(FFN)作为独立的并行分支运行,并通过可学习混合机制融合。PHA不强制SSM近似注意力或将两种范式串行化,而是让每个分支专门化:GSS捕获全局上下文,注意力执行选择性检索,FFN提供补充处理。在WikiText-103上,PHA在125M参数下达到16.51 PPL,优于Hedgehog(16.70)和H3-125M(23.70)。扩展到180M参数得到16.42 PPL,与纯注意力基线结果相当,同时在长上下文下吞吐量提高24%,内存使用降低40%。在OpenWebText上,我们的125M模型达到19.72 PPL,优于标准Transformer(20.60)和GSS混合基线(19.80)。这些结果表明,将序列建模范式分离为并行专家,能够在长上下文语言建模中实现Transformer级困惑度,同时显著提升效率。

英文摘要

Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while State Space Models (SSMs) scale linearly ($O(N)$) but suffer from a selective recall bottleneck, struggling to retrieve precise information from compressed states. This creates a fundamental tradeoff between efficiency and perplexity. To tackle these challenges, we propose the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Instead of forcing SSMs to approximate attention or serializing the two paradigms, PHA allows each branch to specialize: GSS captures global context, while attention performs selective retrieval, with FFN providing complementary processing. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming Hedgehog (16.70) and H3-125M (23.70). Scaling to 180M parameters yields 16.42 PPL, which is comparable to the pure attention baseline while delivering 24% higher throughput and up to 40% lower memory usage at long contexts. On OpenWebText, our 125M model achieves 19.72 PPL, outperforming standard Transformers (20.60) and GSS hybrid baselines (19.80). These results demonstrate that separating sequence modeling paradigms into parallel specialists enables Transformer-level perplexity with substantially improved efficiency for long-context language modeling.
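
Fusing parallel branches with learnable mixing can be sketched as a weighted sum whose weights come from trainable logits. Softmax-normalized scalar weights are an assumption here; the paper's mixing mechanism may be per-channel, input-dependent, or otherwise different.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix(branch_outputs, mixing_logits):
    """Weighted sum of branch outputs; the weights are a softmax over learnable logits."""
    weights = softmax(mixing_logits)
    dim = len(branch_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, branch_outputs)) for d in range(dim)]

# Hypothetical per-token outputs of the GSS, attention, and FFN branches.
gss, attn, ffn = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
fused = mix([gss, attn, ffn], mixing_logits=[0.0, 0.0, 0.0])
```

With equal logits every branch contributes one third; training would move the logits so that each branch is weighted by how useful it is.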

2606.16240 2026-06-16 cs.CL cs.LG 新提交

Creative Collision: Directorial Persona Steering and Competition in Large Language Models

创意碰撞:大型语言模型中的导演人格引导与竞争

Subramanyam Sahoo, Justin Shenk

发表机构 * AI Safety Camp(AI安全训练营)

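AI summary Studies steering language model generation by superimposing two semantically opposing directorial persona vectors (Spielberg and Scorsese), finding that the Spielberg vector dominates moral valence, intermediate mixing points improve coherence, and both personas localize to the same layer, suggesting a shared moral-tone substrate.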

Comments Accepted at ICML 2026 Workshop on Human-AI Co-Creativity

AI中文摘要

激活引导已成为在推理时塑造大型语言模型行为的强大工具,但以往大多数工作向残差流注入单一的语义方向。我们研究了两种语义相反的引导向量叠加的丰富场景——我们称之为“创意碰撞”。具体而言,我们通过在精心策划的剧本语料库上进行均值差异激活对比,构建了史蒂文·斯皮尔伯格(乐观、救赎的道德价值)和马丁·斯科塞斯(黑暗、道德模糊)的导演人格向量,然后通过标量混合参数$α\in[0,1]$和引导系数$λ$在两者之间进行插值。在五个评估轴(道德价值、生成连贯性、表面风格、方向主导性和向量几何)上,出现了三个主要发现:(i)斯皮尔伯格的表征特征表现出稳健的“方向主导性”,在几乎整个插值范围内抑制了斯科塞斯的道德影响;(ii)中间碰撞点在高$λ$下相对于纯单导演引导反而提高了生成连贯性;(iii)两种人格在40层仅解码器Transformer的第28层达到最大定位,揭示了一个共享的“道德基调基底”。这些结果阐明了Transformer残差流中竞争语义方向的几何结构,并对可控创意生成和价值对齐叙事合成具有直接影响。

英文摘要

Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a single semantic direction into the residual stream. We study the richer setting in which two semantically opposing steering vectors are superimposed -- a regime we call Creative Collision. Concretely, we construct directorial persona vectors for Steven Spielberg (optimistic, redemptive moral valence) and Martin Scorsese (dark, morally ambiguous) via mean-difference activation contrast on curated screenplay-derived corpora, then interpolate between them with a scalar mixing parameter $\alpha \in [0,1]$ and a steering coefficient $\lambda$. Across five evaluation axes -- moral valence, generation coherence, surface style, directional dominance, and vector geometry -- three principal findings emerge: (i) Spielberg's representational signature exhibits robust directional dominance, suppressing Scorsese's moral influence across almost the entire interpolation range; (ii) intermediate collision points paradoxically improve generation coherence relative to pure single-director steering at high $\lambda$; and (iii) both personas localise maximally to layer 28 of a 40-layer decoder-only transformer, revealing a shared moral-tone substrate. These results illuminate the geometry of competing semantic directions in transformer residual streams and have direct implications for controllable creative generation and value-aligned narrative synthesis.
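
The construction described above can be sketched in two steps: a mean-difference vector per persona, then a blended vector added to the residual stream. The convex combination `alpha * v_a + (1 - alpha) * v_b`, the choice of which persona `alpha` weights, and all numbers below are assumptions for illustration.

```python
def mean_difference(persona_acts, neutral_acts):
    """Steering vector = mean persona activation minus mean neutral activation."""
    dim = len(persona_acts[0])
    mean_p = [sum(a[d] for a in persona_acts) / len(persona_acts) for d in range(dim)]
    mean_n = [sum(a[d] for a in neutral_acts) / len(neutral_acts) for d in range(dim)]
    return [p - n for p, n in zip(mean_p, mean_n)]

def steer(hidden, v_a, v_b, alpha, lam):
    """Add lam * (alpha * v_a + (1 - alpha) * v_b) to a residual-stream vector."""
    return [h + lam * (alpha * a + (1 - alpha) * b) for h, a, b in zip(hidden, v_a, v_b)]

v_example = mean_difference([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [2.0, 1.0]])

# Two hypothetical opposing persona vectors blended at the midpoint.
v_a, v_b = [1.0, 0.0], [-1.0, 0.5]
steered = steer([0.0, 0.0], v_a, v_b, alpha=0.5, lam=2.0)
```

In this toy case the opposing first components cancel at the midpoint, leaving only the direction the two vectors do not share.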

2606.16360 2026-06-16 cs.CL cs.AI 新提交

Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate

Tyler: 语言模型的类型化潜在推理——何时思考、计算什么以及分配多少

Hanyu Lin, Min Cai, Jiawei Wen, Haodi Zhang

发表机构 * Shenzhen University(深圳大学) University of Alberta(阿尔伯塔大学)

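AI summary Proposes Tyler, a framework with typed latent reasoning modules and a budget-aware policy that chooses at each autoregressive decoding step between emitting text and running latent computation, improving accuracy by up to 14.49 points over CoT with the lowest forgetting.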

Comments website: https://typed-latent-reasoning.github.io

AI中文摘要

链式思维(CoT)提示通过将中间计算外化为离散文本标记来改进大型语言模型(LLM)的推理能力,但这种文本接口也引入了冗余和推理开销。潜在推理通过在连续表示中执行部分计算提供了一种有前景的替代方案。然而,现有方法通常预定义潜在计算何时被调用以及如何在解码过程中分配,留下一个关键问题未解决:何时调用潜在计算、执行何种类型的计算以及分配多少预算。我们提出Typed Latent Reasoning(Tyler),一个用于自回归解码过程中潜在推理的类型化和预算感知框架。Tyler学习一个策略,在每个解码步骤中,选择发射一个文本标记或切换到专门用于特定推理功能的潜在计算模块。一旦被调用,一个算子将当前推理状态映射为支持全局规划、局部状态更新或可重用过程抽象的潜在标记。在三个骨干LLM上的广泛实验中,Tyler相比CoT提高了最多14.49个百分点的准确率,相比最强的竞争基线提高了最多4.30个百分点。它进一步在多种推理领域上泛化,并以最低的遗忘实现了最佳的最后阶段性能。

英文摘要

Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead. Latent reasoning offers a promising alternative by carrying part of the computation in continuous representations. However, existing methods typically predefine when latent computation is invoked and how it is allocated during decoding, leaving a key problem unresolved: when to invoke latent computation, what type of computation to perform, and how much budget to allocate. We propose Typed Latent Reasoning (Tyler), a typed and budget-aware framework for latent reasoning during autoregressive decoding. Tyler learns a policy that, at each decoding step, chooses between emitting a text token and switching to a latent computation module specialized for a particular reasoning function. Once invoked, an operator maps the current reasoning state into latent tokens that support global planning, local state updates, or reusable procedural abstraction. Across extensive experiments on three backbone LLMs, Tyler improves accuracy by up to 14.49 points over CoT and by up to 4.30 points over the strongest competing baseline. It further generalizes across diverse reasoning domains and achieves the best final-stage performance with the lowest forgetting.

2606.16545 2026-06-16 cs.CL 新提交

Can LLM Coding Agents Reason About Time Series?

LLM 编码智能体能否推理时间序列?

Filip Rechtorík, Ondřej Dušek, Zdeněk Kasner

发表机构 * Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University(查理大学数学与物理学院形式与应用语言学研究所)

AI总结 研究 LLM 编码智能体在时间序列分析中的能力,发现代码访问可提升性能达 10%,但仍有 22-34% 错误,并分析了推理差距。

Comments 17 pages, 7 figures

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于金融、医疗或环境监测中的自动决策系统。时间序列数据在这些领域中无处不在,但难以自动处理。时间序列能否由 LLM 智能体分析?我们考察了三种方法:向智能体提供原始数值数据、将 LLM 用作编码智能体、或两者结合。在编码智能体设置中,模型使用 Python 代码迭代查询数据。利用两个时间序列理解基准,我们表明具有代码访问权限的智能体可以比处理原始数据的模型性能提升高达 10%。然而,即使性能最好的智能体仍然错误地回答约 22-34% 的问题。为了深入了解模型的策略和推理差距,我们使用强大的 LLM 评判器分析模型输出。我们的分析揭示,编码智能体能够选择适当的统计检验,但常常错过重要的细微差别。同时,具有原始数据访问权限的模型可以通过粗略计算得出正确结论。

英文摘要

Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.

2606.16576 2026-06-16 cs.CL 新提交

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

LLM智能体能否推断世界模型?来自智能体自动机学习的证据

Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, Gabriel Stanovsky

发表机构 * The Hebrew University of Jerusalem(耶路撒冷希伯来大学) New York University(纽约大学) Google Research(谷歌研究)

AI总结 提出智能体自动机学习框架,通过成员查询和等价查询评估LLM智能体发现隐藏确定性有限自动机的能力,发现性能随DFA规模增加而急剧下降,推理模型优于非推理模型但仍存在规划、整合和假设构建缺陷。

详情
AI中文摘要

我们提出智能体自动机学习,以评估调用工具的LLM智能体通过交互发现隐藏环境的程度。在我们的设置中,智能体应通过与预言机的交互来发现隐藏的确定性有限自动机(DFA),交互方式包括(1)成员查询(“该字符串是否属于目标语言?”)和(2)等价查询(“这是目标DFA吗?”)。这产生了一个可扩展的测试平台,具有可控的任务复杂度、可测量的交互效率以及强基线(经典自动机学习算法)。评估最先进的LLM,我们发现性能随着DFA规模增加而急剧下降。推理模型明显强于非推理模型,但轨迹分析揭示了查询规划、证据整合和假设构建中的反复失败。总体而言,我们的结果表明,当前的LLM智能体有时可以执行非平凡的交互式发现,但在此任务上远不如经典算法稳健和高效。

英文摘要

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
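
The two query types described above can be sketched as a small oracle interface. This is only an illustration under stated assumptions, not the paper's testbed: the `DFA` and `Oracle` classes, the example target language, and the bounded-enumeration equivalence check (words up to `max_len`) are all hypothetical choices made for this sketch.

```python
from itertools import product


class DFA:
    """Deterministic finite automaton with a total transition function."""

    def __init__(self, start, accepting, transitions):
        self.start = start
        self.accepting = set(accepting)
        self.transitions = transitions  # maps (state, symbol) -> next state

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Hypothetical teacher that answers both query types and counts them."""

    def __init__(self, target, alphabet, max_len=6):
        self.target = target
        self.alphabet = alphabet
        self.max_len = max_len
        self.membership_calls = 0
        self.equivalence_calls = 0

    def membership(self, word):
        # "Does this string belong to the target language?"
        self.membership_calls += 1
        return self.target.accepts(word)

    def equivalence(self, hypothesis):
        # "Is this the target DFA?" Approximated here (an assumption) by
        # comparing both automata on all words up to max_len, shortest first.
        self.equivalence_calls += 1
        for length in range(self.max_len + 1):
            for word in product(self.alphabet, repeat=length):
                if self.target.accepts(word) != hypothesis.accepts(word):
                    return False, "".join(word)  # counterexample
        return True, None


# Hidden target: words over {a, b} containing an even number of 'a'.
target = DFA(0, {0}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})
# A deliberately wrong first hypothesis that accepts every word.
accept_all = DFA(0, {0}, {(0, "a"): 0, (0, "b"): 0})
oracle = Oracle(target, "ab")
```

Counting the calls to `membership` and `equivalence` gives one simple way to compare interaction efficiency between an agent and a classic learner on the same hidden automaton.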

2606.16629 2026-06-16 cs.CL 新提交

Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

伊斯兰大语言模型:从知识获取到可信且抗幻觉的人工智能

Mohammed Amine Mouhoub

发表机构 * Paris Dauphine University(巴黎多芬纳大学)

AI总结 综述伊斯兰大语言模型领域,提出构建可信、抗幻觉的伊斯兰AI需结合阿拉伯语NLP、检索增强生成、教法学派推理及专家评估等关键技术。

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于知识密集型问答,包括宗教和法律问题。伊斯兰知识是一个特别苛刻的场景:答案应基于权威来源,引用必须精确,阿拉伯语变体与古典来源语言差异显著,且必须呈现合法的教法学派分歧而非合并为单一答案。本综述回顾了伊斯兰LLMs和可信伊斯兰AI的新兴领域。我们围绕阿拉伯语NLP和以阿拉伯语为中心的LLMs、伊斯兰NLP资源、古兰经问答、伊斯兰知识基准、检索增强生成、伊斯兰法律推理、继承推理、幻觉评估和可信度组织文献。我们认为,阿拉伯语流利度不足以实现伊斯兰AI。可靠系统需要策划来源、检索和验证模块、引用感知生成、教法学派感知推理、人类专家评估以及不仅衡量答案准确性还衡量忠实度、来源有效性和推理质量的基准。本综述以抗幻觉伊斯兰AI系统的研究议程结束。

英文摘要

Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is a particularly demanding setting: answers are expected to be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems require curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey concludes with a research agenda for hallucination-resistant Islamic AI systems.

2606.16825 2026-06-16 cs.CL cs.AI cs.LG 新提交

Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

循环绑定——混合专家语言模型中的专家层绑定

Martin Jaggi

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 提出专家绑定方法,通过共享连续Transformer层的专家参数,在保持独立路由和注意力的同时,将MoE模型内存占用降低近2倍,且不损失困惑度或下游性能。

Comments Code available at https://github.com/epfml/looped-moe

详情
AI中文摘要

混合专家(MoE)架构通过每个令牌仅激活一小部分专家来高效扩展大型语言模型(LLM),但全部参数计数——主要由专家参数主导——必须保留在训练和推理内存中。为了解决这个问题,我们引入了专家绑定(Expert Tying),这是一种架构修改,它在连续Transformer层之间共享专家参数,同时保留独立的逐层路由和注意力。我们在常见的先进架构上评估了这种方法,包括OLMoE、Qwen3和DeepSeek风格的MoE。我们的预训练实验表明,绑定专家可以将内存占用减少近2倍,而几乎不降低困惑度或下游质量。通过利用MoE路径中固有的参数冗余,我们的方法提供了高度有利的计算-内存权衡,推动了下一代LLM的高效训练和扩展。

英文摘要

Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count - dominated by the expert parameters - must be held in training and inference memory. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce memory footprint by almost 2x at virtually no degradation in perplexity or downstream quality. By exploiting the parameter redundancy inherent in MoE pathways, our method provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs.
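
The memory argument can be made concrete with a rough parameter count. The sketch below assumes that each group of `tie_group` consecutive layers shares one bank of experts while routers and attention stay per-layer. The function name and all sizes are hypothetical and are not taken from the paper.

```python
def moe_param_count(n_layers, n_experts, expert_params,
                    router_params, attn_params, tie_group=1):
    """Total parameters when every `tie_group` consecutive layers share experts.

    tie_group=1 is the standard untied MoE; tie_group=2 ties layer pairs.
    """
    n_banks = -(-n_layers // tie_group)  # ceiling division
    shared_experts = n_banks * n_experts * expert_params
    per_layer = n_layers * (router_params + attn_params)  # never shared
    return shared_experts + per_layer


# Hypothetical sizes, chosen only to illustrate an expert-dominated model.
untied = moe_param_count(16, 64, 1_000_000, 10_000, 4_000_000, tie_group=1)
tied = moe_param_count(16, 64, 1_000_000, 10_000, 4_000_000, tie_group=2)
ratio = untied / tied
```

Because the per-layer routers and attention blocks are not shared, the ratio stays slightly below 2 even when experts dominate the count, which is consistent with the "almost 2x" wording in the abstract.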

2606.16847 2026-06-16 cs.CL cs.AI 新提交

Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

遵循潜在路径:利用锚定令牌导航扩散LLM的可撤销解码

Yizhen Yao, Qinglin Zhu, Runcong Zhao, Xiangxiang Dai, Yanzheng Xiang, Yulan He, Lin Gui

发表机构 * King's College London(伦敦国王学院) The Chinese University of Hong Kong(香港中文大学) The Alan Turing Institute, UK(英国艾伦·图灵研究所)

AI总结 针对扩散大语言模型解码速度与质量的权衡,提出无训练框架ASRD,通过锚定令牌解耦上下文,结合锚定引导生成与锚定扰动验证,在数学和编码基准上提升准确率6.4%,加速推理7.2倍。

详情
AI中文摘要

扩散大语言模型(dLLMs)为并行生成提供了有前景的途径,但面临解码速度与质量之间的权衡。虽然可撤销解码策略尝试通过验证和重新掩码来减轻错误,但它们通常在混合质量上下文中操作。这导致两个关键失败:错误传播,即新令牌从错误上下文中吸收有毒信息;以及局部错误强化,即错误相互强化以逃避检测。为缓解这些挑战,我们提出ASRD(锚定监督可撤销解码),一种在嵌入空间内运行的无训练框架。ASRD明确将解码上下文解耦为通过时间一致性识别的可信锚定令牌和不确定候选令牌。利用动态锚定令牌缓存,我们引入两种互补机制:(1)锚定引导生成,将熵加权锚定信号注入掩码位置,以隐式地将注意力引导向可靠的全局骨架;(2)锚定扰动验证,对不确定候选令牌施加正交扰动,破坏并重新掩码由脆弱局部共识驱动的错误。在数学和编码基准上的大量实验表明,ASRD优于最近的重新掩码基线,准确率提升高达6.4%,同时推理吞吐量加速高达7.2倍。

英文摘要

Diffusion Large Language Models (dLLMs) offer a promising avenue for parallel generation but face a trade-off between decoding speed and quality. While revocable decoding strategies attempt to mitigate errors by verifying and remasking tokens, they typically operate within a mixed-quality context. This leads to two critical failures: Error Propagation, where new tokens absorb toxic information from erroneous context, and Local Error Reinforcement, where errors mutually reinforce each other to evade detection. To alleviate these challenges, we propose ASRD (Anchor Supervised Revocable Decoding), a training-free framework that operates within the embedding space. ASRD explicitly decouples the decoding context into trusted Anchor Tokens, which are identified via temporal consistency, and uncertain candidates. Leveraging a dynamic Anchor Tokens Cache, we introduce two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly rectify attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing and remasking errors driven by fragile local consensus. Extensive experiments on math and coding benchmarks demonstrate that ASRD outperforms recent remasking baselines, achieving accuracy improvements of up to 6.4% while accelerating inference throughput by up to 7.2x.

2606.16905 2026-06-16 cs.CL 新提交

Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

说科学的语言:面向自然科学的通用生成基础模型

Mingyang Li, Yurou Liu, Jieping Ye, Bing Su, Ji-Rong Wen, Zheng Wang

发表机构 * Alibaba Group(阿里巴巴集团) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 提出LOGOS模型,通过统一科学语法将异构任务转化为自回归框架中的下一个词预测,在多个科学任务上匹配或超越领域专用基线,验证了“一模型适用于所有”的可行性。

详情
AI中文摘要

在本报告中,我们提出了LOGOS(科学中的生成对象语言),一个科学生成语言模型,它基于共享的科学语法,在单一自回归框架内统一了自然科学中的异构任务。它将不同的科学对象及其空间交互编码为公共词汇表上的令牌序列。通过将空间接触和约束模式表示为离散令牌,该模型以纯序列方式捕获复杂的结构交互,而不依赖显式坐标或几何神经网络。这种统一表示使得广泛的下游任务能够一致地表述为同一语法空间中的下一个词预测,从而在持续的多领域预训练和下游目标之间建立强对齐。在多种任务中,LOGOS始终匹配或超越领域专用基线,为自然科学中“一模型适用于所有”的可行性提供了初步证据。我们训练了不同规模(1B、3B和8B参数)的LOGOS模型,并发现模型大小与性能之间存在一致的正相关关系。这表明,未来的人工智能科学(AI4S)可能不在于构建独立于大型语言模型(LLM)的技术栈,而在于通过共享架构、共享训练范式和共享推理基础设施,将科学基础模型与LLM深度对齐,从而使LLM真正成为AI4S的新入口。我们发布了模型权重和相关资源以促进进一步研究。

英文摘要

In this report, we present LOGOS (Language Of Generative Objects in Science), a scientific generative language model that unifies heterogeneous tasks across the natural sciences within a single autoregressive framework based on a shared scientific grammar. It encodes diverse scientific objects and their spatial interactions as token sequences over a common vocabulary. By representing spatial contact and constraint patterns as discrete tokens, the model captures complex structural interactions in a purely sequential manner, without relying on explicit coordinates or geometric neural networks. This unified representation enables a wide range of downstream tasks to be formulated consistently as next-token prediction in the same grammar space, creating strong alignment between continued multi-domain pre-training and downstream objectives. Across diverse tasks, LOGOS consistently matches or outperforms domain-specific baselines, providing preliminary evidence for the feasibility of "one model fits all" in the natural sciences. We train LOGOS models at different scales (1B, 3B, and 8B parameters) and find a consistent positive correlation between model size and performance. This suggests that the future of AI for Science (AI4S) may not lie in building an independent technical stack that is separated from large language models (LLMs). Instead, it may depend on deeply aligning scientific foundation models with LLMs through shared architectures, shared training paradigms, and shared inference infrastructure, so that LLMs can truly become a new entry point for AI4S. We release the model weights and associated resources to facilitate further research.

2606.16908 2026-06-16 cs.CL 新提交

LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

LESS Is More: 扩散语言模型的互稳定采样

Amr Mohamed, Guokan Shang, Michalis Vazirgiannis

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) Ecole Polytechnique(巴黎综合理工学院)

AI总结 针对扩散语言模型固定步数采样效率低的问题,提出无训练的自适应采样器LESS,通过互稳定规则动态决定掩码位置何时解码,在7个基准上平均准确率提升且步数减少72.1%。

详情
AI中文摘要

扩散大语言模型(dLLMs)通过迭代精炼掩码序列,支持并行令牌更新和双向条件化,为自回归解码提供了一种有前景的替代方案。然而,其实际效率受到采样过程的限制:该过程执行在解码前就已选定的固定数量的反向去噪步骤,将计算花费在已经稳定的位置上,有时又过早地提交不稳定的位置。我们提出LESS,一种无需训练、模型无关的自适应采样器,将令牌提交视为在线停止问题。LESS通过联合稳定性规则实现互稳定采样:仅当其top-1预测具有高置信度、其top-1令牌在最近的反向步骤中持续出现、且其预测分布在top-K步间Jensen-Shannon散度下稳定时,掩码位置才符合解码条件。我们在Dream-7B、LLaDA-8B和LLaDA-1.5-8B上评估LESS,涵盖全序列扩散和半自回归块采样模式,跨越七个涵盖通用知识、数学和代码的基准。LESS相比强大的无训练自适应采样器提高了平均准确率,同时比固定预算解码减少了72.1%的反向步骤。由于每个反向步骤需要一次Transformer前向传播,这些步数减少转化为更少的前向评估、更低的实测墙钟延迟和更低的估计推理计算量。

英文摘要

Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present LESS, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. LESS implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-K inter-step Jensen-Shannon divergence. We evaluate LESS on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. LESS improves average accuracy over strong training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.
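
The three-part eligibility rule in the abstract (confidence, persistence, distributional stability) can be sketched as follows. This is a rough illustration, not the authors' implementation: the thresholds, the window length `persist`, the renormalized top-K truncation, and the function names are all assumptions made for this sketch.

```python
import math


def top_k(dist, k):
    """Keep the k most probable tokens and renormalize (an assumed truncation)."""
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in items)
    return {tok: prob / total for tok, prob in items}


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in keys}

    def kl(a):
        return sum(a[t] * math.log(a[t] / m[t]) for t in a if a[t] > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def eligible(history, conf_min=0.9, persist=2, k=3, jsd_max=0.01):
    """Decide whether one masked position may be unmasked.

    history: per-step predictive distributions (dict token -> probability)
    for that position, oldest first. All thresholds are hypothetical.
    """
    if len(history) < persist:
        return False
    recent = history[-persist:]
    tops = [max(d, key=d.get) for d in recent]
    confident = recent[-1][tops[-1]] >= conf_min
    persistent = len(set(tops)) == 1
    stable = all(
        js_divergence(top_k(a, k), top_k(b, k)) <= jsd_max
        for a, b in zip(recent, recent[1:])
    )
    return confident and persistent and stable
```

A position whose top-1 token is unchanged and confident can still be held back when its distribution is shifting between steps, which is the case the third condition is meant to catch.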

2606.16934 2026-06-16 cs.CL cs.LG 新提交

Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

探索代码解释器有效推理的外在属性与内在属性

Patomporn Payoungkhamdee, Napat Laosaengpha, Jenta Wonglertsakul, Pittawat Taveekitworachai, Pume Tuchinda, Panjapong Poobanchuen, Ekapol Chuangsuwanich, Can Udomcharoenchaikit, Samuel Cahyawijaya, Peerat Limkonchotiwat, Sarana Nutanong

发表机构 * Vidyasirimedhi Institute of Science and Technology(维达亚希米科技学院) Kasetsart University(泰国农业大学) SCB 10X King Mongkut’s University of Technology Thonburi(吞武里国王科技大学) Department of Computer Engineering, Chulalongkorn University(朱拉隆功大学计算机工程系) Cohere AI Singapore(AI新加坡)

AI总结 本文从外在属性(关键token)和内在属性(代码特定认知行为)两个角度研究代码解释器推理,发现强模型更频繁出现关键token和验证、回溯等行为,并利用这些属性在推理和训练中提升性能。

详情
AI中文摘要

使用代码解释器(CI)进行推理已成为一种有效范式,通过可执行计算和迭代验证增强大型语言模型(LLM)的推理能力。尽管其应用日益广泛,但有效代码推理的行为属性仍未被充分探索。在本工作中,我们受自然语言推理研究的启发,从两个不同视角研究代码推理:外在属性(由关键token表示)和内在属性(由代码特定的认知行为表示)。在多个LLM上,我们发现更强的CI推理模型一致地表现出更高比例的关键token和认知行为,特别是验证、回溯和反向链。基于这些观察,我们研究了如何在推理和训练期间利用这些属性。在推理时,附加代码特定的关键token在数学、排序和优化等若干推理能力上提升了性能,但在其他方面收益有限。在训练时,用代码特定的认知行为增强最先进的框架,在三个评估模型中的两个上提升了监督微调和强化学习性能。进一步分析表明,这些行为减少了错误回答中的过度思考,提高了token效率,同时也揭示了限制某个模型收益的因素。我们的发现首次系统性地描述了有效CI推理的特征,并展示了利用关键属性改进CI推理的潜力和局限性。

英文摘要

Reasoning with a Code Interpreter (CI) has emerged as an effective paradigm for enhancing the reasoning capabilities of large language models (LLMs) through executable computation and iterative verification. Despite its growing adoption, the behavioral properties underlying effective code reasoning remain largely underexplored. In this work, we investigate code reasoning from two distinct perspectives inspired by prior studies of natural language reasoning: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors. Across multiple LLMs, we find that stronger CI reasoning models consistently exhibit a higher prevalence of crucial tokens and cognitive behaviors, particularly verification, backtracking, and backward chaining. Building on these observations, we examine how these properties can be leveraged during both inference and training. At inference time, appending code-specific crucial tokens improves performance on several reasoning capabilities, including mathematical, ordering, and optimization, while yielding limited benefits elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, while also revealing factors that limit gains in a certain model. Our findings provide the first systematic characterization of effective reasoning with CI and demonstrate both the potential and limitations of leveraging key properties to improve CI-based reasoning.

2606.17034 2026-06-16 cs.CL cs.LG 新提交

KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing

KVEraser: 学习操控KV缓存以实现高效的局部上下文擦除

Mufei Li, Shikun Liu, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Meta

AI总结 提出KVEraser方法,通过学习操控KV缓存实现局部上下文擦除,避免全局重计算,在长上下文任务中接近全重算性能且延迟仅增加24%。

Comments Oral at the ICML 2026 Workshop on the Impact of Memorization on Trustworthy Foundation Models

详情
AI中文摘要

在KV缓存上进行事后上下文擦除具有挑战性,因为局部编辑会产生全局影响:一旦某个跨度被处理,其影响会传播到所有后续token的缓存状态。这个问题在长上下文LLM应用中自然出现,其中过时的检索事实、错误的工具观察、撤回的用户偏好或有害的提示注入可能仅在预填充后才发现。精确擦除必须重新计算删除跨度后的所有token,使其计算成本取决于后缀长度而非擦除跨度长度。我们引入KVEraser,一种学习型KV缓存编辑方法,用于高效的局部上下文擦除。给定已处理的上下文和要移除的跨度,KVEraser仅用学习到的操控状态替换擦除区间的KV状态,同时保持其余缓存不变。为了学习可迁移的擦除机制,我们构建了一个两阶段训练流程:通用跨度-邻居预训练教会擦除器抑制擦除跨度的影响,而任务特定微调将此能力适应下游场景。实验表明,在1K--32K上下文长度的域内任务中,KVEraser在擦除后性能上几乎匹配全重算,而其延迟仅增加24%,而全重算延迟增加17.6倍。KVEraser还能泛化到具有有害事实干扰项的未见长文档QA任务,在全重算的3--4倍加速下,在近似基线中取得最佳性能。

英文摘要

Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, making its computational cost depend on suffix length rather than erased-span length. We introduce KVEraser, a learned KV-cache editing method for efficient localized context erasing. Given a processed context and a span to remove, KVEraser replaces only the KV states of the erased interval with learned steering states while reusing the remaining cache unchanged. To learn a transferable erasing mechanism, we build a two-stage training pipeline: generic span-neighbor pre-training teaches the eraser to suppress the influence of the erased span, while task-specific fine-tuning adapts this capability to downstream scenarios. Experiments show that KVEraser nearly matches full recomputation in post-erasure performance on in-domain tasks across 1K--32K context lengths, while its latency increases by only 24% compared with a 17.6x increase for full recomputation. KVEraser also generalizes to unseen long-document QA tasks with harmful factual distractors, achieving the best performance among approximate baselines with a 3--4x speedup over full recomputation.
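
The cost argument (exact erasing scales with suffix length, localized editing with span length) can be illustrated with a simple position count. This is a back-of-the-envelope sketch with hypothetical names and sizes. It counts only how many token positions need new KV states and does not model the learned steering states themselves.

```python
def positions_to_update(context_len, span_start, span_len, localized):
    """Number of token positions whose KV states must be produced anew.

    localized=False: exact erasing, recompute every token after the span.
    localized=True: edit only the erased interval and reuse the rest.
    """
    span_end = span_start + span_len
    if span_start < 0 or span_len <= 0 or span_end > context_len:
        raise ValueError("span must lie inside the context")
    suffix_len = context_len - span_end
    return span_len if localized else suffix_len


# Hypothetical case: a 200-token span near the start of a 32,000-token context.
exact_cost = positions_to_update(32_000, 1_000, 200, localized=False)
local_cost = positions_to_update(32_000, 1_000, 200, localized=True)
```

The gap between the two counts grows with how early the erased span sits in the context, which is why the problem is most visible in long-context settings.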

2606.17053 2026-06-16 cs.CL cs.CV 新提交

Context-Aware RL for Agentic and Multimodal LLMs

上下文感知强化学习用于智能体与多模态大语言模型

Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu

发表机构 * Princeton University(普林斯顿大学) UC Davis(加州大学戴维斯分校)

AI总结 提出ContextRL方法,通过间接辅助目标(上下文选择奖励)增强大模型在长上下文和多模态任务中的细粒度推理能力,在5个长程基准和12个视觉问答基准上分别提升+2.2%和+1.8%。

Comments 29 pages, 9 figures

详情
AI中文摘要

大语言模型在需要从长或复杂上下文中识别细小但决定性证据(如工具跟踪中的一行或图像中的细微细节)时常常失败。我们提出ContextRL,一种上下文感知的强化学习方法,通过一个间接辅助目标来提升长程推理和多模态性能。ContextRL不是仅监督最终答案,而是向模型提供查询、答案和两个高度相似的上下文,并奖励它选择支持查询-答案对的上下文,从而鼓励细粒度定位。我们在两个领域构建对比上下文数据:对于编码智能体,轨迹作为上下文,通过条件过滤生成1k对;对于多模态推理,图像作为上下文,通过生成式编辑和相似性搜索生成7K对。ContextRL在5个长程基准上比标准GRPO平均提升+2.2%,在12个多样化视觉问答基准上平均提升+1.8%。为了分离所提目标与额外数据的影响,我们与数据增强基线进行比较,这些基线将相同的对比上下文重新用作标准查询-上下文-答案示例。这些基线几乎没有改进,表明收益来自所提出的上下文选择目标,而非仅对比数据。

英文摘要

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.

2606.17056 2026-06-16 cs.CL 新提交

The Value Axis: Language Models Encode Whether They're on the Right Track

价值轴:语言模型编码它们是否在正确的轨道上

Nick Jiang, Isaac Kauvar, Jack Lindsey

发表机构 * Stanford University(斯坦福大学) Anthropic

AI总结 通过构建Qwen3-8B的“价值轴”,发现语言模型内部追踪当前轨迹的成功概率,并影响自信、自我纠正和探索行为。

Comments Code repository: https://github.com/nickjiang2378/value-axis

详情
AI中文摘要

我们研究语言模型是否内部追踪其当前轨迹的价值,定义为当前策略实现目标的似然。使用合成的上下文强化学习数据,我们为Qwen3-8B构建了一个“价值轴”。我们发现沿此轴的激活区分了高与低口头自信、无回溯与有回溯的展开、正确与错误的代码。向高价值引导因果地抑制自我纠正并减少解释冗长,而向低价值引导则诱导回溯和探索。我们证明直接偏好优化(DPO)可以增加奖励行为(例如使用某个词)的内部价值,使模型在展示这些行为后表现得更自信。最后,我们将价值轴应用于研究野外设置。例如,我们发现Qwen在训练后对政治敏感的聊天查询分配低价值,并且监督微调增加了训练领域内的内部自信。我们的结果表明语言模型线性编码对预期目标成功的一个估计,该估计调节它们追求方向的自信。

英文摘要

We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g. use a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.

2603.04592 2026-06-16 cs.CL cs.CV 新提交

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

从静态推理到动态交互:流式大型语言模型综述

Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Institute of Digital Twin, Eastern Institute of Technology(数字孪生研究院,东部技术研究院)

AI总结 本文统一了流式LLM的定义,提出系统分类法,综述其方法、应用与未来方向。

Comments Accepted by ACL 2026 Findings

详情
AI中文摘要

标准大型语言模型(LLM)主要设计用于预定义输入的静态推理,这限制了它们在动态实时场景中的适用性。为解决这一差距,流式LLM范式应运而生。然而,现有流式LLM的定义仍然零散,混淆了流式生成、流式输入和交互式流式架构,且缺乏系统分类法。本文对流式LLM进行了全面概述和分析。首先,我们基于数据流和动态交互建立了流式LLM的统一定义,以澄清现有歧义。基于这一定义,我们提出了当前流式LLM的系统分类法,并对其底层方法进行了深入讨论。此外,我们探讨了流式LLM在现实场景中的应用,并概述了有前景的研究方向,以支持流式智能的持续进展。我们在以下网址维护一个持续更新的相关论文仓库:https://github.com/EIT-NLP/Awesome-Streaming-LLMs。

英文摘要

Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.

2606.15621 2026-06-16 cs.LG cs.CL 交叉投稿

Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation

重新输入并非重放:在反事实令牌信用估计中测量重放噪声

Nils Matteson

发表机构 * Northeastern University(东北大学)

AI总结 通过三遍实验设计,测量了在反事实令牌信用估计中将前缀重新输入所引入的噪声,发现其改变信用估计的比率高于副本噪声基底,建议恢复解码器状态或使用批不变内核。

Comments 10 pages, 3 figures. Code, per-pivot data, logs, and registration: https://github.com/thaw-ai/thaw (benchmarks/, paper/refeed-drift/)

详情
AI中文摘要

逐令牌反事实信用估计询问语言模型生成结果中哪个令牌导致最终答案正确或错误:在某个枢轴处截断转录,替换一个替代令牌,重放后续内容,并比较结果。已发表的方法将转录前缀作为新提示重新输入,假设这能重现模型在生成过程中经过的状态。我们在一个标准推理引擎上测量了这一假设的代价,采用三遍设计:从经过验证的解码时KV状态恢复的继续生成,一个完全相同的第二遍精确传递(副本噪声基底),以及一个重新输入传递。在六种配置和三个模型(包括一个GRPO训练的检查点)中,在低边际决策令牌处,重新输入改变信用估计的比率比副本基底高14-28个百分点(在与处理无关的条件化下为7-21个百分点;按问题聚类的t=2.9-6.4)。大多数变化是量化估计器的零边界交叉而非极性反转,且扰动与零均值一致,因此平均量基本安全;但选择并非如此:通过阈值化$|\hat{A}_t|$在重新输入下选择的临界令牌集与精确恢复选择的Jaccard重叠为0.34-0.90,而副本上限为0.63-0.96。一项因果验证使论证闭环:在vLLM的批不变内核下,所有三遍在每一个测量通道上完全相同,两种分歧率均恰好为零。副本传递本身在9-23%的合格估计上存在分歧:决策令牌处的单样本信用测量在任何重放下都不可靠。实验设置事先固定;第二轮实验中的精确传递缓存命中经过插桩记录(100%命中率,3434个枢轴);总计算成本低于10美元。我们建议反事实信用研究恢复解码器状态或使用批不变内核,并报告副本基底。

英文摘要

Per-token counterfactual credit estimation asks which token in a language-model rollout caused the final answer to be right or wrong: cut the transcript at a pivot, substitute an alternative token, replay continuations, and compare outcomes. Published methods re-feed the transcript prefix as a fresh prompt, assuming this reproduces the state the model passed through during generation. We measure what that assumption costs on a stock inference engine, with a three-pass design: continuations resumed from the verified decode-time KV state, an identical second exact pass (a replica noise floor), and a re-feed pass. Across six configurations and three models (including a GRPO-trained checkpoint), at low-margin decision tokens, re-feeding changes the credit estimate at rates 14-28 percentage points above the replica floor (7-21pp under a treatment-independent conditioning; problem-clustered t = 2.9-6.4). Most changes are zero-boundary crossings of the quantized estimator rather than polarity reversals, and the perturbation is consistent with mean-zero, so averaged quantities are largely safe; but selection is not: a critical-token set chosen by thresholding $|\hat{A}_t|$ under re-feed overlaps the exact-resume selection at Jaccard 0.34-0.90, versus a 0.63-0.96 replica ceiling. A causal confirmation closes the loop: under vLLM's batch-invariant kernels all three passes are identical on every measured channel, with both disagreement rates exactly zero. Replica passes themselves disagree on 9-23% of eligible estimates: single-sample credit measurements at decision tokens are unreliable under any replay. Settings were fixed in advance; exact-pass cache hits in the second campaign are instrumented (100% hit rate, 3,434 pivots); total compute was under 10 USD. We recommend that counterfactual credit studies resume decoder state or use batch-invariant kernels, and report a replica floor.

2606.15652 2026-06-16 cs.LG cs.CL 交叉投稿

MosaicQuant: Inlier-Outlier Disaggregation for Unified 4-Bit LLM Quantization

MosaicQuant: 基于内点-离点分离的统一4位LLM量化

Yangjia Hu, Haodong Wang, Zicong Hong, Qianli Liu, Quanxin Shou, Jian Lin, Song Guo, Xiaowei Shen, Xiangjun Huang, Dian Wang, Jian Yang

发表机构 * HKUST(香港科技大学) EPFL(洛桑联邦理工学院) MetaX Integrated Circuits Co., Ltd(MetaX集成电路有限公司)

AI总结 提出MosaicQuant,通过将权重矩阵量化为密集4位基分量和稀疏4位残差分量,结合ZipperEngine融合稀疏块计算,实现统一4位推理,在LLaMA3和Qwen3上保持近FP16精度并加速1.24倍。

Comments 17 pages

详情
AI中文摘要

4位量化显著减少了内存占用并加速了大语言模型(LLM)的推理。然而,其有限的位宽表示难以忠实捕捉密集的常见值(内点)和罕见的大幅度值(离点),导致显著的精度下降。现有的混合精度方法通过保留离点的高精度来缓解这一问题,但代价是破坏了低比特执行的统一性,引入了精度转换和额外的数据移动,削弱了实际加速效果。我们提出MosaicQuant,一种基于内点-离点分离新原理的统一4位LLM量化范式。MosaicQuant不提升离点精度,而是将整个权重矩阵量化为密集的4位基分量,其中内点被忠实捕捉,而离点不可避免地量化。然后引入一个稀疏的4位残差分量来补偿这些量化误差,选择性地针对输出失真最严重的误差关键权重块。然而,仅统一表示是不够的,因为将稀疏残差作为单独内核执行仍然会破坏统一的低比特推理流水线。为弥补这一差距,我们引入ZipperEngine,通过重叠流水线将稀疏块计算融合到密集4位GEMM内核中,不仅统一了表示,而且将执行统一为单个连贯的低比特推理流水线。在LLaMA3和Qwen3上的大量实验表明,MosaicQuant在保持接近FP16精度的同时,相比W16A16基线实现了高达1.24倍的加速。

英文摘要

4-bit quantization significantly reduces the memory footprint and accelerates the inference of large language models (LLMs). However, its limited bit-width representation struggles to faithfully capture both dense common values (inliers) and rare large-magnitude values (outliers), causing substantial accuracy degradation. Existing mixed-precision methods mitigate this by retaining outliers in high precision, but at the cost of breaking the uniformity of low-bit execution, introducing precision conversion and extra data movement that undermine practical speedup. We propose MosaicQuant, a unified 4-bit LLM quantization paradigm built on a novel principle of inlier-outlier disaggregation. Rather than elevating outlier precision, MosaicQuant quantizes the full weight matrix into a dense 4-bit base component, where inliers are captured faithfully while outliers are inevitably quantized. A sparse 4-bit residual component is then introduced to compensate for these quantization errors, selectively targeting the most error-critical weight blocks where output distortion is shown to be concentrated. However, a unified representation alone is insufficient, as naïvely executing the sparse residual as a separate kernel still breaks the unified low-bit inference pipeline. To bridge this gap, we introduce ZipperEngine, which fuses sparse block computation into the dense 4-bit GEMM kernel via an overlapped pipeline, unifying not only the representation but also the execution into a single coherent low-bit inference pipeline. Extensive experiments on LLaMA3 and Qwen3 demonstrate that MosaicQuant preserves near-FP16 accuracy while achieving up to 1.24x speedup over the W16A16 baseline.
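
The base-plus-residual idea can be sketched on a flat weight list. The sketch rests on several assumptions that the abstract does not confirm: symmetric per-block absmax scaling, blocks ranked by squared weight error (the abstract speaks of output distortion, which would need activations), and hypothetical function names. It is not the MosaicQuant algorithm and does not touch the fused-kernel side.

```python
def quantize4(values):
    """Symmetric 4-bit quantization of one block: codes in [-8, 7] plus a scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def base_plus_residual(weights, block_size=4, n_residual_blocks=1):
    """Dense 4-bit base for every block, 4-bit residual for the worst blocks."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    base = [dequantize(*quantize4(b)) for b in blocks]
    errors = [[w - q for w, q in zip(b, bq)] for b, bq in zip(blocks, base)]
    energy = [sum(e * e for e in err) for err in errors]
    # Pick the blocks with the largest quantization error (assumed criterion).
    chosen = sorted(range(len(blocks)), key=lambda i: energy[i],
                    reverse=True)[:n_residual_blocks]
    recon = [list(bq) for bq in base]
    for i in chosen:
        residual = dequantize(*quantize4(errors[i]))  # residual is also 4-bit
        recon[i] = [q + r for q, r in zip(base[i], residual)]
    return [v for block in recon for v in block], sorted(chosen)


def squared_error(original, reconstructed):
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed))


# Hypothetical weights: the second block holds one large outlier (8.0).
weights = [0.1, -0.2, 0.05, 0.15, 0.1, -0.1, 8.0, 0.2]
base_only, _ = base_plus_residual(weights, n_residual_blocks=0)
with_residual, chosen_blocks = base_plus_residual(weights, n_residual_blocks=1)
```

The outlier stretches the scale of its block, so the small weights around it round to zero in the base component. The residual pass recovers them while every stored value stays 4-bit.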

2606.16140 2026-06-16 cs.AI cs.CL 交叉投稿

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

VibeThinker-3B:探索小型语言模型中可验证推理的前沿

Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出3B参数紧凑模型VibeThinker-3B,通过频谱到信号后训练范式(课程SFT、多域强化学习、离线自蒸馏)在可验证推理任务上达到前沿性能,匹配甚至超越大模型,并验证推理增强不损害指令可控性。

详情
AI中文摘要

本技术报告介绍了VibeThinker-3B,一个具有3B参数的紧凑密集模型,旨在探究在严格的小模型范围内可验证推理能推进到何种程度。基于频谱到信号后训练范式,我们通过优化的流程系统性地增强模型,该流程包括基于课程的监督微调、多域强化学习和离线自蒸馏。实验评估表明,VibeThinker-3B在高度要求的可验证任务上达到了前沿水平。具体来说,它在AIME26上获得94.3分(通过声明级测试时缩放提升至97.1),在LiveCodeBench v6上获得80.2的Pass@1,并在最近的未见LeetCode竞赛中表现出强大的分布外泛化能力,接受率达96.1%。这有效地将其置于一流推理系统的性能区间,匹配或超越规模大数个数量级的旗舰模型,如DeepSeek V3.2、GLM-5和Gemini 3 Pro。此外,IFEval上的93.4分证实了这种极端的推理增强并未损害严格的指令可控性。扩展我们之前的1.5B工作,这些发现推动了参数压缩-覆盖假说,该假说将可验证推理视为可压缩到紧凑推理核心中,而开放域知识和通用能力则需要广泛的参数覆盖事实、概念和长尾场景。这一观点表明,紧凑模型不仅是部署高效的替代品,更是通往参数密集能力领域前沿性能的互补路径。

英文摘要

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1\% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

2606.16310 2026-06-16 cs.LG cs.CL 交叉投稿

QK-Normed MLA: QK normalization without full key caching

QK归一化MLA:无需完整键缓存的QK归一化

Yizhou Han, Yao Zhao, Jun Zhou, Longfei Li, Ruoyu Sun

发表机构 * The Chinese University of Hong Kong(香港中文大学) Ant Group(蚂蚁集团)

AI总结 提出QK归一化与MLA兼容的方法,通过吸收静态权重和动态标量,无需缓存完整键,在400M模型训练中降低损失并提升下游精度,解码延迟增加小于2%。

Comments 13 pages, 5 figures, conference-style manuscript

详情
AI中文摘要

查询-键(QK)归一化通过控制点积前查询和键的尺度来稳定注意力,但无法直接与多头潜在注意力(MLA)兼容。MLA通过缓存低维潜在状态而非完整键来实现高效解码,而投影后的QK RMSNorm似乎需要对每个缓存的token使用完全投影的键。我们表明这种明显的不兼容性是实现伪影,而非架构约束。RMSNorm分解为静态仿射权重和动态标量RMS统计量。静态键侧权重可以吸收到MLA查询侧投影中;动态键统计量简化为每个token和KV组的一个逆RMS标量。得到的公式在精确算术中与显式投影后QK RMSNorm完全等价,并保留了MLA的潜在解码路径。在我们训练高达100B token的400M参数模型中,QK归一化MLA相比QK裁剪实现了更低的训练损失和更好的下游准确率,而H800解码基准测试显示在高达256k上下文下延迟开销小于2%。这些结果使得QK归一化成为MLA模型实用的稳定选项,无需完整键缓存。

英文摘要

Query-key (QK) normalization stabilizes attention by controlling the scale of queries and keys before the dot product, but is not immediately compatible with Multi-head Latent Attention (MLA). MLA achieves efficient decoding by caching low-dimensional latent states instead of full keys, whereas post-projection QK RMSNorm appears to require the fully projected key for every cached token. We show this apparent incompatibility is an implementation artifact, not an architectural constraint. RMSNorm decomposes into a static affine weight and a dynamic scalar RMS statistic. The static key-side weight can be absorbed into the MLA query-side projection; the dynamic key statistic reduces to one inverse-RMS scalar per token and KV group. The resulting formulation is exactly equivalent to explicit post-projection QK RMSNorm in exact arithmetic and preserves MLA's latent decode path. In our 400M runs trained for up to 100B tokens, QK-Normed MLA achieves lower training loss and better downstream accuracy than QK clipping, while H800 decode benchmarks show less than 2% latency overhead up to 256k context. These results make QK normalization a practical stabilization option for MLA models without requiring full-key caching.
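
The absorption argument in this abstract can be checked numerically with a minimal sketch. It covers only the key-side RMSNorm for a single head and a single cached token; the names `W_k`, `w`, and `c` (key up-projection, RMSNorm weight, cached latent) are illustrative assumptions and not the paper's notation.

```python
import math
import random


def rms(v, eps=1e-6):
    return math.sqrt(sum(x * x for x in v) / len(v) + eps)


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def explicit_score(q, w, W_k, c):
    """Reference path: project the latent to a full key, RMSNorm it, dot with q."""
    k = matvec(W_k, c)
    r = rms(k)
    k_normed = [w_i * k_i / r for w_i, k_i in zip(w, k)]
    return dot(q, k_normed)


def absorbed_score(q, w, W_k, c, inv_rms):
    """Latent path: fold the static weight w into the query side and keep
    one inverse-RMS scalar per cached token instead of the full key."""
    q_scaled = [q_i * w_i for q_i, w_i in zip(q, w)]
    # q_latent = W_k^T (q * w), computed once per query.
    q_latent = [sum(W_k[i][j] * q_scaled[i] for i in range(len(W_k)))
                for j in range(len(c))]
    return dot(q_latent, c) * inv_rms


rng = random.Random(0)
d_head, d_latent = 8, 4
q = [rng.gauss(0, 1) for _ in range(d_head)]
w = [rng.uniform(0.5, 1.5) for _ in range(d_head)]
W_k = [[rng.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
c = [rng.gauss(0, 1) for _ in range(d_latent)]
inv_rms = 1.0 / rms(matvec(W_k, c))  # computed once, when the token is cached
print(explicit_score(q, w, W_k, c), absorbed_score(q, w, W_k, c, inv_rms))
```

The two scores agree up to floating-point rounding, which is the sense in which the cache only needs the latent plus one scalar per token.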

2606.16811 2026-06-16 cs.AI cs.CL 交叉投稿

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

从最小标签扩展LLM推理:一种带有轻量级验证器的半监督框架

Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi

发表机构 * Fujitsu Limited(富士通株式会社) Kyoto University(京都大学) National Institute of Informatics(国立信息学研究所)

AI总结 提出半监督框架,用轻量级推理正确性分类器和熵过滤从少量标注数据生成高质量伪推理链,在数学和视觉问答任务上达到10-15倍标注数据效果。

Comments LREC 2026. Section 3.3 is updated

详情
AI中文摘要

对于大型语言模型(LLMs)的发展,最近生成伪中间推理的方法取得了显著进展。但它们通常依赖大量正确标注的答案来评估推理质量。本文提出一种半监督框架,从最小监督中扩展推理学习,将推理验证本身转变为数据创建机制。我们仅在少量标注样本上训练一个轻量级推理正确性分类器,用于判断LLM生成的中间推理轨迹是否有效。此外,基于熵的置信度阈值过滤掉不可靠样本,剩余的高置信度推理轨迹用于微调模型。在可验证数学问题(Orca-Math子集)和基于视觉编程的图像场景图问答(GQA)上的实验表明,我们的方法达到了与使用10-15倍标注数据相当的准确率。消融分析证实,分类器和熵过滤对于可扩展且抗噪声的伪标签生成都是必不可少的。通过用轻量级推理验证替代昂贵的答案级监督,我们的方法为构建大规模推理资源提供了一条实用路径,并为未来从最小人工输入中学习的自主推理系统铺平了道路。

英文摘要

For the development of large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. However, they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.

2505.09655 2026-06-16 cs.CL cs.LG 版本更新

DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

DRA-GRPO:你的GRPO需要了解多样化的推理路径以进行数学推理

Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi

发表机构 * Morgan Stanley(摩根士丹利) Clemson University(克莱姆森大学) Arizona State University(亚利桑那州立大学) Washington University in St. Louis(圣路易斯华盛顿大学) University of Notre Dame(圣母大学) University of Arizona(亚利桑那大学)

AI总结 针对GRPO在数学推理中因奖励信号非单射导致策略坍塌的问题,提出基于子模互信息的多样性感知奖励调整框架DRA,通过逆倾向评分去偏梯度估计,在五个数学基准上以少量数据和成本取得平均58.2%的准确率。

Comments ACL2026

详情
AI中文摘要

使用强化学习(特别是组相对策略优化GRPO)对大型语言模型进行后训练已成为增强数学推理的一种范式。然而,标准GRPO依赖于标量正确性奖励,这些奖励在语义内容上通常是非单射的:不同的推理路径获得相同的奖励。这导致了多样性-质量不一致性,策略会坍缩到一组狭窄的主导模式,而忽略同样有效但结构新颖的策略。为弥补这一差距,我们提出了多样性感知奖励调整(DRA),这是一个理论上有基础的框架,它使用采样组的语义密度来校准奖励信号。通过利用子模互信息(SMI),DRA实现了一种逆倾向评分(IPS)机制,有效去偏梯度估计。这产生了对抗冗余的排斥力,推动策略更好地覆盖高奖励区域。我们的方法是即插即用的,并与GRPO变体无缝集成。在五个数学基准上的实证评估表明,DRA-GRPO持续优于强基线,在DeepSeek-R1-Distill-Qwen-1.5B上仅使用7,000个训练样本和55美元成本就达到了58.2%的平均准确率,突显了多样性校准在数据高效对齐中的关键作用。代码可在 https://github.com/xiwenc1/DRA-GRPO 获取。

英文摘要

Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost, highlighting the critical role of diversity calibration in data-efficient alignment. The code is available at https://github.com/xiwenc1/DRA-GRPO.

2505.23666 2026-06-16 cs.CL cs.LG 版本更新

LoLA: Low-Rank Linear Attention With Sparse Caching

LoLA: 低秩线性注意力与稀疏缓存

Luke McDermott, Robert W. Heath, Rahul Parhi

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出LoLA,一种无需训练的线性注意力增强方法,通过三种记忆系统(局部滑动窗口、稀疏全局缓存和循环隐状态)提升关联回忆,在pass-key检索任务上将准确率从0.6%提升至97.4%,且缓存大小比Llama-3.1 8B小4.6倍。

详情
AI中文摘要

Transformer推理的每token成本随上下文长度扩展,阻碍了其在终身上下文学习中的应用。线性注意力是一种高效的替代方案,即使在无限上下文长度下也能保持恒定的内存占用。虽然这可能是终身学习的潜在候选,但其内存容量不足。在本文中,我们提出LoLA,一种无需训练的线性注意力增强方法,可提升关联回忆。LoLA将上下文中的过去键值对分配到三种记忆系统中:(i) 局部滑动窗口缓存中的近期对;(ii) 稀疏全局缓存中的难以记忆的对;以及(iii) 线性注意力循环隐状态中的通用对。通过消融实验,我们表明自回忆误差指标对于高效管理长期关联记忆至关重要。在pass-key检索任务上,LoLA将基础模型的准确率从0.6%提升至97.4%。这是在4K上下文长度下,缓存大小比Llama-3.1 8B小4.6倍的情况下实现的。LoLA在零样本常识推理任务上也优于其他1B和8B参数的次二次模型。

英文摘要

The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even on infinite context lengths. While this is a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently manage long-term associative memories. On pass-key retrieval tasks, LoLA improves the base model's performance from 0.6% to 97.4% accuracy. This is achieved with a 4.6x smaller cache than Llama-3.1 8B on 4K context length. LoLA also outperforms other 1B and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.

2506.11418 2026-06-16 cs.CL 版本更新

CentroidKV: Efficient Long-Context LLM Inference via KV Cache Clustering

CentroidKV: 通过KV缓存聚类实现高效的长上下文LLM推理

Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li, Kun Yuan

发表机构 * Peking University(北京大学) Huawei Technologies(华为技术有限公司) University of Science and Technology of China(中国科学技术大学)

AI总结 提出CentroidKV框架,通过在线聚类KV缓存减少内存占用,在保持性能的同时实现高达75%的缓存压缩和1.92倍解码加速。

详情
AI中文摘要

具有扩展上下文窗口的大型语言模型(LLM)在处理复杂任务中越来越普遍。然而,长上下文LLM所需的大量键值(KV)缓存带来了显著的部署挑战。现有方法要么丢弃未来生成可能需要的潜在关键信息,要么由于高计算开销而提供有限的效率提升。在本文中,我们介绍了CentroidKV,一个简单而有效的在线KV缓存聚类框架。我们的方法基于观察到键状态在序列维度上表现出高度相似性。为了实现高效聚类,我们将序列划分为块,并提出分块软匹配(Chunked Soft Matching),它在每个块内采用交替分区策略,并基于相似性识别聚类。CentroidKV然后将每个聚类内的KV缓存合并为单个质心。此外,我们提供了计算复杂度和块内分区策略最优性的理论分析。在各种模型和长上下文基准上的广泛实验表明,CentroidKV在保持可比模型性能的同时,实现了高达75%的KV缓存内存减少。此外,由于计算开销极小,CentroidKV将推理的解码阶段加速高达1.92倍,并将服务吞吐量提高高达4倍。

英文摘要

Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce CentroidKV, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. CentroidKV then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that CentroidKV achieves up to 75% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, CentroidKV accelerates the decoding stage of inference by up to $1.92\times$ and increases the serving throughput by up to $4\times$.
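
The clustering step described in this abstract can be sketched roughly as follows. This is a loose illustration under stated assumptions, not the paper's algorithm: the even/odd alternating split, the cosine-similarity threshold, and merging by plain averaging are guesses at one plausible reading of "Chunked Soft Matching", and `merge_chunk` is a hypothetical name. Keys are assumed to be nonzero vectors.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def merge_chunk(keys, values, threshold=0.9):
    """Merge similar KV pairs inside one chunk; returns (merged_keys, merged_values)."""
    anchors = list(range(0, len(keys), 2))  # even positions seed the clusters
    others = list(range(1, len(keys), 2))   # odd positions are matched to an anchor
    clusters = {a: [a] for a in anchors}
    for j in others:
        best = max(anchors, key=lambda a: cosine(keys[a], keys[j]))
        if cosine(keys[best], keys[j]) >= threshold:
            clusters[best].append(j)
        else:
            clusters[j] = [j]  # not similar enough: keep as its own entry
    order = sorted(clusters)
    merged_keys = [mean([keys[i] for i in clusters[c]]) for c in order]
    merged_values = [mean([values[i] for i in clusters[c]]) for c in order]
    return merged_keys, merged_values


keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
values = [[1.0], [3.0], [10.0], [20.0]]
merged_keys, merged_values = merge_chunk(keys, values)
print(len(keys), "->", len(merged_keys), merged_values)
```

Restricting the matching to one chunk keeps the similarity computation local, which is presumably why the abstract describes the overhead as minimal.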

2509.24494 2026-06-16 cs.CL 版本更新

Why Tree-Style Branching Matters for Thought Advantage Estimation in GRPO

为什么树状分支对GRPO中的思维优势估计至关重要

Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong

AI总结 本文从方差角度理论证明,在GRPO中增加思维采样数无法消除估计方差,而增加每个思维的延续分支数能以1/M速率降低方差,从而阐明树状分支的必要性。

Comments Accepted by ICML 2026, code is available at https://github.com/whcpumpkin/GRPO-MA

详情
AI中文摘要

组相对策略优化(GRPO)使用可验证奖励训练思维链推理,但缺乏价值函数的思维级优势估计常面临高方差问题。尽管实践中采用树状分支来降低方差,但缺乏对其有效性及必要性的理论解释。我们从方差角度研究最小树状设置下(每个思维采样多个延续)的思维级优势估计。利用多元delta方法,我们揭示了采样维度的不对称性:增加采样思维数($K$)会留下严格为正的估计方差下限,而增加每个思维的延续数($M$)能使主导阶估计方差以$1/M$速率趋近于零。这意味着,在本文研究的固定温度GRPO风格估计器(无价值模型)中,仅靠扩大思维采样无法实现准确的思维级优势估计,因此延续级分支是一种有原则且可能必要的机制,而非启发式方法。实验进一步提供了其有效性和潜在必要性的经验证据,不仅在数学领域,而且在视觉领域以及不同模型架构和规模下,均展示了优化稳定性、训练效率和最终性能的提升。

英文摘要

Group Relative Policy Optimization (GRPO) trains Chain-of-Thought reasoning with verifiable rewards, but estimating thought-level advantages without value functions often suffers from high variance. Although tree-style branching is used in practice to reduce variance, it lacks a theoretical explanation of why it works and whether it is important or potentially necessary. We study thought-level advantage estimation in GRPO from a variance perspective under a minimal tree-style setting where multiple continuations are sampled for each thought. Using the multivariate delta method, we reveal a sampling-dimension asymmetry. Increasing sampled thoughts ($K$) leaves a strictly positive estimation-variance floor, whereas increasing continuations per thought ($M$) drives the leading-order estimation variance to zero at rate $1/M$. This implies that, within the fixed-temperature GRPO-style estimator without value models studied here, accurate thought-level advantage estimation cannot be achieved by scaling thought sampling alone, making continuation-level branching a principled and potentially necessary mechanism rather than a heuristic. Experiments further provide empirical evidence for its effectiveness and potential necessity, demonstrating improved optimization stability, training efficiency, and final performance not only in math but also across vision domains and under different model architectures and sizes.
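
The $1/M$ rate mentioned in this abstract has a simple special case that can be checked directly. The sketch below assumes binary (correct/incorrect) continuation rewards drawn independently with success probability `p`; it illustrates only the per-thought value estimate, not the paper's full delta-method analysis or the variance floor in $K$.

```python
import random


def value_estimate_variance(p, m):
    """Exact variance of the mean of m i.i.d. Bernoulli(p) continuation rewards."""
    return p * (1.0 - p) / m


def simulated_variance(p, m, trials=20000, seed=0):
    """Monte Carlo check: variance of the averaged reward across many repetitions."""
    rng = random.Random(seed)
    estimates = [sum(rng.random() < p for _ in range(m)) / m for _ in range(trials)]
    mu = sum(estimates) / trials
    return sum((e - mu) ** 2 for e in estimates) / trials


for m in (1, 2, 4, 8):
    print(m, value_estimate_variance(0.5, m), round(simulated_variance(0.5, m), 4))
```

Doubling the number of continuations per thought halves the variance of that thought's value estimate, which matches the rate stated in the abstract for this simplified case.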

2510.07651 2026-06-16 cs.CL cs.AI 版本更新

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

OBCache: 面向高效长上下文LLM推理的最优脑KV缓存剪枝

Yuzhe Gu, Xiyu Liang, Jiaojiao Zhao, Enmao Diao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出OBCache框架,将缓存驱逐形式化为逐层结构化剪枝问题,基于最优脑损伤理论量化令牌显著性,通过输出感知分数改进现有驱逐策略,在长上下文任务中提升准确性。

Comments ICML 2026

详情
AI中文摘要

具有扩展上下文窗口的大型语言模型(LLM)实现了强大的应用,但带来了显著的内存开销,因为缓存所有键值(KV)状态随序列长度和批大小线性扩展。现有的缓存驱逐方法通过利用注意力稀疏性来解决这一问题,但它们通常使用累积注意力权重对令牌进行启发式排名,而不考虑其对注意力输出的真实影响。我们提出了最优脑缓存(OBCache),一个将缓存驱逐形式化为逐层结构化剪枝问题的原则性框架。基于最优脑损伤(OBD)理论,OBCache通过测量由剪枝令牌引起的注意力输出扰动来量化令牌显著性,并为孤立键、孤立值以及联合键值对推导出闭式分数。我们的分数不仅考虑了注意力权重,还考虑了值状态和注意力输出的信息,从而通过输出感知信号增强了现有的驱逐策略。在LLaMA和Qwen模型上的实验表明,将现有工作中跨不同查询位置估计令牌显著性的启发式分数替换为OBCache的输出感知分数,持续提高了长上下文准确性。代码可在 https://github.com/DreamSoul-AI/OBCache 获取。

英文摘要

Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy. Code is available at https://github.com/DreamSoul-AI/OBCache.

2510.13940 2026-06-16 cs.CL cs.AI 版本更新

Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

少即是多:用最小测试时干预提升大语言模型推理能力

Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州)) Kuaishou Technology(快手科技) AIML(人工智能实验室) ZJU(浙江大学) Ant Group(蚂蚁集团) HKUST(香港科技大学)

AI总结 针对大语言模型推理中高计算成本问题,提出最小测试时干预(MTI)框架,通过仅在不确定位置应用分类器自由引导和轻量负提示引导,在保持高效的同时提升推理准确性和稳定性。

Comments Code: https://github.com/EnVision-Research/MTI

详情
AI中文摘要

大语言模型(LLMs)的最新进展集中在通过增加推理计算来改进测试时扩展以提升推理能力,但这往往以牺牲效率为代价。我们重新审视测试时行为,发现了一个简单但未被充分探索的现象:推理不确定性高度局部化——只有一小部分高熵标记对输出正确性起主导作用。受此启发,我们提出了最小测试时干预(MTI),这是一个无需训练的框架,以最小的开销增强推理准确性和稳定性。MTI包括:(i)选择性CFG干预,仅在不确定位置应用分类器自由引导;(ii)轻量负提示引导,重用主模型的KV缓存以高效近似无条件解码。MTI在通用、编码和STEM任务上均取得一致提升——例如,在DeepSeek-R1-7B的六个基准测试上平均提升9.28%,在使用Ling-mini-2.0的AIME2024上提升11.25%——同时保持高效性。

英文摘要

Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized -- only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks -- e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0 -- while remaining highly efficient.
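
The selective CFG step in this abstract can be sketched as an entropy gate in front of the usual classifier-free guidance formula. The threshold and guidance scale below are placeholder values, not the paper's settings, and the negative-prompt branch is represented simply by a second logit vector passed in by the caller.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_cfg(cond_logits, uncond_logits, scale=1.5, entropy_threshold=1.0):
    """Apply classifier-free guidance only when the conditional
    next-token distribution is uncertain (high entropy)."""
    if entropy(softmax(cond_logits)) < entropy_threshold:
        return list(cond_logits)  # confident position: leave logits untouched
    # Standard CFG combination: move away from the unconditional logits.
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


print(selective_cfg([10.0, 0.0, 0.0], [0.0, 1.0, 2.0]))            # low entropy, unchanged
print(selective_cfg([1.0, 1.0, 1.0], [0.0, 1.0, 2.0], scale=2.0))  # high entropy, guided
```

Because the gate fires only at a small fraction of positions, the extra unconditional pass is needed rarely, which is the source of the efficiency claim.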

2511.08577 2026-06-16 cs.CL cs.AI cs.LG cs.PF 版本更新

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Think-at-Hard: 选择性潜在迭代以改进推理语言模型

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

AI总结 针对循环变压器中潜在过思考问题,提出Think-at-Hard方法,通过轻量级决策器选择性地在困难令牌上触发潜在迭代,并采用深度感知LoRA和双因果注意力机制,在数学、问答和编码任务上一致提升性能。

Comments Accepted by ICML'26

详情
AI中文摘要

提升大型语言模型(LLMs)的推理能力,特别是在参数约束下,对实际应用至关重要。循环变压器通过执行多次潜在迭代来细化每个令牌,超越单次前向传播。然而,我们识别出一种潜在过思考现象:大多数令牌预测在第一次前向传播后已经正确,但在后续迭代中有时会被修改为错误。我们询问选择性地跳过潜在迭代是否能提高准确性,并揭示了一个显著的潜力:使用预言迭代策略可将性能提升高达7.3%。受此启发,我们提出了Think-at-Hard (TaH),一种针对选择性迭代优化的循环变压器。TaH采用轻量级神经决策器来触发潜在迭代,仅在标准前向传播后可能不正确的令牌上触发。在潜在迭代期间,深度感知的低秩适应(LoRA)模块将目标从一般的下一个令牌预测转变为聚焦的困难令牌细化。双因果注意力机制将注意力从令牌序列维度扩展到额外的迭代深度维度,实现跨迭代信息流,同时保持完全的序列并行性。在九个基准上的实验显示,在数学、问答和编码任务上一致提升。在相同参数数量下,TaH在93%的令牌上跳过迭代,性能比始终迭代的基线高3.8-4.4%,并超过单次迭代的Qwen3基线3.0-3.8%。当允许LoRA和决策器增加不到3%的参数时,增益分别进一步增加到5.3-6.2%和6.1-6.8%。我们的代码可在以下网址获取:https://github.com/thu-nics/TaH。

英文摘要

Improving the reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. We ask whether selectively skipping latent iterations can improve accuracy, and reveal significant potential with an oracle iteration policy that boosts performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration, only at tokens likely to be incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.

2512.21577 2026-06-16 cs.CL cs.AI cs.LG stat.ML 版本更新

A Unified Definition of Hallucination: It's The World Model, Stupid!

幻觉的统一定义:是世界模型的问题,笨蛋!

Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出幻觉的统一定义,即用户可观察到的错误内部世界建模,并连接至HalluWorld基准测试,以区分真实幻觉与规划或奖励错误。

Comments ICML 2026. HalluWorld benchmark at https://github.com/DegenAI-Labs/HalluWorld

详情
AI中文摘要

尽管自语言模型诞生以来已有无数缓解尝试,但即使在当今最前沿的LLM中,幻觉仍然是一个持续存在的问题。这是为什么?我们回顾了现有的幻觉定义,并将它们整合为一个统一的定义,其中先前的定义被包含在内。我们认为,幻觉可以通过将其简单地定义为不准确的(内部)世界建模来统一,其形式是用户可观察到的。例如,陈述与知识库相矛盾的事实,或生成与来源相矛盾的摘要。通过改变参考世界模型和冲突策略,我们的框架统一了先前的定义。我们认为,这种统一观点是有用的,因为它迫使评估澄清其假定的参考“世界”,区分真实幻觉与规划或奖励错误,并为跨基准比较和缓解策略讨论提供共同语言。基于这一定义,我们还将我们的框架连接到HalluWorld,这是一个补充基准,它实例化了完全指定的参考世界模型,用于压力测试模型幻觉。

英文摘要

Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. Examples include stating a fact that contradicts a knowledge base or producing a summary that contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.

2601.17421 2026-06-16 cs.CL 版本更新

Oops, Wait: Discourse Tokens Matter in Reasoning Model

哎呀,等等:话语标记在推理模型中的重要性

Jaehui Hwang, Byeongho Heo, Sangdoo Yun, Dongyoon Han

发表机构 * NAVER AI Lab(NAVER人工智能实验室)

AI总结 本文研究话语标记(如“wait”)在推理轨迹中的作用,发现数据高效微调可部分复现其模式,但不如大规模后训练与高置信答案转换对齐。

详情
AI中文摘要

最近的研究表明,通过后训练,即使使用少量(约1K)推理轨迹进行数据高效训练,也能诱导大型语言模型产生非平凡的推理能力。此类训练语料通常包含诸如“wait”、“so”和“alternatively”等标志性标记,这些标记频繁出现在推理轨迹中,并可能在此过程中发挥作用。本文专注于描述后训练中可观察的标记级模式,并案例研究数据高效监督微调(SFT)与大规模后训练的不同之处及其不足之处。为此,我们首先识别跨模型和训练设置中与推理轨迹上正确答案相关的标记。然后,我们聚焦于“wait”标记的分布和(功能)角色,主要研究数据高效训练的模型与对应模型的对比。我们的研究发现,即使在数据高效SFT中,话语标记也与正确性和推理准确性的跳跃相关。这表明数据高效SFT可以部分复现话语标记模式以模仿有意义的推理行为,但这些模式与高置信答案转换的对齐程度不如大规模后训练。

英文摘要

Recent studies suggest that even data-efficient training with ($\simeq$1K) reasoning trajectories can induce non-trivial reasoning capabilities in large language models through post-training. Such training corpora often contain iconic tokens such as "wait", "so", and "alternatively", which frequently appear in reasoning trajectories and may play a role in this process. This paper focuses on characterizing observable token-level patterns in post-training and presents a case study of how data-efficient supervised fine-tuning (SFT) differs from, and falls short of, large-scale post-training. To this end, we first identify tokens that correlate with correct answers along reasoning trajectories across models and training setups. We then focus on the distribution and (functional) roles of the "wait" token, primarily comparing the model trained in a data-efficient manner with its large-scale counterpart. Our study finds that discourse tokens are associated with correctness and a reasoning accuracy jump, even in data-efficient SFT. This suggests data-efficient SFT can partially reproduce discourse-token patterns to mimic meaningful reasoning behavior, but the patterns are less aligned with high-confidence answer transitions than those from large-scale post-training.

2602.11543 2026-06-16 cs.CL 版本更新

Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

使用分布式GPU预训练大型语言模型:一种内存高效的分散式范式

Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, Lei Zhang

发表机构 * Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算机系) OPPO Research Institute(OPPO研究院)

AI总结 提出SPES框架,通过分散式训练MoE LLM的子集专家降低内存需求,结合专家合并预热策略,在16个48GB GPU上训练2B参数模型,性能媲美集中式训练。

详情
AI中文摘要

预训练大型语言模型(LLMs)通常需要配备数千个高内存GPU(如H100/A100)的集中式集群。最近的分散式训练方法通过采用联邦优化来减少通信开销;然而,它们仍然需要在每个节点上训练整个模型,因此仍受限于GPU内存限制。在这项工作中,我们提出了SPES(稀疏专家同步),一种用于预训练混合专家(MoE)LLMs的内存高效分散式框架。SPES在每个节点上仅训练一部分专家,大幅降低了内存占用。每个节点更新其本地专家,并定期与其他节点同步,消除了全参数传输,同时确保高效的知识共享。为了缓解稀疏专家更新下每个专家数据利用率有限的问题,我们引入了一种专家合并预热策略,即在训练早期让专家交换知识,以快速建立基础能力。通过SPES,我们使用16个独立的48GB GPU通过互联网连接训练了一个2B参数的MoE LLM,在相似计算预算下取得了与集中式训练LLM相竞争的性能。我们进一步展示了可扩展性,从头开始训练了一个7B模型,并从密集检查点升级了一个9B模型,两者均匹配先前的集中式基线。我们的代码可在 https://github.com/zjr2000/SPES 获取。

英文摘要

Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To mitigate limited per-expert data utilization under sparse expert updates, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.

2603.08999 2026-06-16 cs.CL 版本更新

Learning When to Sample: Confidence-Aware Selective Sampling for Efficient Chain-of-Thought Reasoning

学习何时采样:面向高效链式思维推理的置信度感知选择性采样

Juming Xiong, Kevin Guo, Congning Ni, Wexin Liu, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin

发表机构 * Vanderbilt University(范德比尔特大学) Vanderbilt University Medical Center(范德比尔特大学医学中心) Intuit AI Research(Intuit AI研究院)

AI总结 提出置信度感知选择性采样框架,通过分析单条推理轨迹自适应决定是否触发多路径采样,在保持性能的同时显著降低推理成本。

详情
AI中文摘要

大型语言模型(LLMs)通过链式思维(CoT)推理能够实现强大的推理性能,但往往生成不必要的长推理路径,导致高昂的推理成本。基于自一致性的方法进一步提高准确性,但需要采样和聚合多个推理轨迹,带来大量计算开销。本文提出一种置信度感知的选择性采样框架,在推理时分析单条推理轨迹,自适应地决定是仅依赖该轨迹还是触发多路径采样。该框架利用从推理状态中提取的轨迹级数值特征和句子级语言特征来指导选择性多路径推理。我们在MedQA上训练该框架,并在MedQA上进行域内评估,以及在MathQA、MedMCQA和MMLU上进行仅校准的迁移评估,无需进一步微调。实验结果表明,所提框架与完整和高效的多路径推理基线相比,性能相当,准确率变化分别为$-0.41 \pm 0.58$和$-0.31 \pm 0.58$个百分点,同时令牌使用量分别减少$71.7 \pm 5.0\%$和$36.6 \pm 9.1\%$。这些发现表明,推理轨迹包含用于不确定性估计的丰富信号,从而能够实现一种简单、可迁移的机制来平衡LLM推理中的准确性和效率。

英文摘要

Large language models (LLMs) can achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet they often generate unnecessarily long reasoning paths that incur high inference cost. Self-consistency-based approaches push accuracy higher still, but they require sampling and aggregating multiple reasoning trajectories, leading to substantial computational overhead. In this paper, we introduce a confidence-aware selective sampling framework that, at inference time, analyzes a single reasoning trajectory to adaptively determine whether to rely on that trajectory alone or trigger multi-path sampling. The framework uses trajectory-level numeric features and sentence-level linguistic features extracted from reasoning states to guide selective multi-path reasoning. We train it on MedQA and evaluate it in-domain on MedQA and under calibration-only transfer on MathQA, MedMCQA, and MMLU, without further fine-tuning. Experimental results show that the proposed framework maintains comparable performance to full and efficient multi-path reasoning baselines, with accuracy changes of $-0.41 \pm 0.58$ and $-0.31 \pm 0.58$ percentage points, respectively, while reducing token usage by $71.7 \pm 5.0\%$ and $36.6 \pm 9.1\%$. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
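
The gating logic described in this abstract can be sketched as a single decision followed by an optional majority vote. In the paper the confidence comes from a trained model over trajectory features; here it is simply passed in as a number, and the threshold, path count, and the `sample_more` callback are hypothetical placeholders.

```python
from collections import Counter


def selective_answer(first_answer, confidence, sample_more, threshold=0.8, n_paths=5):
    """Return (answer, number_of_trajectories_used).

    If the first trajectory looks reliable, keep it; otherwise draw extra
    trajectories and take a self-consistency style majority vote.
    """
    if confidence >= threshold:
        return first_answer, 1
    answers = [first_answer] + [sample_more() for _ in range(n_paths - 1)]
    return Counter(answers).most_common(1)[0][0], n_paths


# High confidence: one trajectory is enough.
print(selective_answer("A", 0.93, sample_more=lambda: "B"))
# Low confidence: fall back to multi-path sampling and voting.
print(selective_answer("A", 0.35, sample_more=lambda: "B"))
```

The token savings reported in the abstract would come from how often the first branch is taken, which depends on the calibration of the confidence estimate.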

2604.03472 2026-06-16 cs.CL cs.AI 版本更新

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

词汇丢弃:LLM共同进化中的课程多样性

Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 针对LLM共同进化中问题多样性崩溃的问题,提出词汇丢弃机制,通过在策略训练和课程生成时随机掩码输出logits维持多样性,在数学推理任务上提升求解器性能平均+4.4点。

详情
AI中文摘要

共同进化自我对弈,其中一个语言模型生成问题,另一个求解,有望在没有人类监督的情况下实现自主课程学习。在实践中,提议者迅速收敛到满足奖励函数的狭窄问题分布。这种多样性崩溃使得课程对求解者无信息量,从而停滞共同进化循环。我们引入词汇丢弃,一种在策略训练和课程生成期间应用于提议者输出logits的随机掩码,作为维持多样性的轻量级机制。该掩码是硬性的且非平稳的,防止提议者锁定在固定的token序列上。通过R-Zero在数学推理上训练Qwen3-4B和Qwen3-8B,我们发现词汇丢弃在整个训练过程中在词汇、语义和功能指标上维持了提议者的多样性。它还带来了求解器性能的提升,在8B规模上平均提高+4.4点,在竞赛级基准上增益最大。我们的发现表明,显式的动作空间约束,类似于经典自我对弈中游戏规则的结构性作用,可以帮助维持语言中的生产性共同进化。词汇丢弃是该原则的一个简单实例。

英文摘要

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training. It also yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
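
Vocabulary dropout as described in this abstract amounts to a hard, freshly resampled mask on the proposer's output logits. The sketch below is a minimal illustration: the drop probability, the per-call resampling granularity, and the safeguard that keeps at least one token are assumptions, not details confirmed by the abstract.

```python
import math
import random


def vocabulary_dropout(logits, drop_prob, rng):
    """Return logits with a freshly sampled hard mask: dropped tokens get -inf."""
    keep = [rng.random() >= drop_prob for _ in logits]
    if not any(keep):
        # Assumed safeguard: never mask the entire vocabulary.
        keep[max(range(len(logits)), key=logits.__getitem__)] = True
    return [x if k else -math.inf for x, k in zip(logits, keep)]


def sample(logits, rng):
    """Sample a token index from softmax(logits); masked tokens have zero weight."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [1.0, 2.0, 3.0, 4.0]
for step in range(3):
    masked = vocabulary_dropout(logits, drop_prob=0.5, rng=rng)  # new mask each call
    print(step, masked, sample(masked, rng))
```

Because the mask changes on every call, the proposer cannot rely on any fixed token sequence always being available, which is the non-stationarity the abstract refers to.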

2604.25853 2026-06-16 cs.CL cs.AI cs.LG 版本更新

G-Loss: Graph-Guided Fine-Tuning of Language Models

G-Loss:图引导的语言模型微调

Aditya Sharma, Vinti Agarwal, Rajesh Kumar

发表机构 * BITS Pilani(BITS 派拉尼) Bucknell University(巴克内尔大学)

AI总结 提出G-Loss损失函数,通过构建文档相似度图并利用半监督标签传播捕捉全局语义结构,引导语言模型学习更具判别性和鲁棒性的嵌入,在多个分类任务上提升准确率并加速收敛。

Comments 20 pages, Learning on Graphs (LoG2025)

详情
AI中文摘要

用于微调预训练语言模型(如BERT)的传统损失函数,包括交叉熵、对比损失、三元组损失和监督对比损失,仅在局部邻域内操作,未能考虑全局语义结构。我们提出了G-Loss,一种图引导的损失函数,它结合半监督标签传播来利用嵌入流形中的结构关系。G-Loss构建了一个文档相似度图,捕捉全局语义关系,从而引导模型学习更具判别性和鲁棒性的嵌入。我们在五个涵盖关键下游分类任务的基准数据集上评估了G-Loss:MR(情感分析)、R8和R52(主题分类)、Ohsumed(医学文档分类)和20NG(新闻分类)。在大多数实验设置中,G-Loss收敛更快,并产生语义一致的嵌入空间,从而比使用传统损失函数微调的模型获得更高的分类准确率。

英文摘要

Traditional loss functions, including cross-entropy, contrastive, triplet, and supervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We evaluate G-Loss on five benchmark datasets covering key downstream classification tasks: MR (sentiment analysis), R8 and R52 (topic categorization), Ohsumed (medical document classification), and 20NG (news categorization). In the majority of experimental setups, G-Loss converges faster and produces semantically coherent embedding spaces, resulting in higher classification accuracy than models fine-tuned with traditional loss functions.

2605.07013 2026-06-16 cs.CL 版本更新

CoBit: Language Modeling with Bitstream Diffusion

CoBit: 基于比特流扩散的语言建模

Georgios Batzolis, Mark Girolami, Luca Ambrogioni

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出CoBit模型,将文本建模为固定宽度二进制比特流上的连续扩散过程,采用匹配滤波残差参数化和基于熵率门控的朗之万校正采样器,在LM1B和OpenWebText上达到接近自回归模型的生成困惑度,并消除词汇表缩放瓶颈。

详情
AI中文摘要

扩散语言模型(DLM)有望实现并行、顺序无关的生成,但在标准基准测试中,它们在样本质量和多样性上历来落后于自回归模型。最近的连续流和扩散方法缩小了这一差距。在这项工作中,我们通过将文本建模为固定宽度二进制比特流上的连续扩散过程,进一步缩小了自回归差距。我们将所得模型称为CoBit(连续比特流扩散)。我们的方法将语义标记表示为模拟比特序列,并使用匹配滤波残差参数化将上下文学习与解析的独立比特后验分离。关键的是,我们采用了一种随机采样器,该采样器应用由熵率分布门控的朗之万型校正,将随机性集中到高信息区域,而在其他区域几乎保持确定性。在LM1B上,我们的130M参数模型在匹配真实数据熵(4.31)的情况下,使用256次神经函数评估(NFE)达到了59.76的生成困惑度(GenPPL),优于先前的DLM基线,并达到了自回归参考水平。在OpenWebText(OWT)上,我们的采样器建立了一个新的连续DLM帕累托前沿,在熵为5.26时实现了27.06的GenPPL,使用的步数比先前1024 NFE基线少4倍。将相同的配方扩展到462M参数模型(CoBit-M)进一步改善了OWT上GenPPL-熵前沿,优于130M模型(CoBit-S)以及中等规模的连续和离散DLM基线,在熵为5.40时达到GenPPL 19.5,接近真实数据熵(5.44),并在高质量区域接近预训练的GPT-2 Medium。作为一个额外的好处,比特流扩散消除了标准DLM的O(V)词汇表缩放瓶颈:通过语义比特修补预测O(log V)逐比特逻辑,它降低了内存并提高了吞吐量,这是一种随着词汇表大小增长而可扩展的范式。

英文摘要

Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches have narrowed this gap. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. We refer to the resulting model as CoBit (Continuous Bitstream Diffusion). Our approach represents semantic tokens as analog bit sequences and uses a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On LM1B, our 130M-parameter model reaches a generative perplexity (GenPPL) of 59.76 at matched real-data entropy (4.31) using 256 neural function evaluations (NFEs), outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our sampler establishes a new continuous-DLM Pareto frontier, achieving GenPPL 27.06 at entropy 5.26 using 4x fewer steps than previous 1024-NFE baselines. Scaling the same recipe to a 462M-parameter model (CoBit-M) further improves the OWT GenPPL-entropy frontier over the 130M model (CoBit-S) and over medium-scale continuous and discrete DLM baselines, reaching GenPPL 19.5 at entropy 5.40, near real-data entropy (5.44), and approaching pretrained GPT-2 Medium over the high-quality region. As an additional benefit, bitstream diffusion removes the O(V) vocabulary scaling bottleneck of standard DLMs: by predicting O(log V) bitwise logits via semantic bit-patching, it lowers memory and raises throughput, a scalable paradigm as vocabulary sizes grow.
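The O(log V) claim follows from the fact that a vocabulary of V tokens can be indexed with ceil(log2 V) bits. The sketch below shows only that arithmetic with a plain binary encoding of token ids; it is an assumption-laden illustration and not the paper's semantic bit-patching, whose code assignment is not described in the abstract. The vocabulary size 50,257 (the GPT-2 tokenizer size) is used purely as an example.

```python
def bits_needed(vocab_size: int) -> int:
    """Smallest fixed width (in bits) that can index every token id."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_bits(token_id: int, width: int) -> list[int]:
    """Encode a token id as a fixed-width bit sequence, most significant bit first."""
    return [(token_id >> i) & 1 for i in reversed(range(width))]


def bits_to_token(bits: list[int]) -> int:
    """Decode a bit sequence (most significant bit first) back into a token id."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value
```

With a 50,257-token vocabulary the model would predict 16 bitwise logits per position instead of 50,257 token logits, which is where the memory and throughput argument comes from.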

2605.17106 2026-06-16 cs.CL cs.LG 版本更新

HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

HyDRA:异构LLM池的混合动态路由架构

Aashna Garg, Siddharth Singha Roy, Jinu Jang, Federico Brancasi, Shengyu Fu

发表机构 * Microsoft(微软)

AI总结 本文提出HyDRA,一种混合动态路由架构,可针对每个查询预测细粒度的多维能力需求,并将其与配置中定义的模型能力画像进行匹配,从而在异构LLM池中实现高效且无需重新训练的模型选择。

Comments preprint v2

详情
AI中文摘要

生产环境中的LLM部署越来越多地维护成本相差数量级的异构模型池。现有路由器只做二元的强弱决策,并将学习到的参数与特定模型身份耦合,每当模型目录变更时都需要重新训练。我们提出了HyDRA(混合动态路由架构),该框架为每个查询预测细粒度、多维的能力需求,并通过短缺匹配将其与配置中定义的模型能力画像进行匹配。一个带有K=4个独立sigmoid头的ModernBERT编码器在推理、代码生成、调试和工具使用四个维度上为每个查询评分;随后,短缺匹配算法选出能力满足预测需求的最便宜模型。已部署的预测器在生产环境中的CPU推理延迟中位数为86毫秒,并且与模型目录完全解耦:添加或删除模型只需更改配置,无需任何重新训练。在SWE-Bench Verified(5模型池:GPT-5.4-mini、Claude Haiku 4.5、GPT-5.3 Codex、Claude Sonnet 4.6、GPT-5.4)上,HyDRA的可调短缺阈值覆盖三种工作模式:峰值质量模式在节省12.9%成本的同时超过始终使用强模型Claude Sonnet 4.6的基线(解决率75.4%对74.2%);等质量模式在节省54.1%成本的同时与Sonnet持平,相比我们先前的内部二元路由器(9.1%)提升6倍;激进模式以3.2个百分点的质量损失为代价将成本节省提高到72.5%。这些结果可推广到LiveCodeBench、BigCodeBench和tau-bench。HyDRA已面向所有用户部署在GitHub Copilot的VS Code Chat自动模式中,并且据我们所知,在LLM路由文献中首次展示了跨CJK、欧洲及其他文字体系的语言无关路由。

英文摘要

Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog -- adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA's tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot's VS Code Chat auto-mode and -- to our knowledge for the first time in the LLM routing literature -- demonstrates language-invariant routing across CJK, European, and other script families.
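The routing step described above (score a query on four capability dimensions, then pick the cheapest model whose profile covers the predicted requirements) can be sketched as follows. This is a hypothetical reading, not HyDRA's code: the shortfall definition (summed positive gap between requirement and capability), the fallback when no model qualifies, and the model names, costs, and scores are all assumptions made for illustration.

```python
from dataclasses import dataclass

DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    name: str           # hypothetical catalog entry, defined in configuration
    cost: float         # relative cost per request
    capabilities: dict  # score in [0, 1] for each dimension


def shortfall(requirements: dict, capabilities: dict) -> float:
    """Assumed definition: total amount by which requirements exceed capabilities."""
    return sum(max(0.0, requirements[d] - capabilities[d]) for d in DIMENSIONS)


def route(requirements: dict, catalog: list, threshold: float = 0.0) -> ModelProfile:
    """Pick the cheapest model whose shortfall is within the tunable threshold."""
    eligible = [m for m in catalog if shortfall(requirements, m.capabilities) <= threshold]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    # Assumed fallback: no model qualifies, so take the smallest shortfall.
    return min(catalog, key=lambda m: (shortfall(requirements, m.capabilities), m.cost))
```

Because the catalog is plain data, adding or removing a model changes only the list passed to `route`, which mirrors the decoupling from the model catalog that the abstract emphasizes. Raising `threshold` trades quality for savings, in the spirit of the three regimes reported.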

2605.21850 2026-06-16 cs.CL cs.AI 版本更新

ACC: Compiling Agent Trajectories for Long-Context Training

ACC:用于长上下文训练的代理轨迹编译

Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao, Kou Shi, Ziao Zhang, Lin Chen, Zehui Chen, Lijun Wu, Feng Zhao

发表机构 * MoE Key Lab of BIPC, University of Science and Technology of China(中国科学技术大学 BIPC 教育部重点实验室) Shanghai Innovation Institute(上海创新研究院) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 本文提出ACC,一种将代理轨迹编译为长上下文问答对的方法,通过整合多轮交互中的工具响应和环境观察,提升大语言模型的长上下文推理能力。

详情
AI中文摘要

近期代理的发展重新激发了对LLM长上下文推理能力的需求。然而,训练LLM具备这种能力需要成本高昂的长文档整理或启发式的上下文合成。我们观察到,代理在解决问题时会产生大量轨迹,在多个回合中调用工具并接收环境观察。因此,回答原始问题所需的证据分散在这些回合中,需要整合相距较远的上下文片段。然而,标准的代理SFT会屏蔽工具响应,只训练回合级的工具选择,由此产生监督盲区,使这些分散的信号无法被利用。我们提出Agent Context Compilation(ACC),将搜索、软件工程和数据库查询代理的轨迹转换为长上下文QA对,把原始问题与多个回合中收集到的工具响应和环境观察组合在一起,训练模型在不使用工具的情况下直接作答。这使问题与证据之间的依赖关系变得显式,从而无需额外标注即可直接监督针对远距离片段的长上下文推理。ACC是一种简单而有效的方法,可与任何现有的长上下文扩展或训练方法结合,提供可扩展的监督微调数据。我们在MRCR和GraphWalks这两个长距离依赖建模基准上验证了ACC,它们是需要跨回合共指消解和在长上下文中进行图遍历的高难度基准。使用ACC训练Qwen3-30B-A3B后,模型在MRCR上达到68.3(+18.1),在GraphWalks上达到77.5(+7.6),结果与Qwen3-235B-A22B相当,同时在GPQA、MMLU-Pro、AIME和IFEval上保持了通用能力。进一步的机制分析表明,经ACC训练的模型表现出任务自适应的注意力重构和专家专业化。

英文摘要

Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.

2606.02955 2026-06-16 cs.CL cs.AI cs.LG 版本更新

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Fast-dLLM++: 用于更快扩散LLM推理的Fréchet轮廓解码

Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对扩散大语言模型推理中并行令牌生成的瓶颈,提出Fréchet轮廓解码方法,通过利用异构置信度轮廓选择并行提交集,在保持模型和缓存不变的情况下提升吞吐量。

Comments Initial version accepted at Workshop on Structured Probabilistic Inference & Generative Modeling, ICML 2026. Project Page: https://ringo-star.github.io/projectpage_frechet/

详情
AI中文摘要

扩散大语言模型有望实现并行令牌生成,但推理仍然受限于一个瓶颈:判断哪些掩码令牌可以安全地一起提交。Fast-dLLM通过KV缓存和置信度引导的并行解码解决了这个问题,但其解码理论采用同质高置信度假设,实际上将每个候选集简化为其中置信度最低的被选令牌。我们认为这没有充分挖掘加速潜力,因为实际解码步骤表现出异构的置信度轮廓。我们提出Fast-dLLM++,一种无需训练的扩展,引入了Fréchet轮廓解码:依据完整的排序置信度轮廓而非单个最坏情况置信度来选择并行提交集。所得规则是Fast-dLLM因子选择器在异构置信度下的推广:在等置信度情况下精确还原先前规则,并在所选令牌置信度不均匀时增加一个可证明的异构性奖励。Fast-dLLM++完全不改动模型、扩散过程和缓存实现,可直接替换现有的Fast-dLLM解码。在GSM8K、MATH、HumanEval和MBPP上使用LLaDA-8B模型的实验表明,理论上的改进直接转化为经验收益:轮廓感知选择利用了最弱令牌规则所忽略的安全并行性,改进了准确率-吞吐量前沿,在可比准确率下实现了高达37%的吞吐量提升。代码发布地址:https://github.com/Ringo-Star/FastdLLM_plusplus

英文摘要

Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory uses a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose Fast-dLLM++, a training-free extension that introduces Fréchet profile decoding: selecting parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector: it recovers the previous rule exactly in the equal-confidence case and adds a provable heterogeneity bonus when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy-throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37% higher throughput at comparable accuracy. Our code release is at https://github.com/Ringo-Star/FastdLLM_plusplus.
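The gap between a weakest-token rule and a full-profile rule can be illustrated with the classical Fréchet lower bound on a joint probability, P(A_1 and ... and A_n) >= max(0, 1 - sum_i (1 - P(A_i))). The sketch below is an assumption-based illustration only: the abstract does not state the paper's selection rule, so `min_joint` and the prefix-selection procedure are hypothetical, and the homogeneous bound simply replaces every confidence with the smallest one.

```python
def frechet_joint_lower_bound(confidences):
    """Fréchet lower bound on the probability that all tokens are correct."""
    return max(0.0, 1.0 - sum(1.0 - c for c in confidences))


def weakest_token_lower_bound(confidences):
    """Same bound after treating every token as if it had the lowest confidence."""
    n = len(confidences)
    return max(0.0, 1.0 - n * (1.0 - min(confidences)))


def largest_commit_set(confidences, min_joint, bound):
    """Largest top-k prefix of the sorted profile whose bound stays >= min_joint."""
    ranked = sorted(confidences, reverse=True)
    best = 0
    for k in range(1, len(ranked) + 1):
        if bound(ranked[:k]) >= min_joint:
            best = k
        else:
            break  # both bounds are non-increasing in k, so stop at first failure
    return best
```

For the profile `[0.99, 0.99, 0.99, 0.95]` with `min_joint = 0.9`, the full-profile bound admits all four tokens (bound 0.92) while the weakest-token bound admits only three, since treating all four as 0.95-confident gives 0.80. With equal confidences the two bounds coincide, which matches the abstract's statement that the previous rule is recovered in the equal-confidence case.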

2606.05014 2026-06-16 cs.CL 版本更新

Depth-Attention: Cross-Layer Value Mixing for Language Models

深度注意力:语言模型的跨层值混合

Boyi Zeng, Yiqin Hao, Zitong Wang, Shixiang Song, He Li, Feichen Song, Yifan Liu, Ziwei He, Xinbing Wang, Zhouhan Lin

发表机构 * LUMIA Lab(LUMIA实验室) School of Artificial Intelligence(人工智能学院) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) Sun Yat-sen University(中山大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出深度注意力机制,在注意力模块内部实现跨层值混合,无需额外参数和推理状态,提升语言模型性能。

Comments 21 pages, 4 figures, 9 tables

详情
AI中文摘要

自注意力机制可以在序列中自由选择信息,但在深度方向上,Transformer仅将每一层的输出加到残差流中,因此后续层无法选择性重用早期层的表示。最近的跨层方法改善了这种流动,但在注意力之外的隐藏状态上操作,在推理时增加了键值缓存之外的状态——随着现代LLM使用分组查询和多头潜在注意力压缩缓存,这一成本日益显著。我们引入深度注意力,它在注意力模块内部执行这种选择:在一层对序列进行注意力之前,其查询在同一token位置上对早期层的键进行注意力,并将它们的值混合到自注意力随后读取的值中。由于深度注意力重用标准的注意力查询、键和值缓存槽,将深度混合后的值替换原始值,因此它不增加参数,也不引入超出标准键值缓存的持久推理状态——缓存大小与普通解码器相同,且小于基于隐藏状态的跨层方法。在1.5B和3B参数的Qwen3风格解码器上,深度注意力取得了最低的困惑度和最高的平均下游准确率,相比普通Transformer提升高达2.3个准确率点,在困惑度和平均准确率上超越了强跨层基线,同时仅增加不到0.01%的额外算术FLOPs,且无额外持久推理状态。这些增益在360M到3B参数范围内保持一致,并扩展到循环Transformer。

英文摘要

Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
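For a single token position, the depth-wise step described above amounts to one small attention over layers. The following is a minimal sketch under stated assumptions, not the paper's implementation: it handles one head at one position, uses the usual 1/sqrt(d) scaling, and leaves open which layers are included in `layer_keys` and `layer_values`, none of which the abstract specifies.

```python
import math


def depth_mix_value(query, layer_keys, layer_values):
    """Mix earlier-layer values at one token position using depth-wise attention.

    query:        current layer's query vector at this position
    layer_keys:   one key vector per earlier layer, same position
    layer_values: one value vector per earlier layer, same position
    Returns (mixed_value, attention_weights_over_layers).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in layer_keys
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # stable softmax over layers
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, layer_values)) for i in range(dim)]
    return mixed, weights
```

The mixed vector has the same shape as an ordinary value vector, so it can occupy the same cache slot, which is consistent with the abstract's claim that no persistent state is added beyond the standard key-value cache.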

2606.14142 2026-06-16 cs.CL cs.AI 版本更新

Implicit Reasoning for Large Language Model-based Generative Recommendation

基于大语言模型的生成式推荐的隐式推理

Yinhan He, Liam Collins, Bhuvesh Kumar, Jundong Li, Neil Shah, Donald Loveland

发表机构 * University of Virginia(弗吉尼亚大学) Snap Inc.(Snap公司)

AI总结 针对大语言模型用于生成式推荐时显式推理的三大局限(世界知识表达弱化、语义ID与自然语言嵌入空间不对齐、推理质量敏感),提出轻量级隐式推理范式PauseRec,在性能、训练成本和推理速度上均优于显式方法。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被用作生成式推荐(GR)的骨干,有望利用预训练的世界知识。然而,如何可靠地调用这些知识进行GR仍不清楚。一个关键障碍是,基于LLM的GR通常使用语义ID(SIDs)表示物品,这破坏了LLM的自然语言推理接口,因为这些标记在预训练期间对LLM是未见过的。现有方法通过昂贵的多阶段流程来应对,这些流程将SID接地并引发显式推理,但对每个阶段何时以及为何必要提供的见解有限。在这项工作中,我们系统地分解了基于LLM的GR的显式推理训练流程,揭示了三个关键局限:弱化的世界知识表达、SID与自然语言标记嵌入空间之间的不对齐,以及对推理质量的敏感性,所有这些都损害了显式推理性能。为了规避这些问题,我们提出了PauseRec,一种为GR量身定制的轻量级隐式推理范式。PauseRec非常实用,避免了昂贵的推理轨迹获取和推理对齐训练,带来了诸多好处:(1)其性能比标准显式CoT方法高出高达6.22%,(2)将训练成本降低高达65%的GPU小时,(3)将推理速度提升高达71.3%。这些结果使PauseRec成为显式推理生成的轻量级替代方案,能够实现更有效、更高效的基于LLM的GR。

英文摘要

Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs' natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.

2606.14694 2026-06-16 cs.CL 版本更新

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

AdaSR: 自适应流式推理与分层相对策略优化

Junlong Tong, Wenqi Xu, Yingqi Fan, Anhao Zhao, Xuan Lu, Yang Tan, Xiaoyu Shen

发表机构 * Eastern Institute of Technology, Ningbo(宁波东方理工大学) Shanghai Jiao Tong University(上海交通大学) The Hong Kong Polytechnic University(香港理工大学) Southeast University(东南大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 提出AdaSR框架,通过分层相对策略优化(HRPO)实现流式输入下的自适应推理,在推理准确率、计算效率和流式延迟间取得更好平衡。

详情
AI中文摘要

大型推理模型通常遵循先读后想的范式:它们观察完整输入,在静态上下文中推理,然后产生答案。然而许多真实场景本质上是动态的,例如音频和视频流,信息以连续流的形式到达,模型必须在部分观察下进行推理、更新和响应。最近的流式推理方法允许模型边读边想,但它们主要依赖于对预构建轨迹的监督模仿,这限制了其灵活性。在本文中,我们提出AdaSR,一种自适应流式推理框架,使模型能够在输入流式传输期间进行推理,并在流完成后进行最终深思,学习何时思考以及在不同阶段分配多少计算量。为了优化这一分层推理过程,我们引入了分层相对策略优化(HRPO),它将策略优化分解为流式推理和深度推理阶段,提供更细粒度的优势分配,而不是将单一序列级优势均匀分配给所有token。HRPO整合了格式、准确性和自适应思考奖励,以强制执行有效的推理协议,保持最终任务性能,并鼓励延迟感知的计算分配。实验表明,与监督微调基线相比,AdaSR在推理准确率、计算效率和流式延迟之间实现了更好的平衡。代码发布地址:https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR

英文摘要

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives continuously and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than supervised fine-tuning baselines. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.

2506.17104 2026-06-16 cs.AI cs.CL cs.LO 版本更新

Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

迈向基于一阶逻辑定理证明的大语言模型高级数学推理

Chuxue Cao, Mengze Li, Juntao Dai, Jinluan Yang, Zijian Zhao, Shengyu Zhang, Weijie Shi, Chengzhong Liu, Sirui Han, Yike Guo

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 针对大语言模型在多步一阶逻辑数学推理中的困难,提出DREAM方法,通过公理驱动策略多样化和子命题错误反馈提升推理多样性和正确性,在定理证明数据集上性能提升0.6%-6.4%。

Comments Accepted by EMNLP 25

详情
AI中文摘要

大语言模型(LLMs)在一阶逻辑(FOL)推理方面展现出有前景的能力,并在各个领域得到应用。然而,它们在涉及多步FOL推理的复杂数学推理中的有效性仍待研究。尽管LLMs在已有的数学推理基准上表现有竞争力,但它们在多步FOL任务上表现不佳,例如Deepseek-Prover-V2-7B在我们提出的定理证明数据集上的准确率仅为4.2%。这一问题源于对多样化证明策略的探索有限,以及早期推理错误可能破坏整个证明。为解决这些问题,我们提出DREAM,一种自适应解决方案,增强LLMs生成策略的多样性和合理性。DREAM包含公理驱动策略多样化机制以促进多样化的策略结果,以及子命题错误反馈以帮助LLMs反思和纠正其证明。我们的贡献包括:通过FOL定理证明在LLMs的数学推理方面取得开创性进展,引入一种新颖的推理阶段解决方案,将性能提升0.6%至6.4%,并提供包含447个数学定理的Lean 4格式数据集用于评估。

英文摘要

Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.

2601.18699 2026-06-16 cs.LG cs.CL 版本更新

Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

大型语言模型在持续微调过程中灾难性遗忘的机制分析

Gustav Olaf Yunus Laitinen-Fredriksson Lundstrom-Imanov

发表机构 * Division of Statistics and Machine Learning (STIMA), Department of Computer and Information Science (IDA), Linköping University(林雪平大学计算机与信息科学系(IDA)统计与机器学习部(STIMA))

AI总结 本文系统比较了20个顶级LLM在持续微调中的灾难性遗忘,通过行为分析和机制解释定位易受参数覆盖的神经回路,并提出低秩电路投影(LRCP)方法,在开放权重模型中恢复高达94.2%的祖先能力。

Comments 12 pages, 8 figures, 5 tables. Preprint submitted to Elsevier

详情
AI中文摘要

大型语言模型(LLMs)在适应目标任务时的顺序微调常常引发灾难性遗忘,即获取新目标技能会削弱原有能力。本文对代表2026年中期的二十个顶级模型进行了灾难性遗忘的系统比较研究。我们将研究分为两条主线:(i)对十个领先闭源模型(包括Claude Fable 5、GPT-5.5 High和Gemini 3.5 Flash)的行为和语义输出漂移分析;(ii)对十个著名开放权重架构(如DeepSeek-V4-Pro、Llama 4 Maverick和Qwen 3.6-27B)的深度机制解释。通过权重空间轨迹追踪、中心核对齐(CKA)以及混合专家(MoE)层中的路由门漂移计算,我们定位了高度易受参数覆盖的神经回路。我们的发现表明,早期层的注意力头表现出系统性熵扩散,而中深层的前馈网络(或稀疏专家块)则遭受局部表示崩溃。基于这些见解,我们引入了低秩电路投影(LRCP),一种子空间正则化的训练干预。实证评估显示,LRCP在开放权重配置中成功恢复了高达94.2%的祖先能力,并匹配了标准PEFT基线的适应速度。

英文摘要

Sequential fine-tuning of large language models (LLMs) to adapt them to target tasks often triggers catastrophic forgetting, where the acquisition of novel target skills degrades ancestral capabilities. This paper presents a systematic comparative study of catastrophic forgetting across twenty premier models representing the state-of-the-art in mid-2026. We categorize our investigation into two primary research lines: (i) a behavioral and semantic output drift analysis of ten leading closed-source models (including Claude Fable 5, GPT-5.5 High, and Gemini 3.5 Flash), and (ii) a deep mechanistic interpretation of ten prominent open-weight architectures (such as DeepSeek-V4-Pro, Llama 4 Maverick, and Qwen 3.6-27B). Through weight-space trajectory tracking, Centered Kernel Alignment (CKA), and routing gate drift calculations in Mixture-of-Experts (MoE) layers, we localize the neural circuits highly susceptible to parameter overwriting. Our findings indicate that early-layer attention heads exhibit systemic entropic dispersion, while mid-to-deep feed-forward networks (or sparse expert blocks) suffer localized representation collapse. Informed by these insights, we introduce Low-Rank Circuit Projection (LRCP), a subspace-regularized training intervention. Empirical evaluations show that LRCP recovers up to 94.2% of ancestral capabilities in open-weight configurations and matches the adaptation velocity of standard PEFT baselines.

2603.07079 2026-06-16 cs.LG cs.CL 版本更新

Entropy-Aware On-Policy Distillation of Language Models

熵感知的在线策略蒸馏语言模型

Woogyeol Jin, Taywon Min, Yongjin Yang, Dennis Wei, Yi Zhou, Swanand Ravindra Kadhe, Nathalie Baracaldo, Kimin Lee

AI总结 针对在线策略蒸馏中反向KL导致生成多样性下降和教师高熵时学习信号不稳定的问题,提出熵感知的在线策略蒸馏方法,通过在高熵时引入前向KL平衡模式寻求与模式覆盖,提升了生成多样性和学生-教师对齐度。

Comments 18 pages, 11 figures, ICML 2026

详情
AI中文摘要

在线策略蒸馏是一种有前景的语言模型知识迁移方法,学生模型沿着自身轨迹从密集的token级信号中学习。该框架通常使用反向KL散度,鼓励学生匹配教师的高置信度预测。然而,我们表明反向KL的模式寻求特性会降低生成多样性,并在教师分布具有高熵时产生不稳定的学习信号。为解决此问题,我们引入了熵感知的在线策略蒸馏。我们的关键思想是在教师熵高时,用前向KL增强标准的反向KL目标,以捕获全部合理输出范围,同时在其他地方保留精确模仿。它在不牺牲在线策略训练效率的情况下,平衡了模式寻求的精确性与模式覆盖的鲁棒性。实验表明,我们的方法保持了生成多样性(持续的token级熵),并改善了学生-教师对齐(在高熵token上降低前向KL)。在六个数学推理基准上,与基线在线策略蒸馏方法相比,Qwen3-0.6B-Base的Pass@8准确率提升+1.37,Qwen3-1.7B-Base提升+2.39,Qwen3-4B-Base提升+5.05。这些结果表明,考虑教师不确定性对于保持多样性和实现有效知识迁移至关重要。

英文摘要

On-policy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
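The objective described above (reverse KL everywhere, with forward KL added where the teacher is uncertain) can be written for a single token position as follows. This is a sketch under assumptions, not the paper's loss: the hard entropy threshold, the `fkl_weight` coefficient, and the requirement that both distributions be strictly positive are illustrative choices that the abstract does not specify.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def token_loss(student, teacher, entropy_threshold, fkl_weight=1.0):
    """Per-token distillation loss with an entropy-gated forward-KL term."""
    loss = kl(student, teacher)  # reverse KL: mode-seeking
    if entropy(teacher) > entropy_threshold:
        # Teacher is uncertain here: add forward KL, which is mode-covering.
        loss += fkl_weight * kl(teacher, student)
    return loss
```

Reverse KL, KL(student || teacher), penalizes the student for placing mass where the teacher has little, while forward KL, KL(teacher || student), penalizes the student for missing outputs the teacher considers plausible. The gate applies the second penalty only at high-entropy teacher tokens.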

2605.22873 2026-06-16 cs.LG cs.AI cs.CL 版本更新

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

LLM何时推理?基于熵相变的动力系统视角

Wei Xia, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

发表机构 * Samsung Research(三星研究院) State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(北京大学智能学院通用人工智能国家重点实验室)

AI总结 本文通过早期解码熵动态检测LLM的推理状态,提出轻量级无训练路由框架EDRM,自适应选择推理策略,在减少token消耗的同时提升准确率。

详情
AI中文摘要

链式思维(CoT)推理已成为增强LLM能力的默认策略,但其应用引发了一个基本问题:显式推理何时真正有益?实证证据揭示了一个显著悖论:CoT在事实性和开放式任务上往往只带来边际增益甚至负增益,同时成倍增加token消耗。在这项工作中,我们表明LLM推理不是任务或模型的静态属性,而是在生成过程中涌现的动态解码状态。通过系统分析,我们发现早期熵动态提供了这一状态的可靠信号:受益于CoT的任务表现出一致的熵降低,而其他任务则呈现不稳定或增加的模式。这种行为可以解释为从高熵探索状态到低熵结构化推理状态的类相变转变。基于这些见解,我们提出了EDRM(基于熵动态的推理流形),一个轻量级且无需训练的路由框架,利用早期解码熵自适应选择推理策略。EDRM将熵轨迹嵌入到紧凑且可解释的流形表示中,支持零样本部署和细粒度实例级适应。在15个基准测试和4个不同规模与架构的LLM上,EDRM始终优于静态基线。在数据集层面,EDRM实现了41%至55%的token减少,同时仅需50个校准样本即可提高准确率。在实例层面,它进一步将准确率提升高达4.7%,同时保持27%至45%的token节省。这些结果表明,推理应被选择性地调用而非默认使用,并展示了基于熵的解码控制对于高效自适应LLM推理的有效性。

英文摘要

Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a dynamic decoding state that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose EDRM (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves 41-55% token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to 4.7% while maintaining 27-45% token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.
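The signal the abstract relies on, whether token-level entropy falls during the first decoding steps, can be illustrated with a deliberately simple stand-in. The sketch below is not EDRM: it replaces the paper's manifold representation with a plain least-squares slope over the early entropy trajectory, and the `slope_threshold` value and the strategy labels `"cot"` and `"direct"` are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_slope(entropies):
    """Least-squares slope of entropy against decoding step (needs >= 2 steps)."""
    n = len(entropies)
    mean_x = (n - 1) / 2
    mean_y = sum(entropies) / n
    cov = sum((i - mean_x) * (h - mean_y) for i, h in enumerate(entropies))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def choose_strategy(entropies, slope_threshold=0.0):
    """Route to explicit reasoning only when early entropy is clearly falling."""
    return "cot" if entropy_slope(entropies) < slope_threshold else "direct"
```

A trajectory such as `[2.0, 1.5, 1.0, 0.5]` has slope -0.5 and would be routed to chain-of-thought, while a rising trajectory such as `[0.5, 1.0, 1.5]` would be answered directly.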

2606.04547 2026-06-16 cs.IR cs.CL 版本更新

Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization

超越检索:学习紧凑用户表示以实现可扩展的LLM个性化

Heng Cao, Fan Zhang, Jian Yao, Yujie Zheng, Changlin Zhao, Lu Hao, Yuxuan Wei, Wangze Ni, Huaiyu Fu, Yuqian Sun, Xuyan Mo

发表机构 * Microsoft(微软公司) Shanghai International Studies University(上海外国语大学) Zhejiang University(浙江大学) Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University(香港理工大学数据科学与人工智能系)

AI总结 提出TAP-PER框架,通过可学习的用户状态前缀嵌入编码用户偏好,避免显式提示构建和繁重的每用户适配器,在六个LaMP任务上优于基线方法,并显著减少参数开销。

Comments 16 pages, 6 figures

详情
AI中文摘要

个性化大型语言模型需要在保持鲁棒性和部署规模效率的同时,将模型行为适应于个体用户。现有方法通常在输入层面(通过检索用户历史或构建个人资料提示)或参数层面(通过维护用户特定的参数高效模块)进行个性化。前者使个性化对检索质量和提示设计敏感,而后者则产生随用户数量增长的存储和维护成本。为解决这些限制,我们提出TAP-PER(时间注意力前缀个性化),一种基于前缀的框架,将用户偏好编码为可学习的表示,消除了显式提示构建,并用轻量级用户状态前缀嵌入替代了繁重的每用户适配器。受个性化推荐系统启发,TAP-PER将用户建模分解为用户状态和查询条件组件,并引入时间信号以捕捉用户兴趣的演变特性。在六个LaMP任务上的实验表明,TAP-PER在分类、评分和生成设置中均持续优于基于提示和基于模型的基线。此外,在1000用户规模下,TAP-PER的每用户参数比OPPU少130倍,总参数量约为PER-PCS的一半,证明无需显式提示构建或繁重的每用户适配器即可实现可扩展的LLM个性化。

英文摘要

Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.

2606.13003 2026-06-16 cs.AI cs.CL cs.MA 版本更新

The Illusion of Multi-Agent Advantage

多智能体优势的错觉

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

发表机构 * Salesforce Research(Salesforce研究院) HKUST (Guangzhou)(香港科技大学(广州)) University of British Columbia(不列颠哥伦比亚大学) Nanyang Technological University(南洋理工大学)

AI总结 通过系统评估,发现自动生成的多智能体系统在性能和成本效率上均不如单智能体基线(如思维链自一致性),揭示了现有评估框架的缺陷和架构膨胀问题。

详情
AI中文摘要

普遍观点认为多智能体系统优于单智能体系统,其优势包括上下文保护、并行处理和分布式决策。然而,这一主张的经验支持主要依赖于与使用优先考虑孤立推理任务的基准测试的单智能体基线的比较,这些基准测试未能充分评估这些优势。我们专注于自动生成的多智能体系统(旨在比手动设计的系统具有更强的泛化能力),对单智能体系统(特别是思维链自一致性)进行了严格、系统的评估。在传统推理数据集和具有交互式多步骤工作流的任务(例如 BrowseComp-Plus)上,我们证明自动多智能体系统始终不如思维链自一致性,尽管其成本高达10倍。为了将这些失败与任务结构固有的局限性隔离开来,我们引入了一个为多智能体系统量身定制的诊断性合成数据集,该数据集具有显式任务分解、上下文分离和并行化潜力。我们表明,专家设计的多智能体系统在该数据集上的原始性能和成本效率方面始终优于自动生成的架构,这表明现有的评估框架未能考虑增加计算成本的边际效用,从而掩盖了复杂多智能体系统的关键架构缺陷和低效性。关键的是,对生成的多智能体系统架构的系统解构表明,当前的自动化设计范式产生了架构膨胀,优先考虑表面复杂性,但这并未转化为功能效用,暴露了与多智能体原则的根本性错位。

英文摘要

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperform automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity that does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
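
The CoT-SC baseline named in this abstract aggregates several sampled reasoning chains by majority vote over their final answers. The following is a minimal sketch of that voting step only, not the paper's evaluation code; the function name and the sampled answers are hypothetical.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent final answer among sampled reasoning chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Counter.most_common breaks ties by first appearance in the input.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled chains of thought.
sampled = ["42", "41", "42", "42", "17"]
voted = self_consistency_vote(sampled)
```

The cost of this baseline grows linearly with the number of sampled chains, which is the quantity a cost comparison against a multi-agent pipeline would need to account for.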

2605.28860 2026-06-16 cs.LG cs.AI cs.CL cs.CR 版本更新

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

灾难性遗忘的机制起源:为什么RL比SFT更好地保留电路?

Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 通过引入差异电路脆弱性指标,研究比较了强化学习与监督微调在大型语言模型微调中对内部计算电路的保留程度,发现RL虽任务适应较慢但能更好保留电路,从而减轻灾难性遗忘。

详情
AI中文摘要

微调大型语言模型(LLMs)经常导致先前能力的灾难性遗忘。最近的研究表明,强化学习(RL)比监督微调(SFT)更有效地保留先前能力,这归因于策略梯度更新更接近基础策略(Shenfeld 等,2025)。我们将这种行为解释扩展到机制层面,并探究RL的优势是否通过内部计算电路的更强保留来体现。我们引入了差异电路脆弱性,一种头部级别的度量,用于衡量电路在微调下的退化程度,并将其用于比较RL和SFT在Qwen2.5-3B-Instruct适应科学问答任务上的表现。我们发现了清晰的机制权衡:SFT更快地适应目标任务,但导致更大的电路破坏和先前能力的遗忘,而RL保留了更大比例的基础电路,代价是任务适应较慢。这些发现表明,电路保留可能有助于解释为什么RL对灾难性遗忘更具鲁棒性。我们在此发布了代码:https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability。

英文摘要

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy (Shenfeld et al., 2025). We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. Our code is available at https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.

2. 机器翻译与跨语言处理 4 篇

2606.15483 2026-06-16 cs.CL 新提交

Evaluative Judgement in Teaching AI-based Translation: A Classroom Case Study of AI-Mediated Translation and Post-Editing

基于AI的翻译教学中的评价判断:AI中介翻译与译后编辑的课堂案例研究

Gokhan Dogru

发表机构 * Universitat Pompeu Fabra Barcelona(巴塞罗那庞培法布拉大学)

AI总结 通过分析23个学生项目,研究结构化比较通用LLM和在线MT系统如何激发AI中介翻译中的评价判断,发现学生不盲从自动指标,而是基于充分性、流畅性等理由选择译后编辑输出。

Comments Workshop on Teaching AI-based Translation and Technologies (TAITT 2026) - EAMT 2026

详情
AI中文摘要

基于翻译本科课程中第四年机器翻译与译后编辑课程的23个匿名学生项目,本文研究了通用大语言模型和在线机器翻译系统的结构化比较如何引发AI中介翻译中的评价判断。学生将英文维基百科短文本翻译成加泰罗尼亚语或西班牙语,生成四个系统输出,使用自动指标和人工充分性/流畅性评估进行评价,选择一个输出进行译后编辑,并在书面报告中证明其决定。对所有23个项目报告了描述性计数,而定性解释基于22个附有书面报告的案例。结果表明,学生并未将自动指标视为最终权威:最终的译后编辑选择往往与指标排名不同,并通过充分性、流畅性、术语、自然性和预期的译后编辑工作量来证明其合理性。因此,本研究并非在受控条件下对系统进行基准测试;而是分析学生在真实课堂作业中如何证明系统选择的合理性。

英文摘要

Drawing on 23 anonymized student projects from a fourth-year Machine Translation and Post-editing course in a BA-level translation programme, this paper examines how structured comparison of general-purpose LLMs and online MT systems can elicit evaluative judgement in AI-mediated translation. Students translated short specialised English Wikipedia texts into Catalan or Spanish, generated four system outputs, evaluated them using automatic metrics and human adequacy/fluency assessment, selected one output for post-editing, and justified their decision in written reports. Descriptive counts are reported for all 23 projects, while qualitative interpretation is based on the 22 cases accompanied by written reports. Results show that students did not treat automatic metrics as final authority: final post-editing selections often diverged from metric rankings and were justified through adequacy, fluency, terminology, naturalness, and expected post-editing effort. The study therefore does not benchmark systems under controlled conditions; it analyses how students justified system choice within an authentic classroom assignment.

2606.16596 2026-06-16 cs.CL 新提交

How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups

机器翻译质量能带你走多远?目标导向设置中的外在话语评估

Wafaa Mohammed, Kata Naszadi, Vlad Niculae

发表机构 * Language Technology Lab, University of Amsterdam(语言技术实验室,阿姆斯特丹大学)

AI总结 研究机器翻译在静态和交互式目标导向任务中的外在话语评估,发现高内在翻译质量不能保证下游话语成功,且强系统仍存在指代不一致问题。

详情
AI中文摘要

现有的机器翻译(MT)指标和话语焦点评估主要从内在角度评估翻译质量,而不衡量翻译错误的下游后果。在这项工作中,我们专注于两种不同机制下的机器翻译外在话语评估:静态和交互式。在静态机制下,我们提出一个实体计数任务作为话语中指代一致性的探针。我们表明,高内在MT质量并不能可靠地预测下游话语成功,且强MT系统仍然会产生指代不一致。对于交互式机制,我们研究目标导向的多智能体福利外交游戏作为长期沟通和协调的探针。我们发现,交互特定的翻译失败会影响下游协调。我们的结果强调了目标导向环境作为对话语敏感的MT外在评估的可行框架。

英文摘要

Existing machine translation (MT) metrics and discourse-focused evaluations primarily assess translation quality intrinsically, without measuring the downstream consequences of translation errors. In this work, we focus on extrinsic discourse evaluation of machine translation under two distinct regimes: static and interactive. Under the static regime, we propose an entity counting task as a probe of referential consistency in discourse. We show that high intrinsic MT quality does not reliably predict downstream discourse success and strong MT systems still produce referential inconsistencies. For the interactive regime, we study the goal-oriented multi-agent Welfare Diplomacy game as a probe of long-horizon communication and coordination. We find that interaction-specific translation failures impact downstream coordination. Our results highlight goal-oriented environments as a viable framework for discourse-sensitive extrinsic MT evaluation.

2601.22777 2026-06-16 cs.CL 版本更新

RASST: Retrieval-Augmented Simultaneous Speech Translation

RASST:检索增强的同声传译

Jiaxuan Luo, Siqi Ouyang, Jiaxing Xu, Lei Li

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对同声传译中罕见术语翻译不准的问题,提出检索增强方法RASST,通过轻量级语音-文本检索器提供分块术语提示,并合成训练数据教会模型何时应用检索术语,在ACL 60/60和ESO测试集上术语准确率提升近40%,BLEU提升最多3点。

Comments Under Review

详情
AI中文摘要

同声传译从部分语音输入增量生成目标文本。最近的语音大语言模型显著提高了SST质量,但仍难以处理罕见和领域特定的术语。检索增强已用于自动语音识别和神经机器翻译,但将其扩展到SST并非易事:在部分语音下检索必须快速准确,并且模型必须在增量生成过程中决定是否以及何时应用检索到的术语。我们提出了检索增强同声传译(RASST),解决了这两个挑战。为了在部分输入下实现准确的跨模态检索,RASST训练了一个轻量级语音-文本检索器,通过多尺度检索为语音LLM提供分块术语提示。为了正确使用这些提示,我们合成了训练数据,教会语音LLM决定是否以及何时应用每个检索到的术语。在ACL 60/60开发集和ESO测试集上的实验表明,RASST将术语准确率提高了近40%,整体翻译质量提高了最多3个BLEU点,且计算开销可忽略不计。

英文摘要

Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models have markedly improved SST quality but still struggle with rare and domain-specific terminology. Retrieval augmentation has helped in automatic speech recognition and neural machine translation, but extending it to SST is non-trivial: retrieval must be fast and accurate under partial speech, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which addresses both challenges. For accurate cross-modal retrieval under partial input, RASST trains a lightweight speech-text retriever that produces chunkwise terminology hints for the Speech LLM via multi-scale retrieval. To use these hints correctly, we synthesize training data that teaches the Speech LLM to decide whether and when to apply each retrieved term. Experiments on the ACL 60/60 dev set and the ESO test set show that RASST improves terminology accuracy by nearly 40% and overall translation quality by up to 3 BLEU points, with negligible computational overhead.

2507.17588 2026-06-16 cs.CV cs.CL 版本更新

Dual-branch Prompting for Multimodal Machine Translation

双分支提示用于多模态机器翻译

Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) School of Computer and Software Engineering, Xihua University(西华大学计算机与软件工程学院) School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine(成都中医药大学智能医学学院)

AI总结 提出基于扩散模型的双分支提示框架D2P-MMT,利用重建图像过滤视觉噪声,通过分布对齐损失提升鲁棒翻译性能。

Comments This manuscript has been fully accepted and published by ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM)

详情
AI中文摘要

多模态机器翻译(MMT)通常通过整合对齐的视觉特征来增强纯文本翻译。尽管取得了显著进展,最先进的MMT方法在推理时通常依赖于配对的图像-文本输入,并且对无关的视觉噪声敏感,这限制了它们的鲁棒性和实际应用性。为了解决这些问题,我们提出了D2P-MMT,一种基于扩散的双分支提示框架,用于鲁棒的视觉引导翻译。具体来说,D2P-MMT仅需要源文本和由预训练扩散模型生成的重建图像,该图像自然地过滤掉分散注意力的视觉细节,同时保留语义线索。在训练期间,模型使用双分支提示策略从真实图像和重建图像中联合学习,鼓励丰富的跨模态交互。为了弥合模态差距并减轻训练-推理差异,我们引入了一种分布对齐损失,强制两个分支的输出分布之间的一致性。在Multi30K数据集上的大量实验表明,与现有最先进方法相比,D2P-MMT实现了更优的翻译性能。我们的代码公开于 https://github.com/MentaY/DDP。

英文摘要

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches. Our code is publicly available at https://github.com/MentaY/DDP.

3. 信息抽取、检索与问答 21 篇

2606.14875 2026-06-16 cs.CL 新提交

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

上下文压缩并非单一事物:在匹配预算下可读的符号化重新表达与连贯摘要的比较

Sisong Bei, Mikhail L. Arbuzov, Ziwei Dong, Dmitri Kalaev, Alexey Shvets

发表机构 * Independent Researcher(独立研究员) Palo Alto Networks

AI总结 提出Telegraph English可读符号格式,将检索段落重写为结构化实体关系陈述,在匹配预算下比三种压缩基线及连贯摘要更有效,F1提升13-20个百分点。

详情
AI中文摘要

我们研究了使用小型语言模型进行多跳问答时的上下文压缩。我们提出Telegraph English,一种可读的符号化格式,将检索到的段落重写为结构化的实体关系陈述,以更低的token成本保留推理证据。在MuSiQue、TwoWiki和HotpotQA上的受控实验中,Telegraph English在每个数据集上都优于三种匹配预算的压缩基线(字符级删除、截断和随机子采样),F1得分提升13至20个百分点。它还在最难的数据集上优于由同一编码器生成的连贯散文摘要。一个预先注册的深度交互假设被证伪:在数据集内,优势并未随推理深度增加而增长。我们将这些结果解释为证据,表明在匹配的token预算下,可读的符号化重新表达比自然语言或连贯摘要更密集地保留了实体内容。

英文摘要

We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is not supported: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.

2606.15412 2026-06-16 cs.CL cs.AI 新提交

Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

基于大语言模型的少样本生物医学关系抽取:监督学习的可行替代方案?

Jakob Mraz, Tomaž Curk, Blaž Zupan

发表机构 * University of Ljubljana(卢布尔雅那大学) Baylor College of Medicine(贝勒医学院)

AI总结 研究利用大语言模型进行少样本生物医学关系抽取,比较成对分类与联合生成两种方法,发现联合生成更精确高效,在宏F1上超越监督基线,尤其在稀有关系类型上表现突出。

详情
AI中文摘要

生物医学关系抽取(BioRE)是将生物医学文献转化为结构化知识的关键步骤。然而,现有方法大多依赖在昂贵标注数据集上训练的监督模型,限制了其在关系类型和领域上的可扩展性和适应性。我们研究了基于提示学习的大语言模型(LLMs)进行少样本BioRE,并比较了两种任务形式:成对分类(预测单个实体对的关系)和联合生成(在单次模型调用中提取多个关系)。在BioREDirect数据集上的实验揭示了明确的精确率-召回率权衡。成对分类实现了更高的召回率,而联合生成更精确且计算效率更高。最佳模型达到了0.44的微F1分数,显著优于之前的少样本结果(0.34),但仍低于监督基线(0.56)。这一差距大部分归因于一个定义模糊的关系类型。当使用宏F1评估时(在类别不平衡设置下更能反映跨关系类型的性能),基于提示的方法优于监督基线(0.45 vs. 0.38),尤其在稀有关系类型上。这些发现突显了LLMs在低资源场景下进行BioRE的潜力,并强调了定义良好的关系模式的重要性。

英文摘要

Biomedical relation extraction (BioRE) is a key step in transforming biomedical literature into structured knowledge. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains. We investigate few-shot BioRE using prompt-based learning with large language models (LLMs) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments on the BioREDirect dataset reveal a clear precision-recall trade-off. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. The best-performing model achieves a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) while remaining below the supervised baseline (0.56). Much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across relation types in an imbalanced setting, prompt-based approaches outperform the supervised baseline (0.45 vs. 0.38), particularly on rare relation types. These findings highlight the potential of LLMs for BioRE in low-resource settings and underscore the importance of well-defined relation schemas.
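
The abstract reports that the ranking of methods flips between micro-F1 and macro-F1. The sketch below illustrates why the two averages can diverge under class imbalance; the relation names and the per-class counts are made up for illustration and are not taken from BioREDirect.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps relation type -> (tp, fp, fn)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)                                      # pools all decisions
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # averages per class
    return micro, macro


# Hypothetical counts: one frequent relation type and two rare ones.
counts = {"frequent": (80, 20, 20), "rare_a": (1, 4, 4), "rare_b": (1, 4, 4)}
micro, macro = micro_macro_f1(counts)
```

Micro-F1 is dominated by the frequent type, while macro-F1 weights each relation type equally, so a system that does better on rare types can lead on macro-F1 while trailing on micro-F1.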

2606.15449 2026-06-16 cs.CL cs.IR cs.LG 新提交

Transfer Learning for FHIR Questionnaire Terminology Binding

面向 FHIR 问卷术语绑定的迁移学习

Maxim Gorshkov

发表机构 * Department of Computer Science, Stanford University(斯坦福大学计算机科学系)

AI总结 将 FHIR 问卷项与 LOINC 代码的绑定视为检索问题,比较六种方法,发现 BioLORD 在 top-1 准确率上最优,而对比微调在 top-5 和 top-10 上表现更好,并分析了分布偏移和错误类型。

详情
AI中文摘要

电子预授权工作流要求 FHIR 问卷项携带 LOINC 代码,但 HL7 Da Vinci CDS-Library 中的大多数项缺乏这些绑定。我们将其视为一个检索问题:给定问卷项的文本,从 97,314 个活跃代码池中找到正确的 LOINC 代码。我们在一个包含 54 个项的评估集上比较了六种方法(TF-IDF、冻结 MiniLM、BioBERT、BioLORD、对比微调 MiniLM 以及 TF-IDF+GPT 重排序器),该评估集涵盖三种查询风格(自然问题、中等和简洁)。没有单一方法在所有指标上获胜。BioLORD 是一个在生物医学本体定义上预训练的冻结编码器,尽管没有见过任务特定数据,但其 top-1 准确率最高(R@1 = 0.185,MRR = 0.246),而在原始 LHC-Forms 对上的对比微调则在 R@5(0.389)和 R@10(0.426)上表现最佳。分布偏移消融实验表明,为什么我们主表中的微调不是最强的:在原始对中添加 GPT 生成的释义后,R@5 从 0.389 降至 0.296,因此增强联合在除 R@1 外的所有指标上均不如仅使用原始训练。性能在 5k 训练对时达到峰值。对 BioLORD 的 R@1 失败案例的错误分析表明,错误特异性和歧义文本案例共占错误的 59%。

英文摘要

Electronic prior authorization workflows require FHIR Questionnaire items to carry LOINC codes, yet most items in the HL7 Da Vinci CDS-Library lack these bindings. We treat this as a retrieval problem: given a Questionnaire item's text, find the correct LOINC code in a pool of 97,314 active codes. We compare six methods (TF-IDF, frozen MiniLM, BioBERT, BioLORD, contrastively fine-tuned MiniLM, and a TF-IDF+GPT reranker) on a 54-item evaluation set spanning three query styles (natural question, medium, and terse). No single method wins on every metric. BioLORD, a frozen encoder pre-trained on biomedical ontology definitions, has the best top-rank accuracy (R@1 = 0.185, MRR = 0.246) despite seeing no task-specific data, while a contrastive fine-tune on raw LHC-Forms pairs takes R@5 (0.389) and R@10 (0.426). A distribution-shift ablation shows why the fine-tune in our main table is not the strongest one: adding GPT-generated paraphrases to the raw pairs drops R@5 from 0.389 to 0.296, so the augmented union underperforms raw-only training on every metric except R@1. Performance peaks at 5k training pairs. Error analysis on BioLORD's R@1 failures shows that wrong-specificity and ambiguous-text cases together account for 59% of errors.
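
The metrics quoted above (R@1, R@5, R@10, MRR) can be computed as in the following minimal sketch. It assumes exactly one gold code per Questionnaire item, which the abstract does not state explicitly; the code identifiers are placeholders, not real LOINC codes.

```python
def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold code appears in the top k results."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold code; a miss contributes 0."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Three hypothetical queries, each with gold code "code_a".
ranked_lists = [
    ["code_a", "code_b", "code_c"],  # gold at rank 1
    ["code_b", "code_a", "code_c"],  # gold at rank 2
    ["code_c", "code_b", "code_d"],  # gold not retrieved
]
gold = ["code_a", "code_a", "code_a"]
r_at_1 = recall_at_k(ranked_lists, gold, 1)
r_at_2 = recall_at_k(ranked_lists, gold, 2)
mrr = mean_reciprocal_rank(ranked_lists, gold)
```

Under the one-gold-code assumption, an R@1 of 0.185 on 54 items would correspond to 10 correct top-ranked codes (10/54 ≈ 0.185), which shows how coarse the metric is at this evaluation-set size.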

2606.15566 2026-06-16 cs.CL cs.AI 新提交

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

科学话语中的立场检测:以贝叶斯认知科学为例的LLM辅助方法

Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

发表机构 * University of New Hampshire(新罕布什尔大学) Independent Researcher(独立研究员)

AI总结 提出结合理论驱动编码手册、专家标注和诊断门控提示优化的方法,利用三个前沿LLM检测贝叶斯模型在科学文本中的现实主义/工具主义立场,在210篇文章的6858条引文中达到0.78的联合信度。

Comments 9 pages, 4 figures; Code and data: https://github.com/EyupEK/autoresearch_bayes

详情
AI中文摘要

定性编码是社会科学的核心,但专家标注难以规模化。LLM提供了一种可能的扩展,但当目标构念是解释性的、理论负载的且仅间接表达时,需要仔细验证。我们在一个困难案例中研究这个问题:检测作者是将贝叶斯模型视为心理和神经机制的描述(现实主义)还是有用的数学工具(工具主义)。我们的方法结合了理论驱动的编码手册、专家编码的参考标注、诊断门控提示优化搜索(为三个前沿LLM:GPT-5.1、Claude Sonnet 4.6、Gemini 3 Pro Preview生成共享的零样本提示)以及多评估者信度分析。最终提示在保留样本上实现了0.76的综合信度分数(ICC=0.79和α=0.74的调和平均数),所有诊断均满足。在来自210篇文章的6858条引文上部署后,三个LLM达到了显著的引文级一致性(ICC=0.80;α=0.76;综合=0.78)和近乎完美的文章级排名稳定性(评估者对之间r=0.96-0.97)。语料库总体偏向弱现实主义,但文章级立场很少一致:仅1.4%的文章使用单一波段,而59.5%的文章跨越四个或更多波段。低层感知/运动文章比高层认知文章高出8.8个现实主义点(p<.001,d=0.60),量化了长期持有的定性直觉。我们将其作为专家主导的案例研究呈现;该框架旨在推广到类似的理论密集型任务,而非所有定性分析。

英文摘要

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and α = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; α = 0.76; combined = 0.78) and near-perfect article-level rank stability (r = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles (p < .001, d = 0.60), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.
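
The combined reliability score is described as the harmonic mean of ICC and α. As a quick check of the arithmetic, the sketch below recomputes both reported combined scores from the rounded ICC and α values given in the abstract; the function name is illustrative.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive scores; it is pulled toward the smaller one."""
    return 2 * a * b / (a + b)


held_out = harmonic_mean(0.79, 0.74)  # held-out ICC and alpha
deployed = harmonic_mean(0.80, 0.76)  # deployment ICC and alpha
```

Rounded to two decimals these give 0.76 and 0.78, matching the reported combined scores. The harmonic mean penalizes disagreement between the two coefficients more than an arithmetic mean would.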

2606.15641 2026-06-16 cs.CL 新提交

Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

将示例提炼为任务指令:面向真实B2B对话的增强上下文学习

Guy Rotman, Adi Kopilov, Danit Berger Zalmanson, Omri Allouche

AI总结 针对B2B对话分类中传统上下文学习因示例拼接导致上下文过长而性能受限的问题,提出知识蒸馏方法将冗长示例压缩为结构化分类标准和精确任务描述,实现令牌使用减少99%,宏平均AUC提升7%,且随上下文增长保持鲁棒。

Comments Accepted for publication in Findings of the Association for Computational Linguistics 2026

详情
AI中文摘要

上下文学习(ICL)是低资源分类的标准方法,但其在专业领域的有效性尚未充分探索。我们解决了语义复杂、多方B2B对话分类的挑战,传统ICL在此面临显著限制,尤其是当多个少样本示例拼接导致上下文长度增加时。我们引入了Call Playbook数据集,包含源自真实B2B对话的五项分类任务,针对核心销售概念。为了弥合性能与实际应用之间的差距,我们提出了新颖的知识提取方法,将冗长示例蒸馏为紧凑、可解释的结构化分类标准和精确任务描述。我们的方法实现了令牌使用减少99%,宏平均AUC比传统ICL提升高达7%。值得注意的是,与先进的令牌压缩基线(其F1分数下降超过9点)不同,我们的方法在上下文增长时保持鲁棒。重要的是,我们的框架能够直接优化分类逻辑,满足了真实NLP应用中对透明度、效率和用户交互的关键需求。

英文摘要

In-context learning (ICL) is the standard method for low-resource classification, yet its efficacy in specialized domains remains largely unexplored. We address the challenge of classifying semantically complex, multi-party B2B conversations, where traditional ICL encounters significant limitations, especially as context length increases due to the concatenation of multiple few-shot examples. We introduce the Call Playbook dataset, featuring five classification tasks derived from real-world B2B conversations targeting core sales concepts. To bridge the gap between performance and practical utility, we propose novel knowledge extraction methods that distill verbose examples into compact, interpretable representations of structured classification criteria and precise task descriptions. Our approach achieves a 99% reduction in token usage and improves macro-averaged AUC by up to 7% over traditional ICL. Notably, it remains robust as context grows, unlike advanced token compression baselines which degrade by over 9 F1 points. Importantly, our framework enables direct refinement of classification logic, addressing critical needs for transparency, efficiency, and user interaction in real-world NLP applications.

2606.15770 2026-06-16 cs.CL 新提交

ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection

ttda704 at SemEval-2026 Task 6: 用于政治回避检测的结构化思维链提示

Tai Tran Tan, An Dinh Thien

发表机构 * University of Information Technology, Ho Chi Minh City, Vietnam(胡志明市信息技术大学) Vietnam National University, Ho Chi Minh City, Vietnam(越南国家大学胡志明市分校)

AI总结 针对总统访谈中政治回避策略分类任务,比较QLoRA微调Qwen3与结构化思维链提示DeepSeek-V3.2/Grok-4-Fast,发现后者在Macro F1上显著更优,最佳系统在9类回避任务上Macro F1达0.5147。

详情
AI中文摘要

本文描述了我们在SemEval-2026任务6中的系统,该任务涉及对从美国总统访谈中提取的英文问答对进行政治回避策略分类。我们系统比较了两种不同的范式:(1) 使用QLoRA对Qwen3模型(4B-32B)进行参数高效微调,通过分层上采样和加权交叉熵损失来应对严重的类别不平衡;(2) 对具备推理能力的API模型(即DeepSeek-V3.2和Grok-4-Fast)使用结构化思维链(CoT)提示。我们的评估表明,启用推理能力的模型的结构化CoT提示在绝对Macro F1上显著优于我们的基线参数高效微调实现。我们最好的系统,即具有扩展推理和少样本分层CoT提示的Grok-4-Fast,在子任务2(9类回避)上达到0.5147的Macro F1,在子任务1(3类清晰度)上达到0.7979的Macro F1,在官方排行榜上分别位列子任务2的第8名(共33支队伍)和子任务1的第13名(共41支队伍)。此外,我们的消融研究揭示了回避检测中有效提示设计的关键见解:在分层分类法中呈现标签有助于结构化模型推理,而少样本示例提供了任务校准。然而,最强的提示变体在Macro F1上并无统计显著差异,而显式启用扩展推理模式通过促进检测回避意图所需的多步语用分析,带来了显著的性能提升。

英文摘要

This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two distinct paradigms: (1) Parameter-Efficient Fine-Tuning of Qwen3 models (4B-32B) using QLoRA, enhanced with tiered upsampling and weighted cross-entropy loss to address severe class imbalance, and (2) structured Chain-of-Thought (CoT) prompting of reasoning-capable API models, namely DeepSeek-V3.2 and Grok-4-Fast. Our evaluation demonstrates that structured CoT prompting of reasoning-enabled models substantially outperforms our baseline parameter-efficient fine-tuning implementation in absolute Macro F1. Our best system, Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, achieves a Macro F1 of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity), ranking 8th out of 33 teams on Subtask 2 and 13th out of 41 teams on Subtask 1 on the official leaderboard. Furthermore, our ablation studies reveal key insights into effective prompt design for evasion detection: presenting labels within a hierarchical taxonomy helps structure model reasoning, while few-shot exemplars provide task calibration. However, the strongest prompt variants are not statistically distinguishable in Macro F1, and explicitly enabling extended reasoning modes yields substantial performance gains by facilitating the multi-step pragmatic analysis required to detect evasive intent.

2606.15833 2026-06-16 cs.CL 新提交

When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

当正确边无法被验证:不完全KGQA中的溯源缺口及一种偏好溯源的补全策略

Yongqi Kang, Yu Fu, Yong Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对不完全知识图谱问答中补全边的可验证性问题,发现76-96%的正确边缺乏文本支持,提出偏好溯源的TGComplete策略,在保持答案质量的同时显著提高边精度和严格忠实性。

详情
AI中文摘要

不完全知识图谱问答(IKGQA)需要补全缺失的边以继续推理。越来越多的研究通过检索文本来验证补全的边,将文本支持视为边质量的代理。我们提出了一个据我们所知尚未被系统检验的问题:文本可验证性是否真的反映了正确性?利用标准随机删除协议提供的黄金删除三元组,我们测量了这两者。发现是反直觉的:在黄金正确的补全边中,76-96%即使在穷尽检索下也没有支持段落,这一结果在删除率(20%/40%)、数据集(CWQ/WebQSP)和关系类型(结构型、常识型、长尾型)上均稳健。大多数Freebase风格的事实根本不会以头尾共现的形式出现在文本中。因此,文本忠实性衡量的是溯源,而非正确性——两者之间存在一个语料内检索无法弥合的范式级差距。这重新定义了边补全问题。由于大多数补全边——无论正确与否——对答案而言是因果冗余的(95-97%的正确答案不依赖于任何无支持的边),核心问题从“边是否正确?”转变为“在溯源不确定下是接受还是放弃?”在此框架下,我们提出了TGComplete,一种偏好溯源的接受策略,它在推理断点处检索证据,通过轻量级循环验证候选边,并在缺乏支持时放弃。与生成-补全基线GoG相比,它在黄金标准上获得了更高的边精度(15-21% vs 3-14%),且没有统计上显著的EM损失,同时被接受边的严格忠实性提高了3.1-7.4倍——代价是召回率较低。我们将TGComplete定位为并非全面更优,而是在精度/溯源召回权衡下的一个原则性选择,适用于可审计性重要的场景。

英文摘要

Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge quality. We ask a question that, to our knowledge, has not been systematically tested: does textual verifiability actually track correctness? Exploiting the gold deleted triples provided by the standard random-deletion protocol, we measure both. The finding is counterintuitive: among gold-correct completed edges, 76-96% have no supporting passage even under exhaustive retrieval, robustly across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Most Freebase-style facts simply do not occur as head-tail co-mentions in text. Textual faithfulness therefore measures provenance, not correctness -- separated by a paradigm-level gap no in-corpus retrieval closes. This reframes edge completion. Since most completed edges -- correct or not -- are causally redundant for the answer (95-97% of correct answers do not depend on any unsupported edge), the central question shifts from "is the edge correct?" to "admit or abstain under provenance uncertainty?" Within this framing we present TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate through a lightweight loop, and abstains when support is absent. Against the generate-to-complete baseline GoG, it attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges -- at the cost of lower recall. We position TGComplete not as uniformly better, but as a principled point on a precision/provenance-recall trade-off, appropriate when auditability matters.

2606.15971 2026-06-16 cs.CL 新提交

SAG: SQL-Retrieval Augmented Generation with Query-Time Dynamic Hyperedges

SAG: 查询时动态超边的SQL检索增强生成

Yuchao Wu, Junqin Li, XingCheng Liang, Yongjie Chen, Yinghao Liang, Linyuan Mo, Guanxian Li

发表机构 * Zleap AI

AI总结 提出SAG架构,通过SQL查询动态构建局部超边索引,避免全局图维护,在多跳推理基准上达到最优召回率,支持亿级数据生产部署。

详情
AI中文摘要

检索增强生成(RAG)为大语言模型访问外部知识提供了一种有效方法。然而,现有方法依赖密集相似性检索,在处理结构化约束和多跳推理方面存在固有限制。引入知识图谱可部分缓解这些问题,但代价是语义碎片化、高维护成本和难以增量更新。本文介绍了SAG(SQL检索增强生成),一种用于检索和代理系统的结构化架构。SAG不预先构建全局静态图,而是将每个块转换为一个语义完整的事件和一组索引实体,然后使用SQL连接查询动态地将共享实体的事件链接到局部超边中,在查询时构建动态实例化的局部索引结构。这种设计避免了全局图重建和持续维护的需要;该系统通过依赖标准数据库基础设施,自然支持增量写入、并发处理和持续扩展。在HotpotQA、2WikiMultiHop和MuSiQue这三个标准多跳基准上,SAG在9个Recall@K指标中的8个上取得了最佳结果,在MuSiQue(多跳推理需求最高的基准)上达到80.0%的Recall@5。SAG还已在数亿数据项的生产规模上部署,在线检索延迟保持在秒级。项目网站和代码见https://github.com/Zleap-AI/SAG-Benchmark。

英文摘要

Retrieval-Augmented Generation (RAG) offers an effective approach for large language models to access external knowledge. However, existing methods rely on dense similarity retrieval and face inherent limitations in handling structured constraints and multi-hop reasoning. Incorporating knowledge graphs partially alleviates these issues, but at the cost of semantic fragmentation, high maintenance overhead, and difficult incremental updates. This paper introduces SAG (SQL-Retrieval Augmented Generation), a structured architecture for retrieval and agent systems. Instead of pre-building a global static graph, SAG converts each chunk into one semantically complete event and a set of indexing entities, then uses SQL join queries to dynamically link events that share entities into local hyperedges, constructing, at query time, a dynamically instantiated local index structure. This design avoids the need for global graph rebuilding and ongoing maintenance; the system naturally supports incremental writes, concurrent processing, and continuous scaling through its reliance on standard database infrastructure. Across HotpotQA, 2WikiMultiHop, and MuSiQue, three standard multi-hop benchmarks, SAG achieves the best results on 8 out of 9 Recall@K metrics, reaching 80.0% Recall@5 on MuSiQue, the benchmark with the highest multi-hop reasoning demands. SAG has also been deployed at a production scale of hundreds of millions of data items, with online retrieval latency kept within seconds. Project site and code are available at https://github.com/Zleap-AI/SAG-Benchmark.
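
The idea of linking events that share entities through SQL joins at query time can be illustrated with a small in-memory SQLite database. This is a sketch under stated assumptions, not SAG's actual schema or query: the table names, column names, sample events, and the `local_hyperedges` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per event, one row per (event, entity) mention.
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE event_entities (event_id INTEGER, entity TEXT);
""")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "Alice founded Acme."),
    (2, "Acme is based in Lyon."),
    (3, "Bob lives in Oslo."),
])
conn.executemany("INSERT INTO event_entities VALUES (?, ?)", [
    (1, "Alice"), (1, "Acme"), (2, "Acme"), (2, "Lyon"), (3, "Bob"), (3, "Oslo"),
])


def local_hyperedges(conn, query_entity):
    """Map each entity of the seed events to all events that mention it."""
    rows = conn.execute(
        """
        SELECT DISTINCT linked.entity, linked.event_id
        FROM event_entities AS seed
        JOIN event_entities AS bridge ON bridge.event_id = seed.event_id
        JOIN event_entities AS linked ON linked.entity = bridge.entity
        WHERE seed.entity = ?
        ORDER BY linked.entity, linked.event_id
        """,
        (query_entity,),
    ).fetchall()
    edges = {}
    for entity, event_id in rows:
        edges.setdefault(entity, []).append(event_id)
    return edges


edges = local_hyperedges(conn, "Alice")
```

Here the query for "Alice" reaches event 2 through the shared entity "Acme" without any precomputed graph, and a new event becomes reachable as soon as its rows are inserted, which is the incremental-write property the abstract emphasizes.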

2606.16074 2026-06-16 cs.CL cs.AI 新提交

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

PVminerLLM2:通过偏好优化改进患者声音的结构化提取

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

发表机构 * Yale School of Medicine(耶鲁大学医学院) Yale School of Public Health(耶鲁大学公共卫生学院) Texas State University(德克萨斯州立大学)

AI总结 提出PVminerLLM2,通过偏好优化和令牌级门控稳定项、混淆感知偏好对构建等技术,解决监督微调难以处理的细粒度错误,在患者声音结构化提取任务上优于基线模型。

详情
AI中文摘要

动机:患者生成的文本包含关于患者生活经历、社会背景和护理参与的关键信息,但大多是非结构化的,限制了其在以患者为中心的结果研究中的应用。先前的工作引入了PV-Miner基准和PVMinerLLM模型用于结构化提取。然而,仅靠监督微调(SFT)难以处理罕见、细粒度且分布不均的错误,尤其是在令牌关键的结构化输出中。结果:我们提出了PVminerLLM2,一组改进的用于结构化患者声音提取的LLM,它应用偏好优化来解决监督微调无法处理的令牌级错误。我们的方法引入了(i)带有令牌级门控稳定项的偏好目标,防止在偏好优化下绝对令牌似然的退化,以及(ii)混淆感知的偏好对构建,以更好地捕捉低分离度的区分。我们进一步引入了令牌重要性加权和逆频率重加权,以解决令牌不平衡和类别偏斜问题。在多种模型规模下,PVMinerLLM2始终优于强基线,在代码、子代码和跨度上分别获得了高达4.43%、3.50%和1.55%的提升,并且优于使用现有偏好优化方法训练的基线LLM。可用性和实现:PVminerLLM2的补充材料、代码、评估脚本和训练模型公开于:https://github.com/Data-Mining-Lab-Yale/PVminerLLM2

英文摘要

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2
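
Inverse-frequency reweighting, mentioned above as a remedy for class skew, can be sketched generically as follows. This is the common "balanced" weighting formula (total samples divided by the number of classes times the class count), shown as an assumption; the abstract does not state which exact formula PVminerLLM2 uses, and the function name is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to N / (num_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    This is a generic scheme, not necessarily the paper's formula.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}
```

With this normalization the per-sample weights sum to the number of samples, so the overall loss scale stays comparable to the unweighted case.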

2606.16409 2026-06-16 cs.CL 新提交

PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

PathRouter: 在智能体图检索增强生成中对齐奖励与检索质量

Bo Wang, Heyan Huang, Yaolin Li, Wei Tang, Yuan Zhang, Wenbo Li, Mingze Gao, Ge Shi, Chong Feng

发表机构 * Beijing Institute of Technology(北京理工大学) Joy Future Academy

AI总结 针对智能体图RAG中答案路径奖励混淆和搜索更新模糊问题,提出PathRouter框架,通过路径感知训练联合评估答案正确性与证据路径重叠,并引入冻结金证据教师提供token级KL指导,在六个QA基准上显著提升F1和证据路径重叠。

详情
AI中文摘要

智能体图RAG训练语言模型代理迭代检索并推理图结构证据,通过高效导航复杂信息网络实现更准确和上下文感知的决策。然而,仅基于结果的强化学习存在答案路径奖励混淆,即正确答案可能来自捷径而非有用证据路径。它还表现出搜索更新模糊,因为标量轨迹级反馈未指示应调整哪些检索动作。为缓解这些缺陷,我们提出PathRouter,一种用于智能体图RAG的路径感知训练框架。PathRouter联合评估每条轨迹的答案正确性和证据路径重叠,产生四种轨迹类别,并采用差异化的GRPO优势缩放,抑制捷径强化同时保留证据寻求行为。对于证据贫乏的轨迹,冻结的金证据教师提供推理和搜索查询token上的token级KL指导,排除答案token以避免直接响应模仿。在三个模型大小、六个QA基准上的实验表明,PathRouter一致提升了答案F1和证据路径重叠,与强基线相比,3B模型平均F1提升3.1,7B模型提升4.9。

英文摘要

Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, enabling more accurate and context-aware decision-making by efficiently navigating complex information networks. However, outcome-only reinforcement learning suffers from answer-path reward aliasing, where correct answers may come from shortcuts rather than useful evidence paths. It also exhibits search-update ambiguity, as scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory along answer correctness and evidence-path overlap, yielding four trajectory categories with differentiated GRPO advantage scaling that suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models compared to a strong baseline.
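
The four-way trajectory split and differentiated advantage scaling can be sketched as below. The category names, the overlap threshold, and the scale factors are illustrative assumptions, not values from the paper; only the group-normalized advantage (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
# Hypothetical scale factors: damp rewards for shortcut answers and soften
# penalties for wrong answers that still followed the evidence path.
SCALE = {
    "correct_grounded": 1.0,
    "correct_shortcut": 0.3,
    "wrong_grounded": 0.5,
    "wrong_ungrounded": 1.0,
}


def categorize(answer_correct, path_overlap, overlap_threshold=0.5):
    """Assign one of four categories from answer correctness and evidence-path overlap."""
    evidence_rich = path_overlap >= overlap_threshold
    if answer_correct:
        return "correct_grounded" if evidence_rich else "correct_shortcut"
    return "wrong_grounded" if evidence_rich else "wrong_ungrounded"


def scaled_advantages(rewards, categories, eps=1e-6):
    """Group-normalize rewards, then scale each advantage by its category factor."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [SCALE[c] * (r - mean) / (std + eps) for r, c in zip(rewards, categories)]
```

In this sketch a correct answer reached by a shortcut still receives a positive advantage, only a smaller one, while a wrong but evidence-seeking trajectory is penalized less than a wrong, ungrounded one.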

2606.16817 2026-06-16 cs.CL cs.IR 新提交

Understanding the Behaviors of Environment-aware Information Retrieval

理解环境感知的信息检索行为

Ruifeng Yuan, Chaohao Yuan, David Dai, Yu Rong, Hong Cheng, Hou Pong Chan, Chenghao Xiao

发表机构 * Fudan University(复旦大学) Alibaba DAMO Academy(阿里巴巴达摩院) Chinese University of Hong Kong(香港中文大学) Stanford University(斯坦福大学) Shanghai University of Finance and Economics(上海财经大学)

AI总结 通过强化学习使LLM适应不同检索器的查询策略,发现不同检索器偏好不同查询风格,并提出分支式滚动技术提升训练稳定性。

Comments ACL 2026 Main

详情
AI中文摘要

最近的检索增强生成(RAG)方法在处理复杂查询方面展示了强大的能力,但当前研究忽略了一个关键挑战:不同的检索器需要根本不同的查询制定策略才能达到最佳性能。在这项工作中,我们首次系统分析了LLM如何通过强化学习(RL)学习适应不同检索器的查询制定策略。我们的实证研究表明,RL有效地教会了LLM根据特定检索器特征定制其查询。我们发现不同的检索器表现出令人惊讶的不同最优查询风格(例如,描述性 vs. 问题式),表明为一种检索器学习的策略对另一种检索器无效。我们进一步表明,通过结合检索器特定的人类指导和扩大模型规模可以提升性能。为了促进多检索步骤轨迹的学习,我们引入了一种基于分支的滚动技术,提高了训练稳定性。我们的工作为构建真正检索器感知的RAG系统提供了首个实证证据和可操作的见解。代码和资源可在 https://github.com/LCO-Embedding/Envs-aware-Information-Retrieval 获取。

英文摘要

Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we present the first systematic analysis of how LLMs can learn to adapt their query formulation strategies for different retrievers via reinforcement learning (RL). Our empirical study reveals that RL effectively teaches an LLM to tailor its queries to specific retriever characteristics. We discover that different retrievers exhibit surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), suggesting that strategies learned for one retriever are ineffective for another. We further show that performance can be enhanced by incorporating retriever-specific human guidance and by scaling model size. To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. Our work provides the first empirical evidence and actionable insights for building truly retriever-aware RAG systems. Code and resources are available at https://github.com/LCO-Embedding/Envs-aware-Information-Retrieval.

2606.16874 2026-06-16 cs.CL cs.CE cs.CY 新提交

Understanding Scam Trends and Rail Paths from Reddit Self-Disclosure Narratives

理解 Reddit 自我披露叙事中的诈骗趋势和路径

Yangjun Zhang, Mirko Bottarelli, Mark Hooper, Carsten Maple

发表机构 * The Alan Turing Institute, London, UK(艾伦·图灵研究所,伦敦,英国)

AI总结 通过构建2023-2025年Reddit自我披露数据集,采用启发式标注和LLM辅助方法分析诈骗类型趋势、多阶段路径及社区支持行为,发现诈骗过程以多路径为主且随时间变化。

Comments 6 pages, International Conference on AI and the Digital Economy (CADE) 2026

详情
AI中文摘要

在线诈骗行为本质上是多阶段的,其生命周期包括时间顺序的路径和事件,而非孤立的信号。现有工作分析了诈骗类型和路径的特征,但未追踪跨年份的诈骗趋势。此外,由于缺乏带有标注和覆盖不同诈骗类型的开源数据集,路径间关系的研究受到阻碍。为解决这些问题,我们构建了一个数据集,利用2023年至2025年Reddit自我披露叙述分析诈骗特征的年度趋势和路径。我们收集了21,304篇来自诈骗相关子版块的帖子,这些帖子至少包含身份、通信、平台和支付中的一个路径,通过启发式标注进行趋势分析。然后,我们通过LLM辅助方法标注了1,800篇包含显式或可恢复诈骗链的帖子,用于诈骗路径分析,该方法通过人工标注进行评估。最后,我们对帖子的评论运行主题模型,分析社区支持行为。结果表明,诈骗过程主要是多路径的。不同年份中,不同的诈骗类型和路径组件占主导地位。不同诈骗类型在路径复杂性上存在系统性差异。Reddit的支持行为随时间变得更加详细。这项工作支持合成诈骗链数据模拟和与AI相关的诈骗风险评估,但结果可能不适用于其他平台。

英文摘要

Online scam behavior is inherently multi-stage, and the lifecycle includes temporally ordered rails and events rather than isolated signals. Existing works analyze characteristics of scam types and rails, but they do not track scam trends across years. Moreover, the work on the relations between rails is hampered due to the lack of open-source datasets with annotations and coverage of different scam types. To address these gaps, we build a dataset to analyze the yearly trend of scam characteristics and rail paths using Reddit self-disclosure narratives from 2023 to 2025. We collect 21,304 posts from scam-related subreddits with at least one rail among identity, communication, platform, and payment for trend analysis by heuristic annotation. Then, we label 1,800 posts containing explicit or recoverable scam chains by an LLM-assisted method for scam path analysis. The method is evaluated with human annotation. Lastly, we run a topic model on the comments of the posts to analyze the community support behavior. The results reveal that scam processes are predominantly multi-rail. Across years, different scam types and rail components dominate. Different scam types vary systematically in path complexity. Reddit support behaviors have become more detailed over time. This work supports synthetic scam chain data simulation and AI-related scam risk assessment, though findings may not generalise to other platforms.

2606.14885 2026-06-16 cs.AI cs.CL 交叉投稿

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Dr-DCI: 通过动态工作空间扩展实现直接语料交互的规模化

Yi Lu, Zhuofeng Li, Ping Nie, Haoxiang Zhang, Yuyu Zhang, Kai Zou, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang

发表机构 * University of Toronto(多伦多大学) Texas A&M University(德克萨斯A&M大学) University of Waterloo(滑铁卢大学) UC San Diego(加州大学圣迭戈分校) Verdent AI Netmind AI

AI总结 提出DR-DCI框架,将检索作为智能体可调用的动作来动态扩展本地工作空间,结合检索器的召回能力与DCI的局部操作精度,实现大规模语料上的高效搜索与验证。

Comments 25 pages, 4 figures, 22 tables

详情
AI中文摘要

大规模语料上的智能体搜索依赖于检索器中介接口(如BM25或ColBERT)实现可扩展的候选发现。虽然这些接口在排序相关文档方面有效,但它们仅将证据呈现为排序结果或有界文档视图,限制了智能体重组材料和跨文档验证约束的能力。直接语料交互(DCI)通过暴露可shell执行的语料操作来解决这一限制,支持灵活的搜索、过滤、比较和验证。然而,随着语料增长,全语料终端命令变得缓慢且不稳定,降低了性能和效率。我们提出DR-DCI,一种检索器引导的DCI框架,将检索视为智能体可调用的动作以扩展本地工作空间。智能体不是直接操作整个语料,而是动态地将相关文档拉入一个不断演变的工作空间,并在其中执行DCI操作。这种设计结合了检索器级别的召回与DCI级别的精度:检索保持探索的可扩展性,而DCI保留有效证据解析所需的局部操作。实验表明,DR-DCI在不同规模下均有效且高效。在Browsecomp-Plus上,DR-DCI达到71.2%的准确率,相比原始DCI和消融变体提升高达8.3个百分点,同时减少了工具使用、墙钟时间和估计成本。通过保留工作空间的上下文重置,准确率进一步提升至73.3%。在语料规模实验中,DR-DCI在10万到1000万文档范围内保持有效,而原始DCI变得不稳定,BM25表现显著更差。DR-DCI还扩展到2000万规模的文件级文档Wiki-18 QA设置,在六个基准测试中平均得分63.0,优于基于检索和训练搜索智能体的基线。消融分析进一步表明,排序预览和文档间DCI是性能的关键。

英文摘要

Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On Browsecomp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.

2606.15077 2026-06-16 cs.AI cs.CL 交叉投稿

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

风险感知的LLM智能体用于地理空间数据检索:设计与初步对抗性评估

Kyle Gao, Joel Cumming, Jonathan Li, Linlin Xu, David A. Clausi

发表机构 * Dept. of Systems Design Engineering, University of Waterloo(滑铁卢大学系统设计工程系) SkyWatch Dept. of Geography and Environmental Management, University of Waterloo(滑铁卢大学地理与环境管理系) Dept. of Geomatics Engineering, University of Calgary(卡尔加里大学测绘工程系)

AI总结 提出一种基于LLM的框架,通过自然语言查询从云地理空间目录检索遥感数据,集成三个智能体实现安全、意图解析和API调用生成,初步对抗实验表明提示级安全指令提升鲁棒性但需系统级防御。

Comments Accepted for publication in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), ISPRS Congress 2026

详情
AI中文摘要

我们提出一个由LLM驱动的框架,用于通过自然语言查询从基于云的地理空间目录中检索遥感数据。该系统将用户意图转换为结构化的API调用,实现对卫星影像和环境数据集的高效访问。该架构集成了三个智能体:Guardrail用于安全和策略执行,General-QA用于意图解释,Recommender-Analyst用于模式感知的API调用生成。这种协调设计确保了与外部数据服务的可靠、语义对齐的交互。该模块化框架通过API模式替换可跨平台移植,并支持环境监测、灾害响应和气候分析等应用。它在用户意图与地理空间基础设施之间建立了可扩展的接口,实现了简化和自动化的地球观测工作流程。在对抗性多轮设置下的初步实验表明,提示级安全指令提高了鲁棒性,尽管在API操作场景中仍存在罕见的高影响失败,这突显了需要自适应、系统级的防御措施来平衡安全性、可用性和成本效率,这也激励了我们使用拦截级别的Guardrail智能体。

英文摘要

We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness. However, rare high-impact failures persist in API manipulation scenarios, highlighting the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency and motivating our intercept-level Guardrail agent.

2606.15906 2026-06-16 cs.IR cs.AI cs.CL cs.DB cs.MM 交叉投稿

MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA

MAGE-RAG:面向长文档问答的多粒度自适应图证据多模态RAG

Yilong Zuo, Xunkai Li, Jing Yuan, Qiangqiang Dai, Hongchao Qin, Ronghua Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MAGE-RAG框架,通过离线构建包含页面和元素节点的证据图,在线自适应构建证据子图,平衡证据覆盖与噪声控制,在长文档多模态问答中取得最优性能。

详情
AI中文摘要

长文档多模态问答要求系统在长PDF中定位稀疏证据,并整合来自文本、表格、图像、图表和复杂布局的线索。现有RAG方法大多依赖于文本块或页面的固定Top-k检索。文本检索可以压缩上下文,但往往丢失视觉和布局信息;页面级视觉检索保留原始页面,但也会将大量无关区域送入阅读器,导致证据覆盖、噪声和推理成本之间的静态权衡。本文提出MAGE-RAG,一种用于长文档多模态问答的多粒度自适应图证据框架。MAGE-RAG以页面检索作为查询时证据构建的入口。离线阶段,它构建一个包含页面节点和元素节点的证据图,编码包含关系、阅读顺序、布局邻接、章节层次和语义邻居关系。查询时,在线证据控制器在显式预算下迭代地激活、打开、搜索和剪枝证据。生成的证据子图随后被渲染为结构化的多模态阅读器输入,使LVLM能够在有限上下文中消费紧凑且相关的证据。在LongDocURL和MMLongBench-Doc上,我们建立了统一的比较和分析协议,涵盖直接MLLM、文本RAG、页面级视觉RAG和图/智能体RAG。实验表明,MAGE-RAG在LongDocURL上达到52.75的整体准确率,在MMLongBench-Doc上达到53.26的准确率和51.19的F1。细粒度分解、预算-性能曲线、消融和基于轨迹的分析进一步表明,查询时证据子图构建能够平衡分散证据覆盖与上下文噪声控制。我们的代码可在https://github.com/laonuo2004/MAGE-RAG.git获取。

英文摘要

Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval can compress the context but often loses visual and layout information; page-level visual retrieval preserves the original page, yet it also sends large irrelevant regions to the reader, leading to a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, allowing the LVLM to consume compact and relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance dispersed evidence coverage with context-noise control. Our code is available at https://github.com/laonuo2004/MAGE-RAG.git.

2606.15998 2026-06-16 cs.IR cs.AI cs.CL cs.LG 交叉投稿

Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

实体标签并非实体信号:文档重排序中可观测相关性的框架

Utshab Kumar Ghosh, Shubham Chatterjee

发表机构 * Department of Computer Science, Missouri University of Science and Technology(计算机科学系,密苏里科技大学)

AI总结 提出实体可观测相关性(OER)与概念相关性(CER)的区分,证明CER监督效果差,而OER对齐可显著提升重排序性能。

Comments ICTIR '26

详情
Journal ref
Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)
AI中文摘要

实体感知的文档检索使用与查询关联的实体作为排序信号,假设语义相关的实体也是有用的检索信号。我们证明这一假设是不充分的,并解释原因。与作为真实观测的词项不同,实体链接是由不完美的链接器产生的假设:如果链接器在相关和非相关文档中无差别地触发,那么一个实体可能在主题上重要,却不提供任何判别性信号。我们将此形式化为概念实体相关性(CER)——实体是否与查询主题相关——和可观测实体相关性(OER)——其在集合中的观测出现是否能区分相关与非相关文档——之间的区别。在四个集合和包括人工实体判断的标注来源上,CER和OER表现出接近随机的吻合度(κ≈0),而OER的操作化实现吻合度较高(κ≈0.5),确认CER是系统性异常值。基于CER的监督选择主题上合理但判别性弱的实体,在某些集合上仅能过滤不到4%的非相关文档。将监督与OER对齐可将非相关文档过滤提升至10倍,并在BM25基础上将开放世界MAP提升0.051。我们的发现促使实体感知检索中从概念实体相关性向可观测实体相关性的转变。

英文摘要

Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show that this assumption is insufficient and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER), whether an entity is topically related to a query, and Observable Entity Relevance (OER), whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement (κ ≈ 0), while OER operationalizations agree substantially (κ ≈ 0.5), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.
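
The agreement statistic κ reported above is presumably Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the two-annotator computation follows; the function name and the toy labels in the usage note are illustrative, not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the chance agreement implied by each annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, the sequences `[1, 1, 0, 0]` and `[1, 0, 1, 0]` agree on half the items, exactly what chance predicts for balanced labels, so kappa is 0: the "near-chance agreement" regime the abstract describes for CER versus OER.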

2606.16661 2026-06-16 cs.IR cs.CL 交叉投稿

SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG

SCAR: 语义连续性感知检索以实现RAG中的高效上下文扩展

Nathanaël Langlois

AI总结 提出SCAR自适应检索策略,通过查询-邻居相关性与结构连续性惩罚的权衡,选择性扩展相邻块,在保持高召回率的同时显著减少token开销。

Comments 5 pages, 1 figure

详情
AI中文摘要

检索增强生成(RAG)中的固定长度分块常导致边界碎片化,关键证据被分割到不同片段中,降低检索召回率。虽然静态窗口化和父检索提高了召回率,但引入了显著的token开销。我们提出SCAR(语义连续性感知检索),一种自适应检索策略,通过权衡查询-邻居相关性与结构连续性惩罚,选择性扩展相邻块。SCAR使用与每个检索块自身查询相关性相关的相对扩展阈值,产生近似尺度不变的决策规则,可在不同嵌入模型间迁移而无需重新校准。在四个不同语料库(RFC、GDPR、一份10-K报告和一份合并协议;N=320个查询;160个边界碎片化查询)上,SCAR在边界碎片化查询上实现了92.8%的召回率,仅使用7.84个块,相比静态窗口化(10.16个块)减少了22.9%。配对bootstrap检验(B=10,000)确认块减少非常显著(p<0.0001,Cohen's d=-1.49,大效应),召回率差异较小(Cohen's d=-0.33)。该策略在三个嵌入模型(text-embedding-3-large、BGE-large-en-v1.5、zembed-1)上使用相同的单一超参数设置进行迁移,并且在10-K语料库上的下游RAGAS评估证实,SCAR在保持生成忠实度的同时将上下文token减少了27.1%。

英文摘要

Fixed-length chunking in Retrieval-Augmented Generation (RAG) often leads to boundary fragmentation, where critical evidence is split across segments, degrading retrieval recall. While static windowing and parent retrieval improve recall, they introduce significant token overhead. We propose SCAR (Semantic Continuity-Aware Retrieval), an adaptive retrieval policy that selectively expands neighboring chunks by weighing query-neighbor relevance against a structural continuity penalty. SCAR uses a relative expansion threshold tied to each retrieved chunk's own query-relevance, yielding an approximately scale-invariant decision rule that transfers across embedding models without recalibration. Across four diverse corpora (RFC, GDPR, a 10-K report, and a Merger agreement; N=320 queries; 160 boundary-fragmented), SCAR achieves 92.8% recall on boundary-fragmented queries with only 7.84 chunks, a 22.9% reduction compared to static windowing (10.16 chunks). Paired bootstrap tests (B=10,000) confirm the chunk reduction is highly significant (p<0.0001, Cohen's d=-1.49, large effect), with a small recall difference (Cohen's d=-0.33). The policy transfers across three embedding models (text-embedding-3-large, BGE-large-en-v1.5, zembed-1) using the same single hyperparameter setting, and downstream RAGAS evaluation on the 10-K corpus confirms SCAR preserves generation faithfulness while reducing context tokens by 27.1%.
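
A relative expansion rule of the kind described above might look like the following sketch. The exact scoring function, the parameter names (`ratio`, `penalty_weight`), and their default values are assumptions made for illustration; the abstract only states that query-neighbor relevance is weighed against a continuity penalty and compared with a threshold tied to the retrieved chunk's own relevance.

```python
def should_expand(chunk_sim, neighbor_sim, continuity_penalty,
                  ratio=0.8, penalty_weight=0.1):
    """Expand to a neighbor if its penalized relevance reaches a fraction of the chunk's own."""
    adjusted = neighbor_sim - penalty_weight * continuity_penalty
    return adjusted >= ratio * chunk_sim


def expand_chunks(retrieved, sims, penalties, num_chunks,
                  ratio=0.8, penalty_weight=0.1):
    """Return sorted chunk indices after selectively adding adjacent chunks.

    retrieved: indices of retrieved chunks
    sims: dict mapping chunk index -> query similarity
    penalties: dict mapping (chunk, neighbor) -> continuity penalty (default 0.0)
    """
    selected = set(retrieved)
    for i in retrieved:
        for j in (i - 1, i + 1):
            if 0 <= j < num_chunks and j not in selected:
                penalty = penalties.get((i, j), 0.0)
                if should_expand(sims[i], sims[j], penalty, ratio, penalty_weight):
                    selected.add(j)
    return sorted(selected)
```

Because the threshold is a ratio of the retrieved chunk's own similarity rather than an absolute cutoff, rescaling all similarities by a constant leaves the decision unchanged when the penalty term is zero, which is one plausible reading of the "approximately scale-invariant" property.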

2501.09310 2026-06-16 cs.CL cs.AI cs.SE 版本更新

Understanding, Detecting, and Repairing Real-World In-Context-Learning-Based Text-to-SQL Errors

理解、检测和修复基于上下文学习的真实世界文本到SQL错误

Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu

发表机构 * East China Normal University(华东师范大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本研究首次全面调查基于上下文学习的文本到SQL错误,总结27种错误类型,并提出MapleDoctor框架,相比现有方法修复率提高13.8%,误修复极少,延迟降低67.4%。

Comments Accepted by FSE 2026

详情
AI中文摘要

大型语言模型(LLMs)已被用于文本到SQL任务,利用其上下文学习(ICL)能力将自然语言问题转换为SQL查询。然而,这种技术面临正确性问题。在本文中,我们首次对基于ICL的文本到SQL错误进行了全面研究。我们的研究涵盖了四种代表性的ICL技术、五种基本修复方法、两个基准测试和两种LLM设置。我们发现文本到SQL错误普遍存在,并总结了7个类别的27种错误类型。我们还发现,现有的修复尝试在正确性提升方面有限,同时具有高计算开销和许多误修复。基于这些发现,我们提出了MapleDoctor,一种新颖的文本到SQL错误检测和修复框架。评估表明,MapleDoctor优于现有解决方案,修复了13.8%更多的查询,误修复数量可忽略不计,并减少了67.4%的修复延迟。该工件可在GitHub上公开获取。

英文摘要

Large language models (LLMs) have been adopted for text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into SQL queries. However, such a technique faces correctness problems. In this paper, we conduct the first comprehensive study of text-to-SQL errors of ICL-based techniques. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 27 error types in 7 categories. We also find that existing repair attempts yield limited correctness improvement while incurring high computational overhead and many mis-repairs. Based on these findings, we propose MapleDoctor, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleDoctor outperforms existing solutions by repairing 13.8% more queries with a negligible number of mis-repairs and reducing repair latency by 67.4%. The artifact is publicly available on GitHub.

2510.06198 2026-06-16 cs.CL cs.IR 版本更新

The Answer Lies Within: Self-Derived Rewards Enable Explainable Relation Extraction

答案源于内部:自衍生奖励实现可解释关系抽取

Xinyu Guo, Zhengliang Shi, Minglai Yang, Mihai Surdeanu

发表机构 * University of Arizona(亚利桑那大学) Shandong University(山东大学)

AI总结 针对大语言模型在无预定义标签的单次关系抽取中易受无关词干扰和抽象层级不匹配的问题,提出COGRE认知推理框架和HIT@DICT强化学习奖励策略,通过自动提取信用词典奖励关系相关短语,显著提升准确率和解释质量。

Comments Work in progress

详情
AI中文摘要

尽管大语言模型具有显著的推理能力,但在没有预定义关系标签的单次关系抽取中仍存在困难。我们识别出两个陷阱:模型常被无关词元而非传达关系的语义误导,并且往往无法与人类标注者期望的抽象层级对齐。我们提出一个新颖框架,通过两个组件弥合这一差距:(1) COGRE,一个受认知启发的推理框架,将关系抽取结构化为一系列模拟人类文本处理的过程;(2) HIT@DICT,一种强化学习中间奖励策略,通过奖励推理中与关系相关的短语,鼓励推理与关系标签对齐。该奖励基于从正确预测中自动提取的信用词典推导得出。实验表明,我们的框架通过解决这两个陷阱,同时提高了准确率和解释质量。例如,COGRE搭配Qwen2.5-14B-Instruct在单次NYT29上达到24.65%的F1分数,超越了先前基于推理的设计。使用HIT@DICT进行强化学习优化后,性能进一步提升+23.46个百分点。最后,人工评估显示,我们的最佳模型生成的关系短语与金标签高度对齐,使人工解释质量评分相对提高54%。

英文摘要

Despite the remarkable reasoning capabilities of large language models, they still struggle with one-shot relation extraction without predefined relation labels. We identify two pitfalls: models are often misled by irrelevant tokens instead of relation-conveying semantics, and they often fail to align with the abstraction level human annotators expect. We introduce a novel framework that closes this gap with two components: (1) COGRE, a cognitively-inspired reasoning framework that structures RE into a series of processes mimicking human text-processing; and (2) HIT@DICT, a reinforcement learning intermediate reward strategy that encourages reasoning to align with relational labels by rewarding relation-relevant phrases in reasoning. The reward is derived from a credit dictionary automatically extracted from correct predictions. Our experiments show that our framework improves both accuracy and explanation quality by addressing these two pitfalls. For example, COGRE with Qwen2.5-14B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using HIT@DICT further improves performance by 23.46 percentage points. Finally, human evaluation shows that our best model generates relational phrases closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).

2511.16681 2026-06-16 cs.CL cs.AI 版本更新

SPI: Query-Depth-Adaptive Indexing for Streaming RAG in Vector Databases

SPI:向量数据库中流式RAG的查询深度自适应索引

Dong Liu, Yanxuan Yu

发表机构 * Yale University(耶鲁大学) Columbia University(哥伦比亚大学)

AI总结 提出语义金字塔索引(SPI),通过多级分辨率组织和不确定性感知控制器实现查询深度自适应,支持流式插入和渐进式ANN搜索,在MS MARCO和Natural Questions上相比基线实现1.4-2.3倍延迟降低。

详情
AI中文摘要

向量数据库(VecDB)越来越多地部署在检索增强生成(RAG)管道中,其中查询处理和文档摄取同时发生。索引层需要提供低延迟搜索,同时在不频繁全局重建的情况下纳入新向量。现有的VecDB管道通常在统一表示机制下运行,尽管查询所需的语义粒度存在显著差异。这促使设计一种支持增量更新同时根据查询分布和复杂性调整检索深度的索引。我们提出语义金字塔索引(SPI),一种VecDB层索引框架,将嵌入组织成L个语义对齐的分辨率级别,并通过轻量级不确定性感知控制器为每个查询选择检索深度。SPI支持渐进式粗到细ANN搜索、无需全局重建的逐级流式插入,以及通过LSH分区和异步gRPC协调的分布式执行。与具有固定遍历规则的分层ANN结构(例如SPANN)不同,SPI在查询时自适应分辨率,同时保持与FAISS和Qdrant后端的兼容性。在MS MARCO和Natural Questions上,在相同密集编码器系列下,SPI在Recall@10上具有竞争力且延迟更低,相对于可比较的近似ANN基线,在固定Recall@10目标下实现了1.4-2.3倍的平均检索延迟降低。一个最多8个节点的原型扩展研究显示吞吐量扩展了6.2倍(约73%效率);为完整性包含了16节点配置,但显示出递减的效率。我们提供了top-K稳定性保证:具有足够检索裕度的查询在较浅层返回相同的top-K集合。代码和配置可在 https://github.com/FastLM/SPI_VecDB 获取。

英文摘要

Vector databases (VecDBs) are increasingly deployed in retrieval-augmented generation (RAG) pipelines where query processing and document ingestion occur concurrently. The index layer needs to provide low-latency search while incorporating new vectors without frequent global rebuilding. Existing VecDB pipelines typically operate within a uniform representation regime, despite substantial variation in the semantic granularity required across queries. This motivates an index design that supports incremental updates while adapting retrieval depth to query distribution and complexity. We propose Semantic Pyramid Indexing (SPI), a VecDB-layer indexing framework that organizes embeddings into L semantically aligned resolution levels and selects retrieval depth per query via a lightweight uncertainty-aware controller. SPI supports progressive coarse-to-fine ANN search, level-wise streaming insertion without global rebuilds, and distributed execution through LSH partitioning with asynchronous gRPC coordination. Unlike hierarchical ANN structures with fixed traversal rules (e.g., SPANN), SPI adapts resolution at query time while remaining compatible with FAISS and Qdrant backends. On MS MARCO and Natural Questions, SPI achieves competitive Recall@10 with lower latency under the same dense encoder family, yielding a 1.4-2.3× average retrieval latency reduction under fixed Recall@10 targets relative to comparable approximate-ANN baselines. A prototype scaling study up to 8 nodes shows 6.2× throughput scaling (≈73% efficiency); the 16-node configuration is included for completeness but shows diminishing efficiency. We provide a top-K stability guarantee: queries with sufficient retrieval margin return an identical top-K set at a shallower level. Code and configurations are available at https://github.com/FastLM/SPI_VecDB.
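
One way to picture per-query depth selection is a margin test on the top-K boundary: stop at the shallowest level where the gap between the K-th and (K+1)-th scores is large enough that deeper levels are unlikely to change the result set. The sketch below is an assumption-laden illustration of that idea only; the function names and the margin criterion are hypothetical and are not the controller described in the paper.

```python
def top_k_margin(scores, k):
    """Gap between the k-th and (k+1)-th best scores; infinite if fewer than k+1 candidates."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= k:
        return float("inf")
    return ranked[k - 1] - ranked[k]


def select_depth(scores_per_level, k, min_margin):
    """Return the shallowest level index whose top-k margin reaches min_margin.

    scores_per_level: candidate scores at each level, ordered coarse to fine.
    Falls back to the deepest level when no level is confident enough.
    """
    for depth, scores in enumerate(scores_per_level):
        if top_k_margin(scores, k) >= min_margin:
            return depth
    return len(scores_per_level) - 1
```

Easy queries with a clear winner at the coarse level exit early under this rule, while ambiguous queries fall through to finer levels, which is where a latency reduction at fixed recall would come from.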

2603.19595 2026-06-16 cs.IR cs.CL 版本更新

All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

All-Mem: 通过动态拓扑演化实现智能体终身记忆

Can Lv, Heng Chang, Shengyu Tao, Mingju Chen, Zhaoxin Fan, Ziwei Zhang, Yuchen Guo, Shiji Zhou

发表机构 * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing(北京未来区块链与隐私计算高精尖创新中心) Beihang University(北京航空航天大学) Tsinghua University(清华大学) Chalmers University of Technology(查尔姆斯理工大学)

AI总结 提出All-Mem框架,通过在线/离线结合的非破坏性拓扑结构记忆库,解决终身交互代理中历史增长导致的检索冗余和噪声问题,在LoCoMo和LongMemEval-s上提升检索与问答性能。

详情
AI中文摘要

终身交互代理期望在数月或数年内协助用户,这需要在固定上下文和延迟预算下持续写入长期记忆,同时为每个新查询检索正确的证据。现有的记忆系统随着历史增长往往会退化,产生冗余、过时或噪声的检索上下文。我们提出All-Mem,一个在线/离线终身记忆框架,通过显式的、非破坏性的整合维护一个拓扑结构化的记忆库,避免了基于摘要压缩的典型不可逆信息损失。在在线操作中,它将检索锚定在有界可见表面上以保持粗略搜索成本有界。定期离线时,LLM诊断器提出置信度评分的拓扑编辑,通过三个算子(拆分、合并和更新)执行门控,同时保留不可变证据以保持可追溯性。在查询时,类型化链接支持从活动锚点到存档证据的跳数有界、预算可控的扩展。在LoCoMo和LongMemEval-s上的实验表明,与代表性基线相比,检索和问答性能得到提升。代码可在 https://github.com/LvCan926/All-Mem 获取。

英文摘要

Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long-term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology-structured memory bank via explicit, non-destructive consolidation, avoiding the irreversible information loss typical of summarization-based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence-scored topology edits that are executed with gating using three operators (Split, Merge, and Update) while preserving immutable evidence for traceability. At query time, typed links enable hop-bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LoCoMo and LongMemEval-s show improved retrieval and QA over representative baselines. The code is available at https://github.com/LvCan926/All-Mem.
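
Hop-bounded, budgeted expansion over typed links can be sketched as a breadth-first traversal with two stopping conditions. The data layout (a dict from node to a list of `(link_type, target)` pairs), the link type names in the usage note, and the function name are assumptions for illustration and do not reflect the All-Mem codebase.

```python
from collections import deque


def bounded_expand(links, anchors, max_hops, budget):
    """Breadth-first expansion from anchors, limited by hop count and total node budget.

    links: dict mapping node -> list of (link_type, target) pairs
    Returns nodes in visit order, anchors first.
    """
    seen = set()
    order = []
    queue = deque()
    for anchor in anchors:
        if anchor not in seen and len(order) < budget:
            seen.add(anchor)
            order.append(anchor)
            queue.append((anchor, 0))
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow links beyond the hop bound
        for _link_type, target in links.get(node, []):
            if len(order) >= budget:
                return order  # budget exhausted
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, hops + 1))
    return order
```

For instance, with `links = {"a": [("derived_from", "b"), ("supersedes", "c")], "b": [("derived_from", "d")], "d": [("derived_from", "e")]}`, expanding from `"a"` with `max_hops=2` reaches `b`, `c`, and `d` but never `e`, and a budget of 3 cuts the walk off after `c`.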

4. 对话系统与智能体 31 篇

2606.14832 2026-06-16 cs.CL 新提交

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

PhoneHarness: 通过混合GUI、CLI和工具操作利用手机使用代理

Chenxin Li, Zhengyao Fang, Zhengyang Tang, Pengyuan Lyu, Xingran Zhou, Xin Lai, Fei Tang, Liang Wu, Yiduo Guo, Weinong Wang, Junyi Li, Yi Zhang, Yang Ding, Huawen Shen, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu

发表机构 * Tencent Hunyuan(腾讯混元) The Chinese University of Hong Kong(香港中文大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学)

AI总结 提出PhoneHarness,一个混合动作基准和执行框架,用于评估手机代理在可验证移动工作流中的表现,通过GUI、CLI和工具动作的组合,达到75.0%通过率,比最强基线高12.9个百分点。

Comments Project Page: https://phoneharness.github.io/

详情
AI中文摘要

手机代理越来越被期望完成真实的移动工作流,而不仅仅是预测下一个屏幕动作。然而,当前许多移动代理文献仍然主要将代理评估为GUI控制器,它们观察屏幕、发出点击和滑动,并根据目标应用状态评分。真实的手机使用任务更为广泛:它们需要决定何时使用应用GUI、设备端命令或结构化工具,同时留下预期副作用实际发生的证据。我们引入了PhoneHarness,一个混合动作基准和执行框架,用于研究在可验证移动工作流上的手机使用代理。PhoneHarness在GUI、CLI和主机端工具动作上运行设备端代理循环,结合确定性动作路由与有界GUI委托和可审计执行轨迹。其基准PhoneHarness Bench评估代理是否完成具有可观察副作用的任务,而不仅仅是产生合理的最终答案。在注释评估集上,PhoneHarness达到75.0%的通过率,比最强的非PhoneHarness设置高出12.9个百分点。因此,PhoneHarness和PhoneHarness Bench扮演着不同但相互依赖的角色:框架使混合手机工作流可执行,而基准衡量代理是否能够可靠且安全地使用该框架。我们的发现表明,可靠的手机自动化依赖于动作表面路由和可验证执行,而不仅仅是视觉GUI控制。

English abstract

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.

2606.15017 2026-06-16 cs.CL new submission

Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

在线技能与记忆模块是否总是值得其令牌消耗?Web代理的预算约束研究

Sina Hajimiri, Masih Aminbeidokhti, Jose Dolz, Ismail Ben Ayed, Issam H. Laradji, Spandana Gella, Nicolas Gontier

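Affiliations * ServiceNow AI Research(ServiceNow AI 研究院) ÉTS Montreal(蒙特利尔高等技术学院) University of British Columbia(不列颠哥伦比亚大学) McGill University(麦吉尔大学)

AI summary Under a fixed inference budget, compares three online augmentation modules against a token-matched baseline and finds that the baseline matches or surpasses every augmentation method in aggregate success rate, often while using fewer tokens.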

AI Chinese abstract

在线Web代理通常用记忆、工作流或技能模块增强基础执行器。这些模块可提升性能,但也消耗测试时令牌,这一成本很少与执行器的推理成本一同报告。我们研究在线增强(每项任务都需支付此开销),并在固定总推理预算下重新评估其收益。我们将AWM、ASI和ReasoningBank与令牌匹配的普通基线(使用相同预算进行额外执行器步骤)进行比较。在三个WebArena领域和三个模型(Gemini 3 Flash、GPT-5.4-mini和Qwen 3.6-27B)上,普通基线在总成功率上匹配或超越所有三种增强方法,同时通常使用更少总令牌。我们在WorkArena-L1上使用Qwen 3.6-27B观察到类似趋势,表明该效果扩展到企业知识工作任务。我们的结果表明,技能和工作流记忆在特定领域可能有用,但其表面收益在预算匹配的执行器面前往往消失。我们进一步表明,运行间方差显著影响结果,应作为在线Web代理的核心评估标准报告。

English abstract

Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.

2606.15152 2026-06-16 cs.CL new submission

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

智能体能否读懂房间气氛?多模态模拟中的视觉社交智能基准测试

Shijun Wan, Xuehai Wu, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

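Affiliations * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) The Chinese University of Hong Kong(香港中文大学)

AI summary Proposes the AgentViSS benchmark, which evaluates the visual social intelligence of multimodal large language models through 240 scenarios and four role-level tasks, and finds that local role enactment is near saturation while interaction regulation remains difficult.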

AI Chinese abstract

社交互动依赖于语言和可见的社交信号,如面部表情、姿势、注视和情绪变化。然而,现有的社交智能体基准大多基于文本,很少测试多模态智能体是否能够利用视觉线索来指导互动。我们引入了AgentViSS,一个评估多模态社交模拟中视觉社交智能的基准。它包含240个场景、585个角色实例和2,340个角色任务实例,结合了对齐的文本-视觉证据、结构化角色档案和四个角色级任务:表情任务、特征任务、交互调控任务和交互结果任务。在口头视觉和直接视觉条件下评估七个最新的多模态大语言模型,揭示了局部角色扮演与交互管理之间的明显差距:角色特定的表情和冲突处理接近饱和,而交互调控和基于视觉的结果实现仍然困难得多。代码已发布在https://github.com/JunsWan/AgentViSS,数据集可在https://huggingface.co/datasets/JunsWan/AgentViSS获取。

English abstract

Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce AgentViSS, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision conditions reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.

2606.15390 2026-06-16 cs.CL cs.AI cs.LG new submission

Not All Skills Help: Measuring and Repairing Agent Knowledge

并非所有技能都有用:测量与修复智能体知识

Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao

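Affiliations * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Purdue(普渡大学) NVIDIA(英伟达)

AI summary Proposes the ASSAY framework, which measures each skill's causal contribution via randomized masking, separates skill generation from curation, and suppresses harmful skills at inference time, significantly improving LLM agents' task completion rates.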

Comments 18 pages, 5 figures

AI Chinese abstract

LLM智能体可以通过从经验中积累自然语言技能来改进,而无需更新权重,但当前系统将所有关于保留哪些技能以及如何应用它们的决策完全交由LLM判断。我们认为这混淆了两个不同的角色:从经验中生成技能是判断擅长的创造性行为,而决定该技能是否真正有帮助则需要跨多个任务的实证证据。通过随机掩码测量每个技能的因果贡献,我们发现技能库表现出普遍的因果异质性:单个技能通常在某些任务类型上有帮助,但在其他任务类型上有害,然而它们的相反效应在总体上相互抵消,使得全局筛选方法无法察觉。我们提出ASSAY,一个将生成与筛选分离的框架:它在小型开发集上计算每个技能的因果归因,离线重组技能库,并为每个测试任务抑制预测效应为负的技能。在跨越四个提供商的七个基础模型以及两个基准(AppWorld和tau-bench)上,ASSAY始终优于先前的技能筛选方法。在AppWorld最难的数据划分上,DeepSeek-V3实现了69.3%的任务目标完成率(相对提升47.4%),在所有已发表方法(包括权重调整方法)中达到了新的最先进水平。在tau-bench零售领域,GPT-4.1相对提升8.7%,在公开排行榜上超越了o4-mini、o1和GPT-4.5,且无需任何权重修改。消融实验将主要收益归因于每任务掩码,证实瓶颈在于推理时将技能与任务匹配,而非全局移除不良技能。代码已开源:https://github.com/aiming-lab/assay。

English abstract

LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming-lab/assay.
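
The randomized-masking idea in this abstract can be illustrated with a minimal sketch. This is not the ASSAY implementation: the difference-in-means estimator, the 0.5 masking probability, the `margin` threshold, the toy outcome function, and every name below are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def run_masked_trials(skills, task_types, outcome_fn, trials_per_type=200, seed=0):
    """Run each task type many times, masking every skill independently with p = 0.5."""
    rng = random.Random(seed)
    records = []
    for task_type in task_types:
        for _ in range(trials_per_type):
            active = frozenset(s for s in skills if rng.random() < 0.5)
            records.append((task_type, active, outcome_fn(task_type, active)))
    return records


def estimate_effects(records, skills):
    """Mean success with the skill active minus mean success with it masked,
    computed separately for every (skill, task type) pair."""
    buckets = defaultdict(lambda: {True: [], False: []})
    for task_type, active, success in records:
        for skill in skills:
            buckets[(skill, task_type)][skill in active].append(success)
    return {
        key: sum(b[True]) / len(b[True]) - sum(b[False]) / len(b[False])
        for key, b in buckets.items()
    }


def skills_to_suppress(effects, task_type, margin=0.1):
    """Skills whose estimated effect on this task type is clearly negative."""
    return {s for (s, t), eff in effects.items() if t == task_type and eff < -margin}


# Hypothetical toy environment: one skill helps "lookup" tasks and hurts "checkout" tasks.
def toy_outcome(task_type, active):
    score = 0.5
    if "use_search" in active:
        score += 0.3 if task_type == "lookup" else -0.3
    return score


SKILLS = ["use_search", "confirm_first"]
records = run_masked_trials(SKILLS, ["lookup", "checkout"], toy_outcome)
effects = estimate_effects(records, SKILLS)
print(skills_to_suppress(effects, "checkout"))
```

In this toy setup the two per-type effects of `use_search` have opposite signs and average to roughly zero, which is the kind of cancellation the abstract says global curation cannot see; only the per-task-type view flags the skill for suppression.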

2606.15405 2026-06-16 cs.CL cs.AI new submission

T-Mem: Memory That Anticipates, Not Archives

T-Mem:预测而非归档的记忆

Weidong Guo, Dakai Wang, Zixuan Wang, Hui Liu, Yu Xu

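Affiliations * Tencent(腾讯)

AI summary Proposes the T-Mem architecture, which uses write-time triggers to cover both descriptive and associative recall, addressing retrieval by semantic association in long conversations and reaching state-of-the-art results on LoCoMo and LoCoMo-Plus.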

AI Chinese abstract

长期记忆对于对话代理在扩展对话中保持连贯性、遵循多个会话前做出的承诺以及根据每个用户调整行为至关重要。然而,当前基于LLM的长期对话记忆受限于查询与存储内容(包括词汇和稠密向量)之间的相似性。当查询和记忆共享表面特征(如措辞或命名实体,我们称之为描述性)时,该方法有效。但它忽略了另一类同样有价值的案例,即查询和记忆不共享表面特征,仅通过潜在语义弧(关联性)相连。在这种机制下,现有的长期记忆系统普遍失败。覆盖这另一半使得助手首次能够主动将过去的对话作为语义资产。在记忆方面,这是认知科学中称为情景未来思维的工程对应物:预演过去的经验,以便在未来需要找到它的上下文中使用。我们将这些写时预演称为触发器。我们提出T-Mem,这是第一个覆盖描述性和关联性回忆的长期对话记忆架构。在两种证据粒度(单个事实和完整交流)上,T-Mem实例化一个描述性触发器家族和一个关联性触发器家族,使得每个记忆都能从表面相似和相关性约束的查询中访问。作为实证验证,T-Mem在LoCoMo和LoCoMo-Plus上达到了最先进水平。

English abstract

Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). In this regime, prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art on both LoCoMo and LoCoMo-Plus.

2606.15422 2026-06-16 cs.CL q-bio.BM new submission

Pepti-Agent: An AI Agent for Peptide Design and Optimization

Pepti-Agent: 一种用于肽设计与优化的人工智能代理

Houxu Chen, Achuth Chandrasekhar, Amir Barati Farimani

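Affiliations * Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA(生物医学工程系,卡内基梅隆大学,匹兹堡,PA 15213,美国) Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA(机械工程系,卡内基梅隆大学,匹兹堡,PA 15213,美国)

AI summary Proposes Pepti-Agent, a closed-loop peptide design framework built on the Model Context Protocol (MCP) that combines independently inspectable generation, prediction, and mutation tools with a large language model controller and live property prediction to support multi-objective optimization and reproducible benchmarking.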

AI Chinese abstract

治疗性肽占据小分子和生物制剂之间有价值的设计空间,但它们的开发需要同时满足几个相互竞争的约束:溶解度、溶血活性和非特异性表面污染由重叠的序列特征控制,因此改善一个属性往往会降低另一个属性。计算设计通过将生成模型与基于序列的属性预测器配对,迭代地提出和优化候选物来解决这一问题。然而,这些组件通常被连接成难以检查、扩展或重用的整体脚本,并且它们通常通过自然语言推理而不是跟踪每个候选物不断变化的多属性状态来优化序列。我们提出了Pepti-Agent,一个闭环的、肽特异性的框架,它将生成、属性预测和单残基突变暴露为可独立检查的模型上下文协议(MCP)工具。一个大语言模型控制器调用这些工具,并在调用之间查阅实时的预测器输出,因此优化由每个序列当前的属性概况指导,而不是仅由语言推理指导。任务特异性的PeptideGPT模型生成候选物,基于ProtBERT的分类器对溶解度、溶血和非污染进行评分,两个可互换的突变算子提出序列编辑。通过记录控制器决策、预测器输出和接受突变的每一步迹,Pepti-Agent为多目标设计策略的基准测试和为实验验证优先排序候选物提供了可复现的基础。

English abstract

Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence-based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural-language reasoning rather than by tracking the evolving multi-property state of each candidate. We present Pepti-Agent, a closed-loop, peptide-specific framework that exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence's current property profile rather than by language reasoning alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and two interchangeable mutation operators propose sequence edits. By recording a per-step trace of controller decisions, predictor outputs, and accepted mutations, Pepti-Agent offers a reproducible substrate for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.

2606.15532 2026-06-16 cs.CL cs.LG new submission

EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

EIBench: 基于模拟器的基准测试和用于情绪管理的回合信用强化学习

Rongzhi Zhu, Xiang Huang, Yuchuan Wu, Rui Wang, Zequn Sun, Tao Ren, Weiyao Luo, Bingxue Qiu, Jieping Ye, Yongbin Li, Wei Hu

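Affiliations * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) Qwen-Character Team, Alibaba Group(阿里巴巴集团Qwen-Character团队)

AI summary Proposes EIBench, a simulator-based benchmark of 2,222 scenarios that evaluates multi-turn emotion management under a 2x2 taxonomy (Support, Defense, Repair, Charm), and designs CTC-GRPO, which uses per-turn state updates as dense feedback to improve models' emotional intelligence.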

AI Chinese abstract

大型语言模型(LLM)的情绪智能(EI)通常通过静态理解任务或单轮对话生成来评估。然而,情绪管理是交互式的:一个好的模型不仅应识别用户的情绪,还应在多轮对话中改善用户的情绪和关系状态。我们引入了EIBench,一个基于模拟器的交互式情绪管理基准。EIBench包含2222个场景,其中2009个用于训练,213个用于保留测试。场景按2x2分类法组织,涵盖支持、防御、修复和魅力,分别对应不同形式的支持、边界维护、信任修复和融洽关系建立。在每个场景中,LLM模拟器扮演用户,每轮后更新情绪-关系状态,并将最终状态映射到基于锚点的分数。这一设计使EIBench既是一个评估基准,也是一个训练环境:最终状态提供结果奖励,而逐轮状态更新为强化学习提供密集反馈。我们评估了15个开源和闭源LLM。当前模型在支持和融洽关系建立场景中表现良好,但在用户压力下的边界维护方面存在困难。为了提升LLM的情绪智能能力,我们提出了中心化回合信用GRPO(CTC-GRPO),这是GRPO的一个扩展,它重用模拟器的逐轮状态更新作为密集的回合级反馈,同时保留最终结果奖励。CTC-GRPO将Qwen3-8B在EIBench上的得分从-22.4提升至+22.4,并在分布外评估(包括SAGE +12.4和EQBench3 +20.9%)中也有所提升。我们的结果表明,模拟器追踪的用户状态可以支持多轮情绪管理的评估和训练。

English abstract

Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only recognize a user's emotion, but also improve the user's emotional and relational state over several turns. We introduce EIBench, a simulator-based benchmark for interactive emotion management. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held-out testing. The scenarios are organized by a 2x2 taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building. In each scenario, an LLM simulator plays the user, updates an emotion-relation state after each turn, and maps the final state to an anchor-based score. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per-turn state updates provide dense feedback for RL. We evaluate 15 open- and closed-source LLMs. Current models perform well on support and rapport-building scenes, but struggle with boundary maintenance under user pressure. To improve the EI ability of LLMs, we propose Centered Turn-Credit GRPO (CTC-GRPO), a GRPO extension that reuses the simulator's per-turn state updates as dense turn-level feedback while preserving the final outcome reward. CTC-GRPO improves Qwen3-8B from -22.4 to +22.4 on EIBench and also improves on out-of-distribution evaluations including SAGE (+12.4) and EQBench3 (+20.9%). Our results show that simulator-tracked user states can support both evaluation and training for multi-turn emotion management.

2606.15911 2026-06-16 cs.CL cs.IR new submission

Interactor: Agentic RL oriented Iterative Creation for Ad Description Generation in Sponsored Search

Interactor: 面向赞助搜索中广告描述生成的智能体强化学习迭代创建框架

Penghui Wei, Jiayu Wu, Chao Ye, Zhi Guo, Shuanglong Li, Lin Liu

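Affiliations * Baidu Inc.(百度公司)

AI summary Proposes the Interactor framework, which uses agentic reinforcement learning to generate ad descriptions through multi-turn iteration, with multiple generative reward models assessing knowledge capacity and landing-page consistency, significantly improving the knowledge richness and faithfulness of the descriptions.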

AI Chinese abstract

本文聚焦于自动生成赞助搜索中信息丰富的广告描述。与通常优化以吸引用户点击反馈的广告标题不同,广告描述具有更长的文本跨度,并有可能融入世界知识来满足用户搜索意图,同时呈现广告的细粒度卖点。我们提出了Interactor,一个基于智能体强化学习优化的多轮迭代创建框架,用于广告描述生成。生成模型作为策略,与由多个生成奖励模型组成的定制环境交互。给定策略的初始生成结果,定制的GenRMs评估包括知识容量和落地页一致性在内的多维质量,提供二元信号和推理反馈。策略随后基于这些反馈迭代优化描述,确保持续改进。在工业数据集上的实验表明,Interactor框架在生成知识丰富且忠实的广告描述方面显著优于最先进的方法。自2026年5月起,它已在领先的搜索广告系统中在线部署,为广告收入和用户体验做出贡献。

English abstract

This paper focuses on automatically generating informative ad descriptions in sponsored search. Unlike ad titles, which are usually optimized to attract user click feedback, ad descriptions have a longer text span and can incorporate world knowledge to address user search intents while presenting the fine-grained selling points of the ads. We propose Interactor, a multi-turn iterative creation framework optimized with agentic RL for ad description generation. The generation model acts as a policy that interacts with a customized environment consisting of multiple generative reward models (GenRMs). Given initial generations by the policy, the customized GenRMs evaluate multi-dimensional qualities including knowledge capacity and landing page consistency, providing both binary signals and reasoning feedback. The policy then iteratively refines the descriptions based on this feedback to ensure continuous improvement. Experiments on industrial datasets show that the Interactor framework significantly outperforms state-of-the-art approaches in generating knowledge-rich and faithful ad descriptions. Since May 2026, it has been deployed online in a leading search ads system, contributing to both ad revenue and user experience.

2606.16111 2026-06-16 cs.CL new submission

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

面向帕累托最优工具集成智能体的帕累托排名策略优化

Junyi Li, Xiaowei Qian, Yingyi Zhang, Wenlin Zhang, Guojing Li, Sheng Zhang, Xiao Han, Yichao Wang, Xiangyu Zhao

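Affiliations * University of Science and Technology of China(中国科学技术大学)

AI summary Proposes the ParetoPO framework, which uses hypervolume-guided dynamic scalarization and Pareto-ranking-based advantage computation to optimize the accuracy-efficiency trade-off of tool-using language models under multiple objectives.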

Comments ICML 2026 Spotlight Paper

AI Chinese abstract

近期工具集成语言智能体的进展显著提升了其解决复杂推理任务的能力。然而,现有对齐方法主要关注最大化任务准确率,而忽略了工具使用效率等辅助目标,这些目标对于实际部署至关重要。为解决这一差距,我们提出ParetoPO,一个两阶段多目标优化框架,用于在竞争目标下对齐使用工具的大型语言模型(LLMs)。在第一阶段,ParetoPO利用超体积引导的动态标量化,基于全局帕累托前沿进展自适应调整奖励权重。在第二阶段,它用基于帕累托排名的优势计算替代标量化学习信号,通过优势感知的信用分配促进非支配轨迹。该设计能够在多个冲突目标间实现细粒度的动作级优化。在数学推理和多跳问答任务上的实验结果表明,与静态和启发式基线相比,ParetoPO始终能发现具有更优准确率-效率权衡的策略。

English abstract

Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematical reasoning and multi-hop QA tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.
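
As a rough illustration of Pareto-ranking-based credit, the sketch below sorts a group of trajectories into nondominated fronts and turns the front index into a centered advantage. It is a generic nondominated-sorting example, not ParetoPO's actual advantage formula: the objective tuple (accuracy, negated tool-call count), the mean-centering step, and all names are assumptions.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one.
    All objectives are assumed to be 'higher is better'."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_ranks(points):
    """Nondominated sorting: rank 0 is the Pareto front, rank 1 the front that
    remains after removing rank 0, and so on."""
    remaining = set(range(len(points)))
    ranks = [None] * len(points)
    rank = 0
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def rank_advantages(ranks):
    """Mean-centered advantage in which a lower (better) rank earns more credit."""
    scores = [-r for r in ranks]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Hypothetical rollouts for one prompt: (accuracy, -number_of_tool_calls).
rollouts = [(1.0, -2), (1.0, -5), (0.0, -1), (0.0, -6)]
ranks = pareto_ranks(rollouts)
advantages = rank_advantages(ranks)
print(ranks, advantages)
```

The correct two-call rollout and the incorrect one-call rollout share the first front because neither dominates the other, so a rank-based signal does not have to fix in advance how much accuracy is worth relative to a tool call.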

2606.16215 2026-06-16 cs.CL cs.AI cs.LG new submission

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

PACT: 多轮工具使用智能体的特权轨迹协同训练

Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

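Affiliations * Georgia Institute of Technology(佐治亚理工学院) Ohio State University(俄亥俄州立大学) University of Pennsylvania(宾夕法尼亚大学) Arizona State University(亚利桑那州立大学)

AI summary Proposes the PACT framework, in which privileged (expert) traces supply dense supervision at training time through a trace-conditioned RL objective and a component-aware SFT loss, avoiding any reliance on traces at inference time and significantly improving multi-turn tool-use agents.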

Comments Project page: https://zhenbangdu.github.io/pact-project-page/

AI Chinese abstract

多轮工具使用智能体必须在多个交互轮次中进行推理、调用工具并适应观察结果。对此类智能体进行后训练具有挑战性,因为强化学习通常面临稀疏奖励和弱信用分配问题(尽管匹配仅提示推理设置),而基于专家轨迹的监督微调提供密集过程监督,但可能过度约束模型到固定轨迹。为解决这一问题,我们提出PACT,一种用于多轮工具使用智能体的特权轨迹协同训练框架。关键思想是仅将专家轨迹作为训练时的优化信号,而非推理时的提示。PACT保持推理生成仅基于提示,然后通过两个互补信号利用专家轨迹指导优化:一个轨迹条件RL代理,在专家轨迹上下文中评估仅提示轨迹;一个组件感知SFT损失,以退火强度监督推理前缀和工具调用。为减少对训练时轨迹上下文的过度依赖,PACT进一步引入仅提示锚定。我们还提供了一个潜在轨迹视角,连接两个基于轨迹的目标,并解释专家轨迹如何在推理生成中不被使用的情况下指导优化。在FTRL、BFCL和ToolHop上的实验表明,PACT持续优于强SFT和RL基线,凸显了特权轨迹协同训练在多轮工具使用学习中的价值。

English abstract

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

2606.16285 2026-06-16 cs.CL cs.LG new submission

HiMPO: Hindsight-Informed Memory Policy Optimization for Less-Entangled Credit in Long-Horizon Agents

HiMPO:面向长周期智能体的后见知情记忆策略优化以减少纠缠信用分配

Jiangze Yan, Yi Shen, Wenjing Zhang, Jieyun Huang, Zhaoxiang Liu, Ning Wang, Kai Wang, Shiguo Lian

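Affiliations * Unicom Data Intelligence, China Unicom(联通数据智能有限公司,中国联通) Data Science & Artificial Intelligence Research Institute, China Unicom(中国联通数据科学与人工智能研究院)

AI summary Proposes the HiMPO framework, which estimates the local utility of a memory update by comparing task-relevant information before and after the update and uses hindsight relevance as a retrospective filter, reducing credit entanglement for memory-writing actions and improving long-horizon agents.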

Comments Preprint. 2 figures

AI Chinese abstract

长周期智能体依赖记忆机制压缩交互历史,但优化记忆写入面临独特的信用分配挑战:记忆更新可能因下游工具故障、噪声观测或推理错误而受到奖励或惩罚,而非其自身贡献。这种因果纠缠的信用可能导致智能体丢弃有用证据或保留无关信息。我们提出HiMPO,一种后见知情记忆策略优化框架,用于在长周期智能体中对记忆写入动作分配较少纠缠的信用。HiMPO首先通过比较在相同写前状态下从先前记忆和更新记忆中可恢复的任务相关信息,估计记忆更新的局部效用。然后,它使用后见相关性作为有界回顾性滤波器,当局部效用不受目标结果支持时,衰减记忆信用。由此产生的记忆特定优势仅应用于记忆令牌,而轨迹级奖励则优化智能体的其余行为。在基于裁判的开放领域任务和客观压缩记忆问答中,HiMPO在保持压缩上下文效率的同时,优于基于强记忆和基于强化学习的基线。受控干预进一步表明,HiMPO减少了工具诱导错误的责备泄漏,并提高了记忆更新的归因保真度。

English abstract

Long-horizon agents rely on memory mechanisms to compress interaction history, but optimizing memory writing faces a distinct credit assignment challenge: a memory update may be rewarded or penalized due to downstream tool failures, noisy observations, or reasoning errors rather than its own contribution. This causally entangled credit can lead agents to discard useful evidence or preserve irrelevant information. We propose HiMPO, a Hindsight-Informed Memory Policy Optimization framework for assigning less-entangled credit to memory-writing actions in long-horizon agents. HiMPO first estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and updated memories under the same pre-write state. It then uses hindsight relevance as a bounded retrospective filter that attenuates memory credit when local utility is not supported by the target outcome. The resulting memory-specific advantage is applied only to memory tokens, while trajectory-level rewards optimize the rest of the agent behavior. Across judge-based open-domain tasks and objective compressive-memory QA, HiMPO improves over strong memory-based and RL-based baselines while preserving compressed-context efficiency. Controlled interventions further show that HiMPO reduces blame leakage from tool-induced errors and improves attribution fidelity of memory updates.
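
The split between memory-token credit and trajectory-level credit described above can be pictured with a small sketch. The abstract gives no formulas, so everything here is an assumption: the multiplicative gate, the clamp to [0, 1], and the function and argument names are hypothetical stand-ins rather than HiMPO's definitions.

```python
def token_advantages(is_memory_token, trajectory_advantage, local_utility, hindsight_relevance):
    """Assign one advantage per token.

    Memory tokens receive a memory-specific advantage: the local utility of the
    update, attenuated by a bounded hindsight gate (assumed here to be a clamp
    to [0, 1]). All other tokens receive the trajectory-level advantage.
    """
    gate = min(1.0, max(0.0, hindsight_relevance))
    memory_advantage = gate * local_utility
    return [memory_advantage if is_mem else trajectory_advantage for is_mem in is_memory_token]


# Hypothetical episode: the trajectory failed because of a tool error (advantage -1.0),
# but the memory update itself preserved useful evidence (local utility 0.8).
mask = [False, True, True, False]
per_token = token_advantages(mask, trajectory_advantage=-1.0, local_utility=0.8, hindsight_relevance=0.5)
print(per_token)
```

Under these assumptions the memory tokens keep a positive signal even though the episode as a whole is penalized, which is one way to picture the reduced blame leakage from tool-induced errors that the abstract reports.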

2606.16428 2026-06-16 cs.CL cs.AI cs.HC new submission

LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

LectūraAgents:面向自适应个性化AI辅助学习与具身教学的多智能体框架

Jaward Sesay, Yue Yu, Siwei Dong, Yemin Shi, Guangyao Chen, Börje F. Karlsson

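Affiliations * Beijing Institute of Technology(北京理工大学) Peking University(北京大学) Cornell University(康奈尔大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI summary Proposes LectūraAgents, a multi-agent framework that delivers end-to-end personalized learning through a hierarchical architecture and an adaptive embodied teaching mechanism (such as handwriting and highlighting), and designs a teaching action-speech alignment algorithm to improve coherence, outperforming existing approaches across several course levels.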

AI Chinese abstract

有效的个性化AI辅助学习需要系统不仅能够生成准确的、针对学习者的教育材料,还能动态调整其教学方式以适应不同学习者。然而,现有的教育智能体主要关注讲座内容自动化和模拟,往往缺乏针对个体学习者的多模态和具身教学方法的建模。为此,我们提出LectūraAgents——一个多智能体框架,通过端到端的自适应具身教学实现个性化学习。其核心模拟了教授-学生关系,其中ProfessorAgent领导一个由专业下属智能体组成的协作团队,通过研究、规划、审查和具身交付适应学习者需求的讲座内容。该框架有三个主要贡献:(1)用于端到端个性化学习的层次化多智能体架构;(2)自适应具身教学机制,其中ProfessorAgent在教学环境中对内容执行可见且具有教学动机的教学动作(例如手写、高亮、下划线等);(3)教学动作-语音对齐(TASA)算法,该算法采用基于显著性的启发式和时序语义分割,生成与学习者档案对齐的连贯教学动作序列。我们在高中、本科和研究生级别的多样化课程上,使用基于样本特定量规的分析评估LectūraAgents;生成的讲座材料和教学动作由专家教育者评估和验证。实验结果显示,在讲座内容质量、具身教学质量、评估和个性化方面,LectūraAgents持续优于现有方法,使其成为大规模个性化学习的教学基础扎实的框架。

English abstract

Effective personalized AI-assisted learning demands systems that can not only generate accurate learner-specific educational materials, but also dynamically adapt their instruction to diverse learners. However, existing educational agents have primarily focused on lecture content automation and simulations, which often fall short of modelling multimodal and embodied instructional methods tailored for the individual learner. To this end, we propose LectūraAgents, a multi-agent framework that enables personalized learning through end-to-end adaptive embodied teaching. At its core, LectūraAgents mirrors a professor-student relationship, in which a ProfessorAgent leads a collaborative team of specialized subordinate agents through research, planning, review, and embodied delivery of lecture contents that adapt to a learner's needs. The framework offers three main contributions: (1) a hierarchical multi-agent architecture for end-to-end personalized learning; (2) an adaptive embodied teaching mechanism, wherein the ProfessorAgent executes visible and pedagogically motivated teaching actions (e.g., handwrite, highlight, underline) over contents in a teaching environment; and (3) a Teaching Action-Speech Alignment (TASA) algorithm that employs salience-based heuristics and temporal semantic segmentation to generate coherent teaching action sequences aligned with learner profiles. We evaluate LectūraAgents on diverse courses at high school, undergraduate, and graduate levels using sample-specific rubric-based analysis, with generated lecture materials and teaching actions assessed and validated by expert educators. Experimental results show consistent gains in lecture content quality, embodied teaching quality, assessment, and personalization over existing approaches, positioning LectūraAgents as a pedagogically well-grounded framework for personalized learning at scale.

2606.16432 2026-06-16 cs.CL cs.AI new submission

ACCORD: Action-Conditioned Contextual Grounding for Language Agents

ACCORD: 面向语言智能体的动作条件上下文接地

Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng

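Affiliations * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Stanford University(斯坦福大学)

AI summary User instructions are often underspecified because of implicit assumptions about the environment, which causes LLM agents to fail; ACCORD addresses this by actively probing for missing information and integrating trajectory context before each action, and, with no additional training, significantly improves task completion on AppWorld and AlfWorld.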

AI Chinese abstract

用户指令往往因人类对周围环境的隐含假设而欠指定。对于在信息丰富的数字和物理环境中运行的大型语言模型(LLM)智能体,这些假设无法仅从指令中推断;必须从工具、数据、接口和观察的当前状态中恢复。因此,有效执行要求智能体识别缺失的上下文,将其基于观察到的证据,并带入后续动作。我们表明,当前智能体常常未能做到这一点。它们基于假设而非观察到的细节行动,忽略本可收集的信息,并且未能整合已经返回的证据。基于这一洞察,我们提出ACCORD(动作条件上下文接地),一种简单有效的自适应接地智能体框架。在每次动作前,ACCORD主动探测环境中缺失的信息,并整合来自智能体轨迹中原本会被忽略的相关上下文。无需额外训练或任务成功信号,ACCORD在AppWorld上将任务目标完成率从42.0%提升至62.6%(GPT-5-mini),比强基线高出最多20.6个百分点。这些增益在更强的基模型(Claude-4.5-sonnet上+10.8)、开放权重模型(Qwen3.5-27B-FP8上+10.1)以及具身AlfWorld基准(GPT-5-mini上成功率+7.4)上持续存在。

English abstract

User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry it forward into subsequent actions. We show that current agents often fail to do so. They act from assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. Building on this insight, we propose ACCORD (Action-Conditioned Contextual Grounding), a simple and effective agent framework for adaptive grounding. Before each action, ACCORD actively probes the environment for missing information and integrates relevant context from the agent's trajectory that would otherwise be overlooked. Requiring no additional training or task-success signals, ACCORD improves task-goal completion on AppWorld by up to +20.6 points with GPT-5-mini, from 42.0% to 62.6%, compared to strong baselines. These gains persist with a substantially stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).

2606.16523 2026-06-16 cs.CL new submission

SkillWiki: A Living Knowledge Infrastructure for Agent Skills

SkillWiki: 一个用于智能体技能的活知识基础设施

Dingcheng Huang, Yuda Ding, Bingshuo Liu, Qingbin Liu, Xi Chen, Jiang Bian, Hongliang Sun, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui

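Affiliations * Harbin Institute of Technology(哈尔滨工业大学) Tencent(腾讯) Nanyang Technological University(南洋理工大学)

AI summary Proposes SkillWiki, a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills; by turning heterogeneous knowledge into reusable skill assets linked to their originating evidence, it covers the full lifecycle from knowledge ingestion through skill production, provenance-aware exploration, governance, and execution-driven evolution.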

AI Chinese abstract

虽然知识通过维基百科管理,软件通过GitHub管理,但智能体技能仍然缺乏大规模生产、治理和演化的基础设施。SkillWiki是一个活知识基础设施,通过将异构知识转化为可复用的技能资产并链接到其原始证据,支持智能体技能的组织、落地和持续演化。我们的演示展示了完整的技能生命周期,从知识摄入和技能生产到溯源感知的探索、治理和执行驱动的演化。SkillWiki突显了一个未来,其中知识、技能和执行经验在共享基础设施内共同演化。现场演示和源代码可在https://github.com/Huangdingcheng/SkillWiki公开获取。

English abstract

While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration presents the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki highlights a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at https://github.com/Huangdingcheng/SkillWiki.

2606.16603 2026-06-16 cs.CL cs.AI new submission

VeriGraph: Towards Verifiable Data-Analytic Agents

VeriGraph: 迈向可验证的数据分析智能体

Jiajie Jin, Zhao Yang, Wenle Liao, Yuyang Hu, Guanting Dong, Xiaoxi Li, Yutao Zhu, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 提出VeriGraph框架,通过构建显式异质证据有向无环图(DAG)实现数据分析智能体的可验证性,并设计基于图的策略优化提升正确性与可审计性。

Comments 10 pages

详情
AI中文摘要

基于LLM的智能体在数据密集型分析任务中展现出强大能力,但其输出很少是可验证的:对线性文本轨迹的依赖使其推理难以审计。特别是,对原始数据的确定性计算和对自然语言主张的语义推导常常纠缠在非结构化流中,导致数值结论难以复现,定性判断难以检查。为解决这一问题,我们提出VeriGraph,一个可追踪的神经符号推理框架,使智能体在执行过程中构建显式的异质证据有向无环图(DAG)。VeriGraph引入了三种证据扩展原语,即计算扩展、基础扩展和推导扩展,以在统一图中连接原始数据、解释器变量、计算结果和自然语言主张。在此公式下,结构可追溯性简化为从原始数据源到终端主张的图可达性,而语义支持通过主张级证据评估来衡量。为了改进图构建,我们进一步设计了一种基于图的策略优化策略,采用复合奖励联合监督答案正确性、计算完整性和推导连贯性。在四个基准上的实验表明,VeriGraph-8B在所有基线中取得了最高总分。更重要的是,VeriGraph生成了可审计的证据图,具有显著更强的主张基础,在我们的主张级证据支持评估下达到了87.61%的基础率。这些结果表明,显式证据图构建是实现可验证数据分析智能体的有前景的途径。我们的代码可在https://github.com/ignorejjj/VeriGraph获取。

英文摘要

LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving an 87.61% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at https://github.com/ignorejjj/VeriGraph.
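The abstract's statement that structural traceability reduces to graph reachability can be illustrated with a short sketch. This is not VeriGraph's implementation: the node names, the edge list, and the `is_traceable` helper are hypothetical, and the evidence graph is assumed to be a plain list of directed (parent, child) edges.

```python
from collections import deque


def is_traceable(edges, sources, claim):
    """Return True if `claim` is reachable from any raw-data source node."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, []).append(child)

    seen = set(sources)
    queue = deque(sources)
    while queue:  # breadth-first search from the raw-data sources
        node = queue.popleft()
        if node == claim:
            return True
        for child in adjacency.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False


# Hypothetical evidence graph: raw data -> variable -> computed result -> claim.
edges = [
    ("sales.csv", "df"),
    ("df", "mean_revenue"),
    ("mean_revenue", "claim: revenue grew"),
]
traceable = is_traceable(edges, ["sales.csv"], "claim: revenue grew")
untraceable = is_traceable(edges, ["sales.csv"], "claim: churn fell")
```

Under these assumptions, a claim with no path back to a raw data source is flagged as structurally untraceable, independent of whether its wording sounds plausible.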

2606.17016 2026-06-16 cs.CL cs.AI cs.LG cs.MA 新提交

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot: 面向LLM智能体的缓存高效上下文管理

Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学) Xi’an University of Electronic Science and Technology(西安电子科技大学) HomologyAI(同源人工智能)

AI总结 针对LLM智能体长会话中上下文累积导致推理成本高的问题,提出TokenPilot双粒度上下文管理框架,通过摄入感知压缩和生命周期感知驱逐策略,在保持有竞争力性能的同时降低56%-87%的成本。

Comments LightMem Series: Work in Progress

详情
AI中文摘要

随着LLM智能体被部署在长周期会话中,上下文累积推高了推理成本。现有方法利用文本修剪或动态内存驱逐来最小化token占用,但其无约束的序列突变改变了布局,引入前缀不匹配和缓存失效。这揭示了文本稀疏性与提示缓存连续性之间的关键权衡。为解决此问题,我们提出TokenPilot,一个双粒度上下文管理框架。全局上,摄入感知压缩作为框架工具,稳定提示前缀并在摄入门处消除开放世界环境噪声。局部上,生命周期感知驱逐监控上下文段的持续剩余效用,强制执行保守的批处理轮次调度,仅在任务相关性过期时卸载内容段。在PinchBench和Claw-Eval上的隔离和连续模式实验表明,TokenPilot在隔离模式下成本降低61%和56%,在连续模式下降低61%和87%,同时与先前系统相比保持竞争性能。TokenPilot已集成到LightMem2中,地址为https://github.com/zjunlp/LightMem2。

英文摘要

As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
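The trade-off between text sparsity and prompt cache continuity can be made concrete with a small sketch. It assumes, as a simplification, that a prompt cache can only reuse the exact leading prefix shared by the previous and the new request; the segment names and the `shared_prefix_len` helper are hypothetical and do not come from TokenPilot.

```python
def shared_prefix_len(old, new):
    """Count the leading segments that two prompts have in common."""
    count = 0
    for a, b in zip(old, new):
        if a != b:
            break
        count += 1
    return count


# Hypothetical context layout, one string per context segment.
cached = ["sys", "tool_a", "obs_1", "obs_2", "obs_3"]

# Appending keeps the whole cached prefix reusable.
appended = cached + ["obs_4"]

# Evicting a middle segment shifts everything after it.
pruned_middle = ["sys", "tool_a", "obs_2", "obs_3", "obs_4"]

reuse_append = shared_prefix_len(cached, appended)
reuse_pruned = shared_prefix_len(cached, pruned_middle)
```

In this toy layout the pruned prompt is shorter, yet only its first two segments still match the cached prefix, which is why unconstrained mid-sequence edits can cost more than they save.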

2606.17029 2026-06-16 cs.CL 新提交

DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

DEEPRUBRIC: 基于证据树规则监督的高效深度研究智能体强化学习

Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He

发表机构 * Shandong University(山东大学) Zhongguancun Academy(中关村学院) Fudan University(复旦大学)

AI总结 提出DeepRubric框架,通过构建证据树生成查询-规则对,确保奖励信号准确评估查询所需信息,以约1/13的RL GPU时间达到与先前开源最优模型相当的性能。

详情
AI中文摘要

深度研究智能体通过搜索和推理检索到的证据来综合长篇报告。基于规则的奖励强化学习通过优化智能体以符合可检查的标准(这些标准将报告质量转化为奖励信号)来改进这些智能体,但其效率取决于这些标准是否可靠地捕捉任务范围和证据需求。大多数现有研究要求LLM为给定查询生成规则,但当模型无法推断潜在信息需求时,生成的规则可能不完整,从而降低RL效率。为了获得更可靠的查询-规则监督,我们引入了DeepRubric,一个反向这一过程的数据构建框架:它首先确定基于证据的报告应该评估什么,然后从这些评估目标中合成对齐的查询-规则对,而不是为给定查询推断评估标准。从采样的种子主题开始,DeepRubric通过递归扩展有证据支持的子问题构建证据树,其叶子节点作为原子且可验证的评估目标。然后,它使用证据树合成训练查询和规则,确保奖励准确评估查询所请求的信息。使用DeepRubric,我们构建了9K个查询-规则监督示例,并使用基于规则的GRPO训练了DeepRubric-8B,在三个基准测试中实现了与先前开源最先进深度研究模型相当的性能,而RL GPU时间减少了约13倍。

英文摘要

Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query--rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query--rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query--rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.

2606.15033 2026-06-16 cs.HC cs.CL cs.CY 交叉投稿

Cloze: An Open Research Platform for Studying Human-AI Conversations in Mental Health Contexts

Cloze:一个用于研究心理健康背景下人机对话的开放研究平台

Matthew Flathers, Francesco Cipriani, John Torous

发表机构 * Beth Israel Deaconess Medical Center(贝斯以色列女执事医疗中心) University College London(伦敦大学学院) Division of Digital Psychiatry(数字精神病学部)

AI总结 提出开源平台Cloze,支持在心理健康研究中控制、监控人机对话,统一配置模型、指令、安全约束并记录完整溯源,为建立人机交互证据基础提供研究基础设施。

Comments 7 pages, 2 figures. Cloze is released under AGPL-3.0

详情
AI中文摘要

Cloze是一个开源网络平台,用于在心理健康研究背景下进行受控、受监测的人机对话研究。消费级大语言模型(LLM)产品如ChatGPT、Claude和Gemini是为个人生产力而构建的,为研究人员提供的实验控制很少,数据导出不一致,并且没有跨提供商通用的安全框架。Cloze为研究团队提供了一个单一环境,团队可在其中配置参与者与哪些模型对话、AI如何被指示、对话如何随时间安排以及哪些安全约束无条件适用,同时每条消息都带有完整的溯源信息(模型版本、提示配置、时间)。该平台目前支持OpenAI、Anthropic、Google,以及通过Ollama提供服务的本地托管开放权重模型,这些模型均置于统一接口之后;平台可在云端或完全本地运行,使参与者数据无需离开机构。Cloze是用于在心理健康背景下建立人机交互证据基础的研究基础设施。它不是治疗产品。

英文摘要

Cloze is an open-source web platform for conducting controlled, monitored studies of human-AI conversation in mental health research contexts. Consumer large language model (LLM) products such as ChatGPT, Claude, and Gemini are built for individual productivity, and offer researchers little experimental control, inconsistent data export, and no shared safety scaffolding that holds across providers. Cloze gives research teams a single environment in which they configure which models participants converse with, how the AI is instructed, how conversations are scheduled over time, and which safety constraints apply unconditionally, while every message is captured with full provenance (model version, prompt configuration, timing). The platform currently supports OpenAI, Anthropic, Google, and locally hosted open-weight models served through Ollama behind a unified interface, and runs in the cloud or fully on premises so that participant data need never leave an institution. Cloze is research infrastructure for building an evidence base on human-AI interaction in mental health contexts. It is not a therapeutic product.

2606.15367 2026-06-16 cs.AI cs.CL cs.IR cs.LG 交叉投稿

S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents

S1-DeepResearch:超越搜索,迈向真实世界的长周期研究智能体

Yao Dong, Xinglin Xiao, Liwei Dong, Xinlong Jin, Zhengbo Li, Heng Zhang, Duyun Wang, Nan Xu

发表机构 * XScience Lab(XScience实验室) Wenge AI(问格人工智能)

AI总结 提出统一轨迹构建范式,结合封闭式问答与开放式探索,通过图基任务构建、智能体轨迹生成和多维验证,合成高质量长链推理轨迹,训练出在20个基准上达到开源最优的32B模型。

详情
AI中文摘要

深度研究智能体旨在通过长周期规划、证据收集、推理和报告生成来解决复杂的知识密集型任务。尽管搜索智能体近期在信息检索和答案验证方面展现出强大能力,但现有训练数据集大多以搜索为中心,主要关注封闭式问答和信息定位。因此,它们主要训练信息寻求行为,而对关键深度研究能力(包括证据整合、知识综合、规划、文件理解和结构化报告生成)的覆盖有限。在这项工作中,我们提出了一种用于深度研究智能体的统一轨迹构建范式,该范式结合了封闭式问答和开放式探索。所提出的框架包括图基任务构建、智能体轨迹展开和多维轨迹验证,能够可扩展地合成涵盖长链复杂推理、深度研究指令遵循、报告撰写、文件理解与生成以及技能使用的高质量智能体轨迹。与现有的面向搜索的数据集相比,我们合成的轨迹更强调知识综合、复杂推理和规划。S1-DeepResearch-32B在跨越五个能力维度(包括复杂推理、指令遵循、报告生成、文件理解和技能使用)的20个基准测试中,达到了同等规模开源模型的最先进性能。在几个具有挑战性的深度研究基准上,它接近领先的专有前沿模型的性能。这些结果强调了联合建模信息获取、知识综合和面向规划的智能体行为对于构建有效深度研究智能体的重要性。

英文摘要

Deep research agents aim to solve complex knowledge-intensive tasks through long-horizon planning, evidence gathering, reasoning, and report generation. While recent progress in search agents has demonstrated strong capabilities in information retrieval and answer verification, most existing training datasets remain search-centric, focusing primarily on closed-ended question answering and information localization. As a result, they mainly train information-seeking behavior while providing limited coverage of key deep research capabilities, including evidence integration, knowledge synthesis, planning, file understanding, and structured report generation. In this work, we propose a unified trajectory construction paradigm for deep research agents that combines closed-ended QA and open-ended exploration. The proposed framework consists of graph-grounded task formulation, agentic trajectory rollout, and multi-dimensional trajectory verification, enabling scalable synthesis of high-quality agentic trajectories spanning long-chain complex reasoning, deep research instruction following, report writing, file understanding and generation, and skills usage. Compared with existing search-oriented datasets, our synthesized trajectories place greater emphasis on knowledge synthesis, complex reasoning, and planning. S1-DeepResearch-32B achieves state-of-the-art performance among open-source models of comparable scale across 20 benchmarks spanning five capability dimensions, including complex reasoning, instruction following, report generation, file understanding, and skills usage. On several challenging deep research benchmarks, it approaches the performance of leading proprietary frontier models. These results highlight the importance of jointly modeling information acquisition, knowledge synthesis, and planning-oriented agent behaviors for building effective deep research agents.

2606.16307 2026-06-16 cs.AI cs.CL 交叉投稿

State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs

面向工具增强型大语言模型的基于状态的多智能体合成数据生成

Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra

发表机构 * PayPal AI

AI总结 提出StateGen平台,通过四角色LLM循环和状态管理器生成多轮、工具接地的高质量训练对话,消除工具调用幻觉,支持层次化多智能体设置。

Comments 9 pages, 5 figures, 6 tables, 1 algorithm

详情
AI中文摘要

训练工具增强型LLM代理需要大量多轮、工具接地的对话数据,这些数据标注成本高、生产环境中受隐私限制,且公共数据集中基本缺失。我们提出StateGen,一个合成数据生成平台,通过编排四角色LLM循环(角色条件用户模拟器、被测代理、状态接地工具模拟器和多轴LLM评判器)生成带有评分和丰富推理轨迹的训练对话。关键架构贡献是一个权威状态管理器,它在多轮对话中维护一个结构化的世界状态对象,强制执行“后端即事实”的不变量,从而从结构上消除了最主要的工具调用幻觉类别。StateGen通过将子代理声明为工具(所有子代理共享一个状态对象)自然地扩展到层次化多智能体设置。我们在三个生产语料库上报告了64,698个评估对话的结果:工具调用幻觉得分达到9.66/10,系统通过23维特征向量支持角色驱动变化,并且干净分离的训练集和黄金评估集划分确认数据不是记忆诱饵(逐项标准的差距分析)。与八个外部系统的比较表明,没有单一公开平台同时具备多轮生成、状态接地工具模拟、层次化多智能体支持和内置评判器评分功能。

英文摘要

Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns, enforcing a backend-is-truth invariant that eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora: tool-call hallucination scores reach 9.66/10, the system supports persona-driven variation via a 23-dimensional trait vector, and a cleanly separated train and golden evaluation set split confirms the data is not memorization bait (per-criterion gap analysis). Comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.
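The backend-is-truth invariant can be sketched as a tool simulator that answers every call from one shared state object instead of generating free text. This is an illustrative reading of the abstract only: the `StateManager` class, the tool names, and the `balance` field are hypothetical and are not taken from StateGen.

```python
class StateManager:
    """Single source of truth for one simulated conversation."""

    def __init__(self, initial):
        self._state = dict(initial)

    def get(self, key):
        return self._state[key]

    def apply(self, key, value):
        self._state[key] = value


def simulate_tool(manager, tool, args):
    """Answer tool calls only from the shared state, never from free text."""
    if tool == "get_balance":
        return {"balance": manager.get("balance")}
    if tool == "refund":
        manager.apply("balance", manager.get("balance") + args["amount"])
        return {"balance": manager.get("balance")}
    raise ValueError(f"unknown tool: {tool}")


manager = StateManager({"balance": 100})
first = simulate_tool(manager, "get_balance", {})
second = simulate_tool(manager, "refund", {"amount": 25})
third = simulate_tool(manager, "get_balance", {})
```

Because reads and writes both go through the same object, a later read cannot contradict an earlier write, which is the kind of inconsistency the abstract describes as eliminated by construction.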

2606.16774 2026-06-16 cs.AI cs.CL 交叉投稿

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

OpenClaw-Skill:面向智能体大语言模型的集体技能树搜索

Tianyi Lin, Chuanyu Sun, Jingyi Zhang, Changxu Wei, Huanjin Yao, Shunyu Liu, Xikun Zhang, Liu Liu, Jiaxing Huang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Nanyang Technological University(南洋理工大学) Tsinghua University(清华大学) Royal Melbourne Institute of Technology(皇家墨尔本理工大学) Beijing University of Aeronautics and Astronautics(北京航空航天大学)

AI总结 提出集体技能树搜索(CSTS)框架,通过集体智能生成和评估技能节点,构建结构化、多样且可泛化的技能树,并引入集体技能强化学习,提升大语言模型在工具使用、多步推理和动态环境交互中的智能体能力。

Comments 13 pages, 2 figures

详情
AI中文摘要

为大型语言模型(LLM)智能体配备有效技能对于解决OpenClaw等现实世界系统中的复杂任务至关重要。在这项工作中,我们旨在开发一个自动构建此类可重用技能的框架,以增强LLM在工具使用、多步推理和动态环境交互方面的能力。为此,我们提出了集体技能树搜索(CSTS),一种新颖的基于树搜索的技能构建框架,用于构建结构化、多样且可泛化的技能树。CSTS的核心思想是利用集体智能,通过两个迭代阶段共同搜索、识别和组合有效技能:集体技能节点生成(CSN-Gen)和集体技能节点评估(CSN-Assess)。CSN-Gen利用来自多个模型的集体知识,为每个子任务探索多样化的候选技能,实现全面的技能探索。CSN-Assess使用多个模型作为评判者,通过两种评分机制评估和选择技能节点:(1)集体质量评分,聚合独立评估以产生技能有效性的稳健估计;(2)集体可迁移性评分,明确验证技能是否在不同模型间良好泛化。通过CSTS,我们构建了一套全面的技能树以及技能增强的训练数据,使模型能够有效学习和利用技能。此外,我们引入了集体技能强化学习,主动从技能树中选择多个相关技能,以拓宽解空间探索,避免陷入单一技能及其导致的同质或次优解。最终,我们训练的模型OpenClaw-Skill在长期规划、工具使用和跨挑战性基准的泛化方面展现出卓越的智能体能力。

英文摘要

Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that builds structured, diverse, and generalizable skill trees. The core idea of CSTS is to leverage collective intelligence to jointly search, identify, and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a comprehensive set of skill trees along with skill-augmented training data, enabling models to effectively learn and utilize skills. In addition, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use, and generalization over challenging benchmarks.

2602.00887 2026-06-16 cs.CL cs.AI cs.LG 版本更新

EffGen: Enabling Small Language Models as Capable Autonomous Agents

EffGen: 使小型语言模型成为能干的自主智能体

Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang

发表机构 * Department of Computer Science, Virginia Tech, Blacksburg, VA, USA(弗吉尼亚理工大学计算机科学系) Georgia Institute of Technology, Atlanta, GA, USA(佐治亚理工学院) Google DeepMind, USA(谷歌DeepMind)

AI总结 EffGen是一个针对小型语言模型优化的开源智能体框架,通过提示压缩、任务分解、复杂度路由和统一记忆系统,实现高效、安全的本地部署,在13个基准测试中优于LangChain等框架。

Comments Accepted to ICML 2026 Conference

详情
AI中文摘要

目前大多数基于语言模型的智能体系统都是通过API调用为大型语言模型(如GPT、Claude、Gemini)构建和优化的;虽然强大,但这种方法面临高令牌成本和敏感应用中的隐私问题等限制。我们提出了EffGen,一个针对小型语言模型优化的开源智能体框架,能够实现有效、高效且安全的本地部署。EffGen有四大贡献:(1)增强的工具调用与提示优化,可将输入提示压缩高达70-80%(在我们的基准测试中平均压缩57%),同时保留任务语义;(2)智能任务分解,根据依赖关系将复杂查询分解为并行或顺序子任务;(3)基于复杂度的路由,利用五个因素做出智能的执行前决策;(4)统一记忆系统,结合短期、长期和基于向量的存储。此外,EffGen统一了多种智能体协议(MCP、A2A、ACP)以实现跨协议通信。在13个基准测试上的结果表明,EffGen在成功率、执行速度和内存占用方面优于LangChain、AutoGen和Smolagents。我们的结果揭示,提示优化和复杂度路由具有互补的缩放行为:优化对小型语言模型更有利(1.5B模型提升11.2%,而32B模型提升2.4%),而路由对大型模型更有利(1.5B模型提升3.6%,而32B模型提升7.9%),两者结合在所有规模上都能带来一致的增益。EffGen在Apache 2.0许可证下发布,确保研究和商业用途的广泛可访问性,代码可在 https://github.com/ctrl-gaurav/effGen 获取,Python包位于 https://pypi.org/project/effgen/ (pip install effgen),项目网站和文档位于 https://effgen.org/ 和 https://docs.effgen.org/ 。

英文摘要

Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls; while powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce EffGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment. EffGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses input prompts by up to 70-80% (and 57% on average across our benchmarks) while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, EffGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show EffGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. EffGen is released under the Apache 2.0 License, ensuring broad accessibility for research and commercial use, with the code available at https://github.com/ctrl-gaurav/effGen, the Python package at https://pypi.org/project/effgen/ (pip install effgen), and the project website and documentation at https://effgen.org/ and https://docs.effgen.org/.
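Contribution (2), splitting a query into parallel or sequential subtasks based on dependencies, can be sketched as layering a dependency graph. This is a generic topological-layering sketch, not EffGen's API: the `execution_stages` function and the subtask names are hypothetical, and every dependency is assumed to appear as a key in the input mapping.

```python
def execution_stages(dependencies):
    """Group subtasks into ordered stages; tasks in one stage can run in parallel."""
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    stages = []
    done = set()
    while remaining:
        # A task is ready once all of its dependencies have completed.
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic dependency among subtasks")
        stages.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return stages


# Hypothetical decomposition of one complex query.
deps = {
    "fetch_prices": [],
    "fetch_news": [],
    "summarize": ["fetch_prices", "fetch_news"],
    "report": ["summarize"],
}
stages = execution_stages(deps)
```

Here the two fetch subtasks share a stage because neither depends on the other, while `summarize` and `report` must run afterwards in sequence.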

2605.18401 2026-06-16 cs.CL cs.AI 版本更新

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

SkillsVote: 从收集、推荐到进化的代理技能生命周期治理

Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, Feiyu Xiong, Yuyu Luo, Zhiyu Li

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Soochow University(苏州大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出SkillsVote框架,对代理技能进行从收集、推荐、归因到进化的生命周期治理,提升代理在Terminal-Bench 2.0和SWE-Bench Pro上的性能。

Comments 71 pages, 12 figures, 13 tables

详情
AI中文摘要

长周期LLM代理生成的轨迹可能成为可重用的经验,但原始轨迹噪声大、局部性强且难以治理。代理技能(Agent Skills)提供了一种结构化产物,用于组合流程性指导、可执行资源和适用边界。然而,开放技能生态系统包含冗余、质量参差、对环境敏感的产物,不加区分的更新会污染未来的上下文。我们提出SkillsVote,一个覆盖收集、推荐、归因和进化的代理技能生命周期治理框架。SkillsVote对百万级开源语料库进行环境需求、质量和可验证性画像,并为可验证的技能合成任务。执行前,它在结构化技能文件夹上进行代理式的库搜索,以暴露指导性上下文。执行后,它将轨迹分解为与技能关联的子任务,将结果归因于技能引导的执行、代理探索、环境和结果信号,并且只允许成功且可重用的发现进入证据门控的更新。在Terminal-Bench 2.0和SWE-Bench Pro上的实验表明,SkillsVote提升了代理在具有挑战性的代理编码基准上的性能。这些增益来自两条互补路径:测试时在任务流上的在线进化,以及通过冻结技能库进行的离线迁移,这些技能库由历史轨迹或精选的开源技能构建。

英文摘要

Long-horizon LLM agents generate traces that could become reusable experience, but raw trajectories are noisy, local, and hard to govern. Agent Skills offer a structured artifact for combining procedural guidance, executable resources, and applicability boundaries. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills across collection, recommendation, attribution, and evolution. SkillsVote profiles a million-scale open source corpus for environment requirements, quality, and verifiability, and synthesizes tasks for verifiable skills. Before execution, it performs agentic library search over structured skill folders to expose instructional context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill-guided execution, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. Experiments on Terminal-Bench 2.0 and SWE-Bench Pro show that SkillsVote improves agent performance on challenging agentic coding benchmarks. The gains arise from two complementary pathways: online evolution over task streams at test time and offline transfer via frozen libraries built from either historical trajectories or curated open source skills.

2605.21027 2026-06-16 cs.CL cs.AI 版本更新

Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs

超越文本到SQL:一个面向受控企业分析API的代理LLM系统

Gundeep Singh, Parsa Kavehzadeh, Jing Xia, Xue-Yong Fu, Julien Bouvier Tremblay, Md Tahmid Rahman Laskar, Vincent Lum, Shashi Bhushan TN

发表机构 * Dialpad Inc.(Dialpad公司)

AI总结 本文提出Analytic Agent,一个基于LLM的代理系统,能够将自然语言意图安全地转换为与企业分析API的交互,解决传统文本到SQL系统在企业环境中面临的可靠性与合规性问题。

Comments Accepted to the Enterprise AI Agents Workshop @ KDD 2026. The first four authors contributed equally to this work

详情
AI中文摘要

企业分析旨在使组织数据对决策制定可及,但非技术用户在使用传统商业智能工具或文本到SQL系统时仍面临障碍。尽管基于大型语言模型(LLM)的最新文本到SQL方法承诺通过自然语言访问结构化数据,但在企业环境中,分析流水线依赖受控的API而非原始数据库。实际上,这些API封装了复杂的业务逻辑以确保一致性、可审计性和安全性。然而,将数学或聚合逻辑委托给LLM会引入可靠性和合规性风险。为此,我们提出了Analytic Agent,一个基于LLM的代理系统,将自然语言意图转换为与企业分析API的安全交互。在90个由领域专家构建的真实企业使用案例上进行评估,它能够可靠地解释用户目标,验证权限,执行受控查询,并通过多步骤推理和政策感知编排生成合规的可视化结果。

英文摘要

Enterprise analytics aims to make organizational data accessible for decision-making, yet non-technical users still face barriers when using traditional business intelligence tools or Text-to-SQL systems. While recent Text-to-SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi-step reasoning and policy-aware orchestration.

2606.08867 2026-06-16 cs.CL 版本更新

Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

构建面向1亿用户规模的客户支持AI代理:一种评估驱动的框架

Aman Gupta, Kevin Rossell, Edesio Alcobaça, Jose Chrystian Lima Pacheco, Carolina Baptista de Lima, Shao Tang, Luiz Paulo Rabachini, Luis Moneda, Herbert Fei, Daniel Silva, Rohan Ramanath

发表机构 * Nubank

AI总结 提出一个统一框架,通过评估驱动开发、上下文工程、人工循环提示迭代和LLM评判一致性优化,在Nubank的100M+用户规模下实现客户支持AI代理的离线开发与在线效果桥接,并在五个生产部署中验证了离线指标与在线结果的高度相关性。

Comments 12 pages. Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining)

详情
AI中文摘要

LLM能力的快速提升使得AI代理在广泛任务中越来越可行。其中最有前景的应用之一是构建生产就绪的面向客户代理,这一挑战需要在评估方法论、上下文工程、训练和在线测量方面协调卓越。然而,这些关键支柱通常是孤立开发的,导致只有在部署后才会暴露的盲点。在本文中,我们提出了一个统一框架,将离线开发与在线影响桥接起来,应用于Nubank(一家拥有1亿+用户的公司)的客户支持AI代理。我们的方法整合了几个关键组件:(1) 针对客户支持代理定制的结构化上下文工程,(2) 系统化的人工在环提示迭代,(3) 具有测量评估者间一致性和GEPA优化一致性的严格LLM评判评估,以及(4) 从构思到生产的验证。一个核心见解是评估管道质量直接决定迭代速度。我们展示了跨越不同领域的五个生产部署的结果:卡片递送、债务管理、信用额度支持、卡片管理和产品解释。这些部署在显著加速迭代的同时,带来了持续的客户满意度提升。在我们的卡片递送部署中,大规模A/B测试显示,与之前的代理变体相比,AI交易净推荐值提高了37个百分点,自助服务率提高了29个百分点,同时离线模拟指标与在线结果之间存在强相关性,表明评估驱动开发可靠地预测了生产影响。在大多数用例中,AI满意度达到了与专家人类代理相差几个百分点的水平。

英文摘要

The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.

2606.11520 2026-06-16 cs.CL cs.AI cs.LG 版本更新

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE:一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

发表机构 * University of Electronic Science and Technology of China(电子科技大学) SenseTime Research(字节跳动研究院)

AI总结 提出ISE三阶段范式,通过结构化意图构建、角色锁定用户模拟和真实执行环境,生成多轮代理轨迹,微调后显著提升代理工具使用性能。

Comments 13 pages, 6 figures. Dataset and code: https://github.com/Valiere01/ISE-Trace

详情
AI中文摘要

训练有能力的操作系统代理需要同时捕获结构化用户意图、多轮任务委派和基于工具执行的数据——这些属性在现有数据集中缺失。我们提出ISE(意图->模拟->执行),一种三阶段合成范式,联合解决这些差距。阶段1通过4D框架(人物角色x领域x任务x复杂度)构建约50000个结构化意图;去重后池中包含43956个唯一意图,并在mpnet-base-v2嵌入(余弦核,q=1)上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互,将每轮用户交互基于实际执行结果,生成23132条完整轨迹,平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用,生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后,使用Qwen3-8B在标准协议下的代理工具使用任务中,ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验表明多轮模拟带来了大部分性能提升。我们在 https://github.com/Valiere01/ISE-Trace 发布所有源代码和数据集。

英文摘要

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and dataset at https://github.com/Valiere01/ISE-Trace.

2602.05060 2026-06-16 cs.LG cs.CL 版本更新

StagePilot: Stage-Level Planning for Long-Horizon Dialogue Simulation in Cybergrooming

StagePilot: 网络诱骗中长程对话模拟的阶段级规划

Heajun An, Qi Zhang, Minqian Liu, Xinyi Zhang, Sang Won Lee, Lifu Huang, Pamela J. Wisniewski, Jin-Hee Cho

发表机构 * Virginia Tech(弗吉尼亚理工大学) University of California, Davis(加州大学戴维斯分校) International Computer Science Institute(国际计算机科学研究所)

AI总结 提出StagePilot框架,通过分离阶段级规划与响应生成,结合强化学习学习阶段策略,实现网络诱骗对话的结构化、连贯模拟,相比基线减少对话停滞,IQL+AWAC变体最终阶段到达率提升43%。

Comments Accepted at the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2026)

详情
AI中文摘要

网络诱骗是对青少年的一种不断演变的威胁,需要主动的教育干预。我们通过将对话进展建模为阶段式交互上的结构化规划问题来解决这一问题。我们提出StagePilot,一个将阶段级规划与响应生成分离的对话框架,其中模型在受约束的转换下选择下一阶段,并基于该阶段生成响应,从而实现连贯且逼真的进展。使用强化学习从离线数据中学习阶段级策略,优化情感对齐和目标一致进展。我们的实证实验表明,与基线相比,StagePilot生成更结构化、更连贯的对话轨迹,并减少对话停滞;值得注意的是,IQL+AWAC变体更频繁地到达最终阶段,同时保持超过70%的正面或中性响应,实现了43%的相对改进。

英文摘要

Cybergrooming is an evolving threat to youth, requiring proactive educational interventions. We address this by modeling dialogue progression as a structured planning problem over stage-wise interactions. We propose StagePilot, a dialogue framework that separates stage-level planning from response generation, in which the model selects the next stage under constrained transitions and generates responses conditioned on it, enabling coherent and realistic progression. Reinforcement learning is used to learn stage-level policies from offline data, optimizing for both emotional alignment and goal-consistent progression. Our empirical experiments show that StagePilot generates more structured, coherent dialogue trajectories and reduces conversational stagnation compared to baselines; notably, the IQL+AWAC variant reaches the final stage more often while maintaining over 70% positive or neutral responses, yielding a 43% relative improvement.
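The separation of stage-level planning from response generation relies on choosing the next stage only among permitted transitions. The sketch below shows one way such a constraint could be applied to policy scores; the abstract stage labels, the `ALLOWED` table, and the `next_stage` helper are hypothetical, and the scores stand in for whatever the learned stage-level policy outputs.

```python
# Hypothetical transition table: each stage may repeat or advance by one step.
ALLOWED = {
    "s1": {"s1", "s2"},
    "s2": {"s2", "s3"},
    "s3": {"s3"},
}


def next_stage(current, scores):
    """Pick the highest-scoring stage among those reachable from `current`."""
    candidates = {s: v for s, v in scores.items() if s in ALLOWED[current]}
    if not candidates:
        raise ValueError(f"no allowed transition scored for stage {current}")
    return max(candidates, key=candidates.get)


# The policy prefers s3 overall, but s3 is not reachable from s1.
choice = next_stage("s1", {"s1": 0.2, "s2": 0.5, "s3": 0.9})
```

Masking before selection means the planner cannot skip stages, and the response generator would then be conditioned on the chosen stage label.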

2605.01101 2026-06-16 cs.AI cs.CL cs.SD eess.AS 版本更新

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

虚拟言语治疗师:一种临床医生参与的AI言语治疗代理,用于个性化和监督式治疗

Shakeel Sheikh, Patrick Marmaroli, MD Sahidullah, Slim Ouni, Fabrice Hirsch, Goncalo Leal, Bjorn W Schuller

发表机构 * The Kashmir Hub for Artificial Intelligence(克什米尔人工智能中心) Microsoft / Vocametrix(微软 / Vocametrix) IAI, TCG CREST(IAI,TCG CREST) Université de Lorraine, CNRS, Inria, LORIA(洛林大学,CNRS,Inria,LORIA) Laboratoire Praxiling, UMR5267, CNRS et Université Paul-Valéry Montpellier 3(Praxiling实验室,UMR5267,CNRS及蒙彼利埃Paul-Valéry大学) Speechcare iStutter, Portuguese Catholic University(Speechcare iStutter,葡萄牙天主教大学) CHI – Chair of Health Informatics, TUM University Hospital(健康信息学系,TUM大学医院) GLAM – Group on Language, Audio, & Music, Imperial College London(语言、音频与音乐小组,伦敦帝国理工学院)

AI总结 提出虚拟言语治疗师(VST)平台,集成深度学习口吃分类与多智能体大语言模型推理,自动生成个性化治疗方案,并通过临床医生反馈优化,实验证明其高质量推荐。

Comments Under Review

详情
AI中文摘要

本文开发了虚拟言语治疗师(VST),这是一个基于智能体的平台,通过自动化和自适应的AI驱动工作流程,简化口吃评估并提供定制化的治疗计划。VST集成了最先进的基于深度学习的口吃分类和多智能体大语言模型(LLM)推理,以支持循证临床决策。VST首先获取并提取患者语音样本的特征,然后对口吃类型进行稳健分类。基于这些输出,VST启动一个智能体推理过程,其中专门的LLM智能体自主生成、批评并迭代优化个性化治疗计划。一个专门的批评智能体评估所有生成的治疗计划,以确保临床安全性、方法学合理性,并与同行评审的证据和既定专业指南保持一致。最终输出是一个全面的、针对患者的治疗草案,供临床医生审查。系统结合临床医生的反馈,生成最终的治疗计划,适用于患者交付,从而保持临床医生参与的范式。由专家言语治疗师进行的实验评估证实,VST持续生成高质量、基于证据的治疗建议。这些发现表明该系统具有增强临床工作流程、减轻临床医生负担并改善言语障碍患者治疗效果的潜力。所提出系统的交互式用户界面可在以下网址在线获取:https://vocametrix.com/ai/stuttering-therapy-planning-agent ,支持实时口吃评估和个性化治疗计划。

英文摘要

This paper develops the Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at https://vocametrix.com/ai/stuttering-therapy-planning-agent, facilitating real-time stuttering assessment and personalized therapy planning.

2605.05855 2026-06-16 cs.IR cs.CL 版本更新

Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling

桥接被动与主动:通过主动表达建模增强对话启动推荐

Yiqing Wu, Haoming Li, Guanyu Jiang, Jiahao Liang, Yongchun Zhu, Jingwu Chen, Feng Zhang

发表机构 * Bytedance Beijing China(字节跳动北京中国)

AI总结 针对LLM驱动的对话搜索中被动推荐陷入回声室的问题,提出PA-Bridge框架,通过对抗分布对齐器桥接被动推荐与主动表达之间的分布差异,并引入语义离散化器实现流行度去偏,在线实验显著提升特征渗透率和用户活跃天数。

Comments Accepted by SIGIR 2026

详情
AI中文摘要

大型语言模型(LLM)驱动的对话搜索正在将信息检索从被动关键词匹配转变为主动、开放式的对话。在此背景下,对话启动器被广泛部署,以提供个性化查询推荐,帮助用户发起对话。传统上,推荐这些启动器依赖于一个封闭的“曝光-点击”循环。然而,这种反馈循环机制使系统陷入回声室,加上数据稀疏性,无法捕捉由开放世界塑造的对话搜索意图的动态特性。结果,系统偏向于流行但通用的建议。在这项工作中,我们揭示了一个未被利用的范式转变,以打破这种有害的反馈循环:通过用户的主动表达来利用用户的“自由意志”。与传统推荐不同,对话搜索使用户能够通过手动输入查询完全绕过菜单。主动查询中的开放世界意图是打破这一循环的关键。然而,整合它们并非易事:(1)主动查询与制定的启动器之间存在固有的分布偏移。(2)此外,开放文本的“非ID化”特性使得传统的基于项目的流行度统计在大规模工业流式训练中无效。为此,我们提出了被动-主动桥接(PA-Bridge),一种新颖的框架,采用对抗分布对齐器来桥接被动推荐的启动器与主动表达之间的分布差距。此外,我们引入了一个语义离散化器,以实现流行度去偏算法的部署。在我们平台上的在线A/B测试表明,PA-Bridge显著提升了特征渗透率0.54%和用户活跃天数0.04%。

英文摘要

Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed "exposure-click" loop. Yet, this feedback loop mechanism traps the system in an echo chamber where, compounded by data sparsity, it fails to capture the dynamic nature of conversational search intents shaped by the open world. As a result, the system skews towards popular but generic suggestions. In this work, we uncover an untapped paradigm shift to shatter this harmful feedback loop: harnessing user "free will" through active user expressions. Unlike traditional recommendations, conversational search empowers users to bypass menus entirely through manually typed queries. The open-world intents in active queries hold the key to breaking this loop. However, incorporating them is non-trivial: (1) there exists an inherent distribution shift between active queries and formulated starters; (2) the "non-ID-able" nature of open text renders traditional item-based popularity statistics ineffective for large-scale industrial streaming training. To this end, we propose Passive-Active Bridge (PA-Bridge), a novel framework that employs an adversarial distribution aligner to bridge the distributional gap between passively recommended starters and active expressions. Moreover, we introduce a semantic discretizer to enable the deployment of popularity debiasing algorithms. Online A/B tests on our platform demonstrate that PA-Bridge significantly boosts the Feature Penetration Rate by 0.54% and User Active Days by 0.04%.

2605.29796 2026-06-16 cs.AI cs.CL cs.LG 版本更新

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

SAAS:面向智能体搜索中过度搜索缓解的自我感知强化学习

Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) School of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

AI总结 提出SAAS强化学习框架,通过搜索边界建模、边界感知奖励和分阶段优化策略,使LLM智能体具备动态自我感知能力,在不降低准确率的前提下显著减少过度搜索。

详情
AI中文摘要

智能体搜索使LLM能够通过迭代推理和外部搜索解决复杂的多跳问题。尽管有效,但这些系统在实践中常受限于一个关键缺陷:智能体无法识别自身知识边界,在内部知识足够时盲目触发搜索,甚至在已收集足够证据时未能终止搜索。缺乏自我感知导致严重的过度搜索,带来大量推理延迟和过高的计算成本。为此,我们提出SAAS,一种新颖的强化学习框架,旨在培养动态自我感知能力,精确调节搜索行为而不损害准确性。SAAS引入三个关键组件:(i) 搜索边界建模机制,通过对比禁用搜索和启用搜索的轨迹,识别策略演化下的搜索边界;(ii) 边界感知奖励模块,将这种边界意识转化为轨迹级惩罚,抑制不必要和冗余的搜索;(iii) 分阶段优化策略,利用顺序课程优先考虑推理而非搜索正则化,从而避免奖励黑客。大量实验表明,SAAS在保持准确性的同时大幅减少了过度搜索。我们的代码和实现细节已在https://github.com/XMUDeepLIT/SAAS发布。

英文摘要

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. This lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.

2606.09365 2026-06-16 cs.AI cs.CL 版本更新

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

经验造就熟练:通过自进化技能记忆实现可泛化的医疗智能体推理

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

发表机构 * Fudan University(复旦大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出SkeMex框架,通过技能记忆实现医疗智能体后部署自进化,无需更新模型权重,在临床任务中优于现有记忆型智能体。

详情
AI中文摘要

医疗智能体系统越来越期望支持交互式临床决策,而不仅仅是静态问答。在这种设置中,有效的智能体必须跨演化病例重用先前经验,然而现有的记忆机制通常保留原始历史轨迹,这些轨迹冗余、嘈杂且难以管理。更重要的是,它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑且可靠的经验以进行长期临床推理的能力。为弥补这一差距,我们提出SkeMex,一种部署后自进化框架,通过基于技能的记忆改进医疗智能体,无需更新模型权重。SkeMex将信息丰富的交互轨迹提炼为结构化技能,编码可重用的程序性知识,并将其组织成涵盖通用、任务特定和行动级经验的多分支存储库。为确定哪些记忆应被重用和保留,SkeMex从环境反馈中估计上下文相关的效用,并用其指导价值感知的检索和存储库治理。闭环的“读-写-评估-治理”生命周期通过写入新技能、更新效用、促进有用记忆和移除有害条目进一步支持持续进化。跨不同临床任务的实验表明,SkeMex在离线和在线设置中均持续优于代表性记忆型智能体。它还能跨模型骨干泛化并支持可迁移的技能记忆。所有数据和代码将公开发布。

英文摘要

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

5. 文本生成、摘要与编辑 12 篇

2606.15069 2026-06-16 cs.CL 新提交

CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

CoCoGEC:面向鲁棒语法纠错的反事实生成

Qianyu Wang, Xiaoman Wang, Yuanyuan Liang, Xinyuan Li, Yunshi Lan

发表机构 * East China Normal University(华东师范大学)

AI总结 提出CoCoGEC框架,通过生成词级和句级反事实样本并筛选高互信息实例,提升语法纠错模型在上下文扰动下的稳定性,在三个扰动数据集上取得显著F0.5提升。

详情
AI中文摘要

语法纠错(GEC)系统通常在GEC基准上进行训练和评估,但一旦周围上下文发生轻微扰动或扩展,其性能往往会急剧下降。这表明现有的GEC模型通常无法理解变化上下文中的错误模式。在本文中,我们深入研究了GEC任务的反事实,其中上下文的细微变化可能导致标签翻转问题。我们提出了CoCoGEC,一个反事实生成框架,该框架创建训练实例的副本,并改变与错误无关的上下文。我们的框架通过以下方式系统地生成反事实:(1)通过改变词级和句级上下文,生成保持原始实例错误模式及语法的句内和句间反事实;(2)通过选择具有翻转标签和高GEC互信息(MI)系数的实例来修正生成的反事实。大量实验表明,我们的方法显著提高了GEC模型的稳定性,优于一组数据增强基线。特别是,在扰动的BEA-19*、CoNLL-14*和TEM-8*数据集上,它分别实现了+9.9、+11.3和+20.8个点的绝对F0.5增益。我们的代码已发布在https://github.com/Quinnok/CoCoGEC。

英文摘要

Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models often fail to capture error patterns across varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can lead to label flipping. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns as well as the syntax of the original instances by altering the word-level and sentence-level contexts; and (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* datasets, respectively. Our code is released at https://github.com/Quinnok/CoCoGEC.
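For readers unfamiliar with the F0.5 metric reported above, the following minimal sketch shows the standard F-beta formula with beta = 0.5, which weights precision more heavily than recall. The function name `f_beta` and the example precision and recall values are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values (assumptions): a precise but conservative corrector.
print(f_beta(0.8, 0.4))            # F0.5
print(f_beta(0.8, 0.4, beta=1.0))  # F1, for comparison
```

With precision 0.8 and recall 0.4, F0.5 is higher than F1, which reflects the usual GEC preference for systems that avoid introducing wrong corrections.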

2606.15416 2026-06-16 cs.CL 新提交

Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

编码错误:多语言语法错误纠正中上下文示例的表征检索

Guangyue Peng, Wei Li, Wen Luo, Houfeng Wang

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室)

AI总结 提出从LLM内部状态提取语法错误表征(GER)用于检索上下文示例,显著提升多语言语法错误纠正的少样本性能,在低资源语言上F0.5提升达1.20倍。

Comments 15 pages, 6 figures

详情
Journal ref
Findings of the Association for Computational Linguistics: ACL 2025, pages 21166-21180, Vienna, Austria. Association for Computational Linguistics, 2025
AI中文摘要

语法错误纠正(GEC)涉及检测和纠正语法的错误使用。虽然具有上下文学习(ICL)能力的大型语言模型(LLM)在各种自然语言处理(NLP)任务上取得了显著进展,但它们在GEC上的少样本性能仍然次优。这主要是由于难以检索到能够捕捉错误模式而非语义相似性的合适上下文示例。在本文中,我们证明LLM可以通过其内部状态固有地捕捉与语法错误相关的信息。从这些状态中,我们提取了语法错误表征(GER),这是一种信息丰富且语义中立的语法错误编码。我们基于GER的新型检索方法显著提升了多语言GEC数据集上ICL设置的性能,提高了纠正的精确度。对于高资源语言,我们在8B大小的开源模型上的结果与Deepseek2.5和GPT-4o-mini等闭源模型相当。对于低资源语言,我们的F0.5分数比基线高出最多1.20倍。该方法为多语言GEC提供了一种更精确且资源高效的解决方案,为可解释的GEC研究提供了一个有前景的方向。

英文摘要

Grammatical Error Correction (GEC) involves detecting and correcting the wrong usage of grammar. While large language models (LLMs) with in-context learning (ICL) capabilities have shown significant progress on various natural language processing (NLP) tasks, their few-shot performance on GEC remains suboptimal. This is mainly due to the challenge of retrieving suitable in-context demonstrations that capture error patterns instead of semantic similarity. In this paper, we demonstrate that LLMs can inherently capture information related to grammatical errors through their internal states. From these states, we extract the Grammatical Error Representation (GER), an informative and semantically neutral encoding of grammatical errors. Our novel GER-based retrieval method significantly boosts performance in ICL settings on multilingual GEC datasets, improving the precision of correction. For high-resource languages, our results on 8B-sized open-source models match those of closed-source models such as Deepseek2.5 and GPT-4o-mini. For low-resource languages, our F0.5 scores surpass the baseline by up to a factor of 1.20. This method provides a more precise and resource-efficient solution for multilingual GEC, offering a promising direction for interpretable GEC research.

2606.15741 2026-06-16 cs.CL cs.AI 新提交

A Self Consistency Based Reranking for Narrative Question Answering

基于自一致性的叙事问答重排序

Molham Mohamed, Ali Hamdi

发表机构 * GitHub

AI总结 提出自一致性重排序框架,通过生成多个候选答案并基于语义一致性选择最终答案,提升叙事问答的鲁棒性和准确性。

详情
AI中文摘要

叙事问答(NQA)是自然语言处理中一项具有挑战性的任务,要求模型理解长文本上下文、捕捉事件间关系并生成连贯的响应。尽管预训练语言模型近期取得了进展,但大多数现有方法在推理时依赖单一解码输出,使其对生成变异性敏感,常导致答案不完整或不一致。为解决这一局限,我们提出了一种基于自一致性的自集成重排序框架用于叙事问答。该方法为每个故事-问题对生成多个候选答案,并根据生成响应间的语义一致性选择最终答案。这使得模型能够探索多样化的答案表述,同时通过基于共识的选择提高鲁棒性,而无需修改底层架构。该框架将预训练和微调的语言生成与多答案推理及基于相似度的重排序相结合。我们在NarrativeQA数据集上使用多种模型(包括FLAN-T5 Base和Small以及Pegasus-Large)在基线和微调设置下评估了所提方法。实验结果表明,该方法在所有模型上均持续提升了性能。特别是,FLAN-T5-Base在结合自集成推理后,性能从82.32%提升至86.66%(+4.34%),取得了最佳整体性能。此外,Pegasus-Large的提升最大,从72.50%提升至87.07%(+14.57%),凸显了所提策略的有效性。

英文摘要

Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, making them sensitive to generation variability and often resulting in incomplete or inconsistent answers. To address this limitation, we propose a self-ensemble, self-consistency-based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection without requiring modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34%) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57%), highlighting the effectiveness of the proposed strategy.
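The consensus-based selection described above can be sketched as follows: each candidate is scored by its mean similarity to the other candidates, and the highest-scoring one is returned. The abstract does not state which similarity measure is used, so the token-overlap F1 below is a stand-in assumption, and the function names and candidate strings are hypothetical.

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two answers (stand-in similarity, an assumption)."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(tokens_a), overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def rerank_by_agreement(candidates: list[str]) -> str:
    """Return the candidate that agrees most, on average, with the others."""
    if not candidates:
        raise ValueError("at least one candidate answer is required")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        sims = [token_f1(candidates[i], c) for j, c in enumerate(candidates) if j != i]
        return sum(sims) / len(sims)

    return candidates[max(range(len(candidates)), key=mean_agreement)]


# Hypothetical sampled answers for one story-question pair.
samples = ["he went to the market", "he went to market",
           "she stayed home", "he went to the market"]
print(rerank_by_agreement(samples))
```

In practice the candidates would come from sampling the same model several times, and an embedding-based similarity could replace `token_f1` without changing the selection logic.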

2606.15783 2026-06-16 cs.CL 新提交

ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

ttda704 在 SemEval-2026 任务 4:通过假名化和多视角句子对齐建模叙事结构

Tai Tran Tan, An Dinh Thien

发表机构 * University of Information Technology, Ho Chi Minh City, Vietnam(胡志明市信息技术大学) Vietnam National University, Ho Chi Minh City, Vietnam(越南国立大学胡志明市分校)

AI总结 提出基于对比学习和微调句子变换器的叙事相似度方法,包括单视角(智能层冻结)和多视角(主题/情节/结局投影头+自监督对齐)两条流水线,在合成数据上训练。

详情
AI中文摘要

我们介绍了对 SemEval 2026 任务 4:叙事故事相似性与叙事表示学习的方法。我们的解决方案使用对比学习与微调句子变换器来捕捉跨抽象主题、行动过程和结果的叙事相似性。我们开发了两条流水线:(Track A)单视角方法,通过智能层冻结编码完整叙事以减少过拟合;(Track B)多视角方法,使用视角特定的投影头和自监督对齐对主题、情节和结局进行建模。两条流水线均基于句子变换器模型,并在合成数据上使用对比损失进行训练。代码可在以下 GitHub 仓库获取:https://github.com/dinhthienan33/SemEval2026-Task4-ttda704。

英文摘要

We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across abstract themes, course of action, and outcomes. We develop two pipelines: (Track A) a single-view method that encodes full narratives with smart layer freezing to reduce overfitting, and (Track B) a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models and are trained with contrastive loss on synthetic data. The code is available at the following GitHub repository: https://github.com/dinhthienan33/SemEval2026-Task4-ttda704.

2606.16281 2026-06-16 cs.CL cs.AI 新提交

Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

现在谁应该主导解码?跟踪可靠轨迹以集成掩码扩散语言模型

Heecheol Yun, Joonhyung Park, Joowon Kim, Eunho Yang

发表机构 * KAIST(韩国科学技术院) AITRICS

AI总结 针对掩码扩散语言模型集成问题,提出TIE框架,通过跟踪答案相关位置的置信度动态,迭代识别并传递可靠解码轨迹,实现多模型协同生成。

Comments preprint

详情
AI中文摘要

掩码扩散语言模型(MDLM)已成为序列生成的一种独特范式。随着MDLM在能力和知识覆盖范围上变得多样化,一个重要问题是如何结合它们的知识。为此,我们首先研究了MDLM独特的解码动态。我们发现,成功的生成在答案相关位置上表现出稳定的置信度动态,而不可靠的轨迹通常可以通过注入来自其他模型的有希望的中间状态来纠正。受此观察启发,我们提出了TIE(基于轨迹的迭代集成),这是一个知识融合框架,其中MDLM迭代地识别可靠的解码轨迹并在模型之间传递它们。TIE跟踪答案相关位置上的置信度动态,以确定哪个模型当前遵循更可靠的轨迹,并选择性地跨模型传递部分去噪的序列。由于处于更有希望轨迹上的模型在去噪步骤中经常变化,TIE允许不同模型在生成的不同阶段贡献互补的优势。在多种推理任务上的强劲表现以及我们的分析表明,TIE为MDLM集成这一尚未充分探索的问题提供了一种实用方法。

英文摘要

Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We find that successful generations exhibit stable confidence dynamics over answer-relevant positions, while unreliable trajectories can often be corrected by injecting promising intermediate states from other models. Guided by this observation, we propose TIE (Trajectory-based Iterative Ensembling), a knowledge fusion framework in which MDLMs iteratively identify reliable decoding trajectories and relay them across models. TIE tracks confidence dynamics over answer-relevant positions to determine which model currently follows a more reliable trajectory and selectively transfers partially denoised sequences across models. As the model on the more promising trajectory often changes across denoising steps, TIE allows different models to contribute complementary strengths at different stages of generation. Strong performance across diverse reasoning tasks, along with our analyses, suggests that TIE offers a practical approach to the underexplored problem of MDLM ensembling.

2606.16322 2026-06-16 cs.CL 新提交

PaperJury: Due-Process Review for Bounded LaTeX Revision

PaperJury: 有界LaTeX修订的正当程序审查

Yiran Wang, Ruixuan An, Biao Wu, Wenhao Wang

AI总结 提出PaperJury系统,通过确定性编排与语义代理分离,实现LaTeX论文的对抗性审查、裁决和修订,确保有界安全编辑和终止结果。

Comments 10 pages, 5 figures

详情
AI中文摘要

对人工撰写的LaTeX计算机科学论文进行提交前加固不同于起草辅助,因为它需要对抗性全文审查、明确的不修复结果以及有界工件安全修订。现有的写作助手、批评生成器和以评审者为中心的循环缺乏跨轮次的持久问题标识、从批评到裁决的确定性路由,以及能够拒绝无效关注或推迟依赖作者关注的手稿控制。我们提出PaperJury,一个闭环的审查-裁决-修订-验证系统,建立在确定性-语义分离之上:确定性编排管理分解、冻结的声明主干、持久账本、路由、停止和精确一次补丁应用,而语义代理仅限于有界审查、判断和修复。PaperJury结合了有界整体审查、基于可争议性的路由、正当程序审判以及用于锚定有界编辑的风险比例防护链,产生无效丢弃、有效可修复和作者依赖的终止结果。在针对Vision、自然语言处理和机器学习论文的保留集上进行的两组专家评审评估中,我们评估了问题质量、裁决和路由质量、编辑安全性、收敛行为和成本,支持了承载安全性和完成逻辑应位于确定性编排而非模型自由裁量权的论点。PaperJury可在https://github.com/u7079256/paperjury获取。

英文摘要

Pre-submission hardening of human-authored LaTeX computer science papers differs from drafting assistance because it requires adversarial whole-paper review, explicit no-fix outcomes, and bounded artifact-safe revision. Existing writing assistants, critique generators, and judge-centered loops lack durable issue identity across rounds, deterministic routing from critique to adjudication, and manuscript control that can reject invalid concerns or defer author-dependent ones. We present PaperJury, a closed-loop review-verdict-revise-verify system built on a deterministic-versus-semantic split: deterministic orchestration manages decomposition, a frozen claim spine, a durable ledger, routing, stopping, and exactly-once patch application, while semantic agents are limited to bounded review, judgment, and repair. PaperJury combines bounded holistic review, contestability-based routing, a due-process trial, and risk-proportional guard chains for anchor-bounded edits, yielding terminal outcomes of invalid-drop, valid-fixable, and author-required. In a two-arm expert-review evaluation on held-out vision, natural language processing, and machine learning papers against four baselines, we assess issue quality, verdict and routing quality, edit safety, convergence behavior, and cost, supporting the thesis that load-bearing safety and completion logic should reside in deterministic orchestration rather than model discretion. PaperJury is available at https://github.com/u7079256/paperjury.

2606.16700 2026-06-16 cs.CL 新提交

Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

多轮反射掩码激发掩码扩散模型中的推理能力

Yanming Zhang, Yihan Bian, Jingyuan Qi, Yuguang Yao, Lifu Huang, Tianyi Zhou

发表机构 * University of Maryland(马里兰大学) Virginia Tech(弗吉尼亚理工大学) Intuit UC Davis(加州大学戴维斯分校) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出反射掩码(RM)方法,通过轻量级后训练使掩码扩散模型具备多轮掩码与去噪能力,实现迭代局部修正,无需架构改变,在文本生成、数独和图像编辑等任务中优于基线。

Comments 22 pages, 6 figures, 5 tables

详情
AI中文摘要

尽管自回归(AR)模型通常通过思维链推理和反思来执行推理,但它们对先前输出的改进仍然依赖于完全顺序生成,即使只需要局部编辑。相比之下,掩码扩散模型(MDMs)中的掩码机制自然支持对先前输出进行显式局部编辑,允许选择性细化而无需丢弃先前答案并从头生成另一个。虽然这一特性更接近人类通过迭代局部修正来纠正错误的方式,但现有的MDMs不支持多轮掩码和去噪。我们提出反射掩码(RM),通过轻量级后训练激发MDMs中这种内在的推理能力。RM提供了一种原生的测试时扩展,其中MDM基于不断演化的上下文迭代地重新审视和修正其先前的输出。为了利用来自先前轮次的见解(如AR推理),我们进一步引入了历史参考,这是一种参数无关的机制,在修正过程中利用中间去噪状态。我们的方法不需要架构改变,并且易于应用于现有的MDMs。在包括文本生成、数独和图像编辑在内的多种任务和模态中,反射掩码始终优于基于标准掩码的基线,并展现出强大的通用性,将RM定位为MDMs上推理的基本原语。

英文摘要

While reasoning on autoregressive (AR) models is often performed by chain-of-thought reasoning and reflection, their refinement of previous outputs still relies on fully sequential generation, even when only local edits are needed. In contrast, the masking mechanism in Mask Diffusion Models (MDMs) naturally supports explicit local edits on previous outputs, allowing selective refinement without discarding previous answers and generating another from scratch. While this property more closely aligns with how humans correct mistakes by iterative local refinement, existing MDMs do not support multi-turn masking and denoising. We propose Reflective Masking (RM), which elicits such an intrinsic reasoning capability in MDMs via lightweight post-training. RM provides a native test-time scaling, where an MDM iteratively revisits and revises its prior outputs based on evolving context. To exploit insights from previous turns like AR reasoning, we further introduce History Reference, a parameter-free mechanism that leverages intermediate denoising states during revision. Our approach requires no architectural changes and is easily applicable to existing MDMs. Across diverse tasks and modalities, including text generation, Sudoku, and image editing, Reflective Masking consistently outperforms standard masking-based baselines and demonstrates strong generality, positioning RM as a fundamental primitive for reasoning on MDMs.

2606.16845 2026-06-16 cs.CL cs.AI 新提交

Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts

鲁棒双信号融合:混合神经符号门控与压缩链式思维精炼用于社交媒体文本讽刺检测

Ankit Bhattacharjee, Krityapriya Bhaumik

发表机构 * Indian Institute of Technology Kharagpur(印度理工学院克勒格布尔分校)

AI总结 提出RDS融合框架,结合神经符号架构与压缩链式思维推理,在TweetEval和iSarcasm数据集上达到与微调BERTweet相当的性能,并显著优于监督方法。

Comments 11 pages total, 10 figures

详情
AI中文摘要

大型语言模型(LLM)默认倾向于字面语义解释,使得零样本讽刺检测成为一个持续的挑战。我们引入了鲁棒双信号(RDS)融合框架,这是一种混合神经符号架构,无需监督微调(SFT)即可压缩链式思维(CoT)推理轨迹。在严格保留的TweetEval测试集(N=734)上,RDS达到了78.1%的准确率和0.777的宏F1分数,与微调BERTweet的绝对性能上限相匹配。在高度不平衡的iSarcasm数据集上,冻结的CoT管道过滤了22.5%的分布外幻觉,实现了0.6726的零样本宏F1和0.4821的讽刺F1,优于多个强监督的SemEval Transformer集成。统计消融实验证实了这种结构协同作用:将符号先验添加到神经基线没有显著提升(p=0.242),而将CoT管道添加到该先验的边际收益被高度压缩(p=0.149)。只有所有三个信号的完整并发融合才能实现相对于基线的统计验证改进(p=0.005)。

英文摘要

Large Language Models (LLMs) natively default to literal semantic interpretations, making zero-shot irony detection a persistent challenge. We introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). Evaluated on a strictly held-out TweetEval test set (N=734), RDS achieves 78.1% accuracy and a Macro F1 of 0.777, matching the absolute performance ceiling of the fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset, the frozen CoT pipeline filters 22.5% of out-of-distribution hallucinations, yielding a zero-shot Macro F1 of 0.6726 and Ironic F1 of 0.4821, outperforming multiple heavily supervised SemEval transformer ensembles. A statistical ablation confirms this structural synergy: adding the symbolic prior to the neural baseline yields no significant gain (p = 0.242), and the marginal benefit of adding the CoT pipeline to that prior is heavily compressed (p = 0.149). Only the complete, concurrent fusion of all three signals achieves a statistically validated improvement over the baseline (p = 0.005).

2606.15591 2026-06-16 cs.AI cs.CL cs.MA 交叉投稿

Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems

智能检索与强化学习方程链:面向复杂新颖物理文字题的可控生成框架

Tirthankar Mittra

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出ARVRE两阶段框架,通过离线时序差分学习构建有效物理方程链,结合智能检索增强生成控制问题结构与难度,再由大语言模型生成自然语言问题,实现复杂、新颖且可解的物理文字题生成。

详情
AI中文摘要

生成高质量、新颖、复杂且可解的物理文字题(PWPs)在教育内容生成中仍是一个具有挑战性且未被充分探索的问题。现有方法多改编自数学文字题(MWP)生成,常产生模糊、不可解或结构简单且语言多样性有限的问题。我们提出ARVRE(智能检索值强化方程链),一个用于生成多样且数学有效的PWPs的两阶段框架。在第一阶段,使用一种离线时序差分学习形式构建有效的物理方程链,同时一个智能检索增强生成(RAG)框架动态选择主题特定的概念和词汇。这种设计能够显式控制问题结构和难度。在第二阶段,大语言模型(LLM)将方程链和检索到的概念转换为自然语言的物理问题。通过将生成过程基于有效方程链,我们的方法在保持数学正确性的同时,促进了语言多样性和上下文丰富性。人工和自动评估表明,ARVRE生成的PWPs比现有方法更复杂、新颖且可解。这些结果凸显了结合强化学习、检索和LLM用于可靠生成教育物理内容的潜力。

英文摘要

Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.

2403.13089 2026-06-16 cs.CL 版本更新

Automatic Summarization of Doctor-Patient Encounter Dialogues Using Large Language Model through Prompt Tuning

通过提示调优使用大型语言模型自动总结医患对话

Mengxian Lyu, Cheng Peng, Xiaohan Li, Patrick Balian, Jiang Bian, Yonghui Wu

发表机构 * University of Florida(佛罗里达大学) University of Florida Health Cancer Center(佛罗里达大学健康癌症中心)

AI总结 本研究提出通过提示调优生成式大型语言模型来自动总结医患对话,实验表明GatorTronGPT-20B模型在所有评估指标上表现最佳,且计算成本低。

详情
AI中文摘要

自动文本摘要是辅助临床医生提供连续协调护理的新兴技术。本研究提出了一种使用生成式大型语言模型总结医患对话的方法。我们开发了提示调优算法来指导生成式LLM总结临床文本。我们研究了提示调优策略、软提示的大小以及GatorTronGPT(一个使用2770亿临床和通用英语词汇、参数高达200亿的生成式临床LLM)的少样本学习能力。我们将GatorTronGPT与基于广泛使用的T5模型微调的先前解决方案进行了比较,使用了临床基准数据集MTS-DIALOG。实验结果表明,GatorTronGPT-20B模型在所有评估指标上均取得了最佳性能。所提出的解决方案计算成本低,因为在提示调优期间LLM参数不更新。本研究证明了通过提示调优使用生成式临床LLM进行临床自动文本摘要的效率。

英文摘要

Automatic text summarization (ATS) is an emerging technology to assist clinicians in providing continuous and coordinated care. This study presents an approach to summarize doctor-patient dialogues using generative large language models (LLMs). We developed prompt-tuning algorithms to instruct generative LLMs to summarize clinical text. We examined the prompt-tuning strategies, the size of soft prompts, and the few-shot learning ability of GatorTronGPT, a generative clinical LLM developed using 277 billion clinical and general English words with up to 20 billion parameters. We compared GatorTronGPT with a previous solution based on fine-tuning of a widely used T5 model, using a clinical benchmark dataset MTS-DIALOG. The experimental results show that the GatorTronGPT-20B model achieved the best performance on all evaluation metrics. The proposed solution has a low computing cost as the LLM parameters are not updated during prompt-tuning. This study demonstrates the efficiency of generative clinical LLMs for clinical ATS through prompt tuning.

2512.03503 2026-06-16 cs.CL 版本更新

Understanding LLM Reasoning for Abstractive Summarization

理解大语言模型在抽象摘要中的推理能力

Haohan Yuan, Haopeng Zhang

发表机构 * ALOHA Lab, University of Hawaii at Manoa(夏威夷大学马诺亚分校ALOHA实验室)

AI总结 本研究通过大规模比较8种推理策略和3种大型推理模型在8个数据集上的表现,发现推理并非万能,其效果依赖于策略和摘要设置,且存在质量与事实准确性之间的权衡。

Comments 27 pages, 15 figures

详情
AI中文摘要

推理在数学和代码生成等分析任务上显著提升了大型语言模型(LLMs),但其对抽象摘要的价值仍不明确。为填补这一空白,我们将通用推理策略适配到摘要场景,并在8个多样化数据集上对8种推理策略和3种大型推理模型(LRMs)进行了大规模比较研究,评估了摘要质量和事实忠实度。结果表明,推理并非通用解决方案,其有效性强烈依赖于策略和摘要设置。特别是,我们发现摘要质量与事实忠实度之间存在权衡。显式推理策略通常能提升基于参考的质量,但可能削弱事实基础,而LRMs中的隐式推理则表现出相反趋势。我们进一步发现,增加LRM的内部推理预算并不能可靠地改善摘要,甚至可能降低事实一致性。这些发现表明,对于摘要而言,更多的推理并不总是更好。有效的推理应保留忠实的压缩,而非引发过度阐述。我们的源代码已公开。

英文摘要

Reasoning has substantially improved Large Language Models (LLMs) on analytical tasks such as mathematics and code generation, but its value for abstractive summarization remains unclear. To address this gap, we adapt general reasoning strategies to the summarization setting and conduct a large-scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, evaluating both summary quality and factual faithfulness. Our results show that reasoning is not a universal solution and its effectiveness depends strongly on the strategy and the summarization setting. In particular, we find a trade-off between summary quality and factual faithfulness. Explicit reasoning strategies often improve reference-based quality, but may weaken factual grounding, whereas implicit reasoning in LRMs shows the opposite tendency. We further find that increasing an LRM's internal reasoning budget does not reliably improve summarization and can even reduce factual consistency. These findings suggest that, for summarization, more reasoning is not always better. Effective reasoning should preserve faithful compression rather than induce over-elaboration. Our source code is publicly available.

2606.05742 2026-06-16 cs.CL 版本更新

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

AdaPLD: 自适应检索与重用实现高效无模型推测解码

Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang

发表机构 * School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院) Department of Mathematical Sciences, Tsinghua University(清华大学数学科学部) JDT AI Infra(京东AI基础设施)

AI总结 针对现有基于重用的推测解码方法在词汇匹配失败时召回率低和确定性复制脆弱的问题,提出无需训练的自适应方法AdaPLD,通过语义相似性恢复重用机会并构建分支假设,实现最高3.10倍解码加速。

详情
AI中文摘要

推测解码通过在单次目标模型前向传播中验证多个草拟令牌来加速生成,减少了顺序解码迭代。无模型变体通过重用生成过程中已有的文本和模型状态来避免辅助草稿模型,但其加速效果取决于构建的草稿的可靠性。我们指出现有基于重用的方法存在两个局限性:基于词汇锚定的检索在表面形式变化下召回率有限,以及当检索上下文不能唯一确定续写时,确定性跨度复制可能脆弱。我们提出AdaPLD,一种无需训练的方法,自适应地改进检索和草稿构建。AdaPLD保留高精度的词汇重用,同时利用语义相似性在词汇匹配失败时恢复额外的重用机会。它进一步构建分支重用假设以考虑续写的不确定性,而不是依赖单个复制的跨度。在多个基准测试中,AdaPLD减少了目标模型前向传播次数,并实现了高达$3.10\times$的解码加速。

英文摘要

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to $3.10\times$ decoding speedup.
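
To make the reuse idea in this abstract concrete, here is a minimal sketch of the plain lexical baseline it describes (copying the span that followed an earlier occurrence of the current suffix), not AdaPLD itself. The function names, the n-gram length, and the span length are illustrative assumptions; semantic retrieval and branched hypotheses are not shown.

```python
def lookup_draft(tokens, ngram=3, span=5):
    """Lexical reuse draft (hypothetical helper, not the paper's code).

    Finds the most recent earlier occurrence of the last `ngram` tokens and
    returns up to `span` tokens that followed it, or [] if there is no match.
    """
    if len(tokens) <= ngram:
        return []
    key = tokens[-ngram:]
    # Scan right to left, skipping the trailing occurrence of the key itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == key:
            return tokens[start + ngram:start + ngram + span]
    return []


def accepted_prefix(draft, target_next):
    """Count leading draft tokens that match what the target model emits."""
    accepted = 0
    for drafted, verified in zip(draft, target_next):
        if drafted != verified:
            break
        accepted += 1
    return accepted


history = "the cat sat on the mat and the cat sat".split()
draft = lookup_draft(history)                      # copied continuation
kept = accepted_prefix(draft, ["on", "the", "rug"])  # tokens surviving verification
```

When the suffix has no exact earlier match, `lookup_draft` returns an empty draft, which is the low-recall case the abstract attributes to lexically anchored retrieval.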

6. 语义、语法与语言学分析 9 篇

2606.15510 2026-06-16 cs.CL cs.DL 新提交

AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

AthDGC:一个开放的历时希腊语树库及其印欧语平行语料

Nikolaos Lavidas, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, Evangelos Argyropoulos

发表机构 * National and Kapodistrian University of Athens(雅典国立和卡波迪斯特里安大学)

AI总结 提出首个跨越八个历时时期的开放许可依存句法树库,采用统一PROIEL XML 2.0模式,并与拉丁语、哥特语等印欧语进行跨对齐。

Comments 16 pages. Data paper for the v0.4 release of AthDGC. Concept DOI: 10.5281/zenodo.20439182. Companion site: https://athdgc.github.io

详情
AI中文摘要

AthDGC(“Athens-PROIEL”)是一个开放的端到端工作流和数据集。据我们所知,它是第一个公开许可的依存句法分析树库,涵盖希腊语的八个历时时期,即古风时期、古典时期、通用希腊语时期、晚期古代、拜占庭时期、晚期拜占庭时期、早期现代和现代希腊语,采用单一的PROIEL XML 2.0模式,并将《新约》按诗句级别与拉丁语(武加大译本)、哥特语(乌尔菲拉译本)、古教会斯拉夫语(Marianus译本)和古典亚美尼亚语进行交叉对齐。AthDGC建立在PROIEL树库家族(Haug and Johndal 2008; Eckhoff et al. 2018)之上,该家族为项目建立了模式和通用希腊语参考集。标注使用Stanford Stanza PROIEL训练的工作流;句子级对齐使用LaBSE,一种多语言句子嵌入模型;词级对齐通过AwesomeAlign程序使用多语言BERT注意力。v0.4版本提供精选样本和开源工具包;完整注释语料库分区仍在希腊国家HPC上进行v0.5审计。定量规模、每个见证的诗句计数和每个时期的注释行计数将在审计通过后的v0.5发布说明中报告。概念DOI:10.5281/zenodo.20439182。

英文摘要

AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: 10.5281/zenodo.20439182.

2606.16047 2026-06-16 cs.CL 新提交

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

从论证组件到图:一种具有置信门控的多智能体辩论方法用于论证关系识别

Jakub Bąba, Jarosław A. Chudziak

发表机构 * Faculty of Electronics and Information Technology, Warsaw University of Technology(华沙理工大学电子与信息技术学院)

AI总结 提出一种多智能体辩论框架,通过置信门控机制仅在不确定时进行辩论,在UKP语料上达到训练无关方法最高Macro F1,并生成可读辩论记录。

Comments Accepted for publication in the proceedings of KES 2026

详情
AI中文摘要

大型语言模型(LLMs)凭借其强大的通用推理能力,在论证挖掘(AM)领域受到越来越多的评估和应用。然而,标准的无训练模型常常遗漏复杂细节,特别是在需要将文本的两个部分一起分析的上下文中。此外,自我纠正机制往往会强化推理中的初始幻觉。克服这些限制通常需要昂贵的、领域特定的监督微调。最近的研究表明,多智能体范式可以通过支持者-反对者-裁判架构的辩证改进来解决组件分类任务中的此类弱点,为该领域的无训练方法指明了有希望的方向。在本文中,我们将该框架扩展并评估于论证关系识别与分类(ARIC)任务,将其重新表述为组件对之间的辩论。此外,我们引入了一种置信门控机制,使得仅在不确定的情况下进行辩论,而在置信度高时接受初始预测。在UKP Argument Annotated Essays v2语料库上,我们证明了选择性辩论在所有无训练方法中取得了最高的Macro F1,而对所有样本进行辩论则使性能低于其中一个基线。所有生成方法在Macro F1上也优于微调的RoBERTa模型,这表明Attack类的代表性不足对监督微调的损害大于对仅推理模型的影响。此外,我们的框架生成人类可读的辩论记录,提供了单智能体和监督分类器所缺乏的可解释性。

英文摘要

Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training-free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class was more damaging to supervised fine-tuning than to inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability absent from both single-agent and supervised classifiers.
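
The confidence gating described above can be sketched as a thin wrapper around two callables. This is a minimal illustration under stated assumptions: `classify`, `debate`, and the 0.8 threshold are hypothetical placeholders, and the abstract does not say how confidence is computed or which threshold the authors use.

```python
from typing import Callable, Tuple


def gated_predict(
    pair: Tuple[str, str],
    classify: Callable[[Tuple[str, str]], Tuple[str, float]],
    debate: Callable[[Tuple[str, str], str], str],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Tuple[str, bool]:
    """Return (label, debated) for one pair of argument components.

    The initial prediction is accepted when its confidence clears the
    threshold; otherwise the pair is handed to the (costlier) debate step.
    """
    label, confidence = classify(pair)
    if confidence >= threshold:
        return label, False
    return debate(pair, label), True


# Stub components, for illustration only.
def stub_classify(pair):
    return ("Support", 0.95) if "because" in pair[1] else ("Support", 0.40)


def stub_debate(pair, initial_label):
    return "Attack"


confident = gated_predict(("Claim A", "because B"), stub_classify, stub_debate)
uncertain = gated_predict(("Claim A", "however C"), stub_classify, stub_debate)
```

Only the second call reaches `stub_debate`, which mirrors the selective-debate setting the abstract contrasts with debating every sample.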

2606.16407 2026-06-16 cs.CL cs.LG 新提交

A Mechanistic Understanding of Pronoun Fidelity in LLMs

对大型语言模型中代词忠实性的机制理解

Katharina Trinley, Jesujoba O. Alabi, Dietrich Klakow, Vagrant Gautam

发表机构 * Saarland University(萨尔大学) Heidelberg Institute for Theoretical Studies(海德堡理论研究所)

AI总结 通过因果分析发现,代词忠实性由组实体绑定、近因偏差和刻板印象偏差三种因果子空间共同作用,解释了91-99.5%的行为。

详情
AI中文摘要

忠实且稳健的代词使用对于公平和连贯的生成至关重要,然而当多个指代对象使用不同代词时,大型语言模型大多会失败。为了研究推理、重复和偏差在此任务中的相互作用,先前的工作完全依赖行为方法,这可能无法反映模型的内部运作。因此,我们提供了关于代词忠实性的机制性、模型内部视角,测试了三种机制——组实体绑定(G)、近因偏差(R)和刻板印象偏差(S)——是否在多个SOTA语言模型中因果实现。使用无界分布式对齐搜索,我们发现三者作为因果子空间共存,分布在网络深度上。没有单一机制能完全解释模型行为,但三者的组合一致地解释了91-99.5%。注意力头分析进一步揭示了两种竞争的复制路径;组绑定和刻板印象共享一个局部化的概念级路径,检索绑定的职业-代词单元,而近因使用分布式的令牌级路径,重复表面形式。总之,代词忠实性源于同时活跃的因果子空间之间的竞争。

英文摘要

Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behavioural approaches, which may not reflect a model's internal workings. Therefore, we provide a mechanistic, model-internal perspective on pronoun fidelity, testing whether three mechanisms -- group entity binding (G), recency bias (R), and stereotypical bias (S) -- are causally implemented across several SOTA language models. Using Boundless Distributed Alignment Search, we find all three coexist as causal subspaces distributed across network depth. No single mechanism fully explains model behaviour, but a combination of the three consistently accounts for 91-99.5%. An attention head analysis further reveals two competing copying routes; group binding and stereotype share a localized concept-level route that retrieves a bound occupation-pronoun unit, while recency uses a distributed token-level route that repeats surface forms. In sum, pronoun fidelity arises from competition between simultaneously active causal subspaces.

2606.16836 2026-06-16 cs.CL 新提交

Does Traversal Order Matter? A Systematic Study of Tree Traversal Methods in Transformer Grammars

遍历顺序重要吗?Transformer语法中树遍历方法的系统研究

Zongru Liu, Pengyu Ji, Pengcheng Wang, Kewei Tu

发表机构 * School of Information Science and Technology, ShanghaiTech University(上海科技大学信息科学与技术学院)

AI总结 本文系统研究了Transformer语法中不同树遍历方法(深度优先、广度优先及新提出的产生式规则遍历)对语言建模、句法泛化和摘要生成的影响,揭示了嵌套组合与全局前瞻之间的权衡。

详情
AI中文摘要

Transformer语法(TGs)通过融入句法树结构增强了语言建模。尽管句法树在TGs中的线性化方式可能对模型性能产生显著影响,但现有研究仅依赖深度优先遍历(DFT)进行线性化。在本文中,我们通过探索广度优先遍历(BFT)和一种新颖的混合遍历策略——产生式规则遍历(PRT)来扩展遍历设计空间,该策略结合了BFT的结构前瞻性和DFT的早期词汇生成。我们将这些遍历方法与不同的树配置和掩码策略相结合,并在语言建模、句法泛化和摘要生成上实证评估其性能。我们揭示了嵌套组合与全局前瞻之间的固有权衡,为设计任务感知的Transformer语法提供了可操作的建议。

英文摘要

Transformer Grammars (TGs) enhance language modeling by incorporating syntactic tree structures. Despite the potentially significant impact on model performance of how syntactic trees are linearized in TGs, existing studies rely solely on Depth-First Traversal (DFT) for linearization. In this paper, we expand the traversal design space by exploring Breadth-First Traversal (BFT) and a novel hybrid traversal strategy, Production-Rule Traversal (PRT), which combines the structural lookahead of BFT with the early lexical generation of DFT. We integrate these traversal methods with varying tree configurations and masking strategies, and empirically evaluate their performance on language modeling, syntactic generalization and summarization. We reveal the inherent trade-offs between nested composition and global lookahead, providing actionable recommendations for designing task-aware Transformer Grammars.
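
As a rough illustration of why traversal order changes what the model sees, the sketch below linearizes one toy constituency tree depth-first and breadth-first. The tree encoding and the bracket tokens are assumptions made for this example and may differ from the paper's action sequences; Production-Rule Traversal is not sketched because the abstract does not specify it.

```python
from collections import deque

# A node is either a leaf word (str) or a (label, children) tuple.
TREE = ("S", [("NP", ["the", "dog"]), ("VP", ["barks"])])


def depth_first(tree):
    """Bracketed depth-first linearization: words appear in sentence order."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    out = [f"({label}"]
    for child in children:
        out.extend(depth_first(child))
    out.append(f"{label})")
    return out


def breadth_first(tree):
    """Level-order linearization: all labels of a level precede deeper nodes."""
    out, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):
            out.append(node)
        else:
            label, children = node
            out.append(label)
            queue.extend(children)
    return out


dft_tokens = depth_first(TREE)
bft_tokens = breadth_first(TREE)
```

In the depth-first sequence the first word appears right after its enclosing brackets open, while in the breadth-first sequence the whole upper structure (`S`, `NP`, `VP`) is emitted before any word, which is the lookahead versus early-lexical-generation contrast the abstract refers to.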

2606.16867 2026-06-16 cs.CL 新提交

Revisiting the Systematicity in Negation in the Era of In-Context Learning

重新审视上下文学习时代否定中的系统性

Hitomi Yanaka, Taisei Yamamoto

发表机构 * The University of Tokyo(东京大学) Riken(理化学研究所) Tohoku University(东北大学)

AI总结 通过行为与表征系统性分析,发现大型语言模型在上下文学习中能部分识别否定表达和范围,但无法完美执行,且功能向量在否定线索提取任务中可组合,但范围识别更具挑战。

Comments Accepted to the 6th Workshop Natural Language Meets Logic and Machine Learning (NALOMA2026) at ESSLLI2026

详情
AI中文摘要

理解否定句的含义仍然是语言模型面临的挑战之一,即使在大语言模型(LLMs)时代也是如此。我们从两个角度分析LLM对否定理解的系统性:行为系统性和表征系统性。对于行为系统性,我们确认通过示例和上下文学习,LLMs可以在一定程度上识别句子中的否定表达和范围,但无法达到完美性能。特别是,模型识别否定范围的难度因输出格式而异。对于表征系统性,我们分析对于理解否定至关重要的任务,功能向量可以从上下文示例中稳健构建的程度。实验表明,虽然功能向量可以针对否定线索提取任务进行组合,但提取用于识别范围的功能向量更具挑战性。

英文摘要

Understanding the meaning of negated sentences remains one of the challenges for language models, even in the era of large language models (LLMs). We analyze systematicity regarding LLM understanding of negation from two perspectives: behavioral systematicity and representational systematicity. For behavioral systematicity, we confirm that through demonstrations and in-context learning, LLMs can recognize negation expressions and scope within sentences to some extent, but they fail to achieve perfect performance. In particular, the difficulty of the negation scope recognition for models varies depending on the output format. For representational systematicity, we analyze the extent to which function vectors can be robustly constructed from in-context examples for tasks that are essential to understanding negation. The experiments suggest that while function vectors can be composed for negation cue extraction tasks, extracting function vectors for recognizing scope is more challenging.

2606.16084 2026-06-16 cs.AI cs.CL 交叉投稿

Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas

深海的韵律:抹香鲸叫声中双重模式的计算语言学检验

Mudit Sinha, Sanika Chavan

发表机构 * Independent Researchers(独立研究员)

AI总结 使用1483个抹香鲸叫声,通过计算语言学方法检验其是否具有双重模式结构,发现下层由点击节奏构成,上层显示序列依赖,下层为节奏型而非分段型。

Comments 22 pages, 2 figures, 4 tables. Preprint

详情
AI中文摘要

人类语言常被描述为在两个层次上结合结构:低层单元组合成更大的单元,然后这些单元再组合成更大的序列。我们使用多米尼加抹香鲸项目的1483个叫声,测试抹香鲸叫声中是否具有这种设计特征——双重模式。由于声学相似性可以模仿符号结构,我们将问题视为从连续音频中进行计算语言学结构发现,而不是直接关于语言或意义的断言。我们使用冻结音频编码器的共识、保留的结构测试、每统计量零假设和声学零假设可恢复性门控。证据支持一个狭窄的两层架构。在低层,点击组合成叫声不是通过稳定的有序规则,而是通过哪些点击存在以及它们之间的点击间节奏。在高层,叫声令牌显示回合级序列依赖,NSB二阶转移熵提升0.132比特(p = 0.002)。在节奏缩放下,编码器派生的点击身份强烈受速率限制,而叫声身份保持更稳定,在点击到叫声步骤中产生可测量的抽象梯度。仅节奏基线恢复了大量低层结构,但未能重现上层序列依赖信号。我们不声称语言、语义、感知或类似人类的音素。相反,我们报告了表示级别的证据,表明存在一种类似双重模式的架构,其低层是节奏型而非分段型,并提供了一个可移植的零假设控制框架,用于测试诱导声学令牌系统中的组合结构。

英文摘要

Human language has often been described as combining structure at two levels: lower-level units combine into larger units, which then combine into larger sequences. We test for this design feature, duality of patterning, in sperm whale codas using 1,483 codas from the Dominica Sperm Whale Project. Because acoustic similarity can imitate symbolic structure, we treat the problem as computational-linguistic structure discovery from continuous audio rather than as a direct claim about language or meaning. We use a consensus of frozen audio encoders, held-out structural tests, per-statistic nulls, and acoustic-null recoverability gates. The evidence supports a narrow two-tier architecture. At the lower tier, clicks compose into codas not by a stable ordered rule, but by which clicks are present together with their inter-click rhythm. At the upper tier, coda tokens show bout-level sequential dependence, with an NSB second-order transfer-entropy lift of 0.132 bits (p = 0.002). Under tempo scaling, encoder-derived click identity is strongly rate-bound, while coda identity remains substantially more stable, yielding a measurable abstraction gradient across the click-to-coda step. Rhythm-only baselines recover substantial lower-tier structure but fail to reproduce the upper-tier sequential-dependence signal. We do not claim language, semantics, perception, or human-like phonemes. Instead, we report representation-level evidence for a duality-of-patterning-like architecture whose lower tier is rhythmic rather than segmental, and provide a portable null-controlled framework for testing combinatorial structure in induced acoustic token systems.

2606.16687 2026-06-16 cs.AI cs.CL 交叉投稿

From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text

从情感预测到情感预报:纵向文本中不同信息源的证据

Sadia Noor, Seemab Latif, Raja Khurram Shahzad, Mehwish Fatima

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST)(国立科技大学电气工程与计算机科学学院) Department of Communication, Quality Management and Information Systems, Mid Sweden University(中瑞典大学通信、质量管理和信息系统系)

AI总结 本文区分当前情感估计与未来情感变化预报,提出TSAP框架和ACF-Hybrid模型,实验表明文本语义支持当前预测,而数值轨迹动力学更适用于未来变化预报。

详情
AI中文摘要

对纵向文本中的维度情感建模需要区分当前情感估计与未来情感变化预报。现有方法通常将每个文本视为独立观测,并对两个任务应用类似假设,而不检验它们是否依赖不同的信息源。本文利用纵向自我报告生态短文和情感词条目研究这一区别。我们提出特质-状态情感预测(TSAP)框架及其时间扩展E-TSAP用于逐文本效价和唤醒度预测,在来自91名用户的1737条条目的保留预测测试集上评估。我们进一步提出情感变化预报混合模型(ACF-Hybrid)用于下一步情感变化预报,在来自46名用户的保留预报测试集上评估。对于预测,E-TSAP在效价上达到复合皮尔逊相关系数0.670,在唤醒度上达到0.449。对于预报,文本表示的表现不如紧凑的数值轨迹基线:包含文本的模型在效价上仅达到r=0.316,在唤醒度上达到r=0.284,而简单的先前状态基线分别达到r=0.615和r=0.670。ACF-Hybrid使用维度特定的数值轨迹特征,在效价上达到r=0.659,在唤醒度上达到r=0.658。这些结果表明,文本语义支持当前情感预测,而未来情感变化通过先前数值轨迹动力学能更好地捕获。

英文摘要

Modeling dimensional affect in longitudinal text requires distinguishing current affect estimation from future affective change forecasting. Existing approaches often treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether they rely on different information sources. This paper investigates that distinction using longitudinal self-reported ecological essays and feeling-word entries. We propose the Trait--State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, evaluated on a held-out prediction test set of 1,737 entries from 91 users. We further propose the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting, evaluated on a held-out forecasting test set of 46 users. For prediction, E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, textual representations perform worse than compact numeric trajectory baselines: the text-inclusive model achieves only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, using dimension-specific numeric trajectory features, achieves r=0.659 for valence and r=0.658 for arousal. These results show that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.

2606.16893 2026-06-16 cs.AI cs.CL cs.LO 交叉投稿

Symbolic Informalization: Fluent, Productive, Multilingual

符号非形式化:流畅、高效、多语言

Aarne Ranta

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg(查尔姆斯理工大学与哥德堡大学计算机科学与工程系)

AI总结 提出符号非形式化方法,将形式数学可靠地转换为自然语言,基于Dedukti和Grammatical Framework的中间语言架构,实现多证明系统与多自然语言的流畅转换。

详情
AI中文摘要

符号非形式化能够将形式数学可靠地转换为自然语言。它有望使机器验证的内容在不损失精确性的情况下对人类可读。在传统证明系统使用中,符号非形式化将语法糖的有限机制推广为数学的普通语言。在由人工智能和自动形式化构建证明的场景中,符号非形式化可以解释具体构建了什么。本文概述了Informath项目,旨在展示符号非形式化如何以合理的开发工作量产生流畅的文本,并处理多种形式语言和自然语言。Informath基于中间语言架构,其中Dedukti作为不同证明系统(Agda、Lean、Rocq)之间的枢纽,而Grammatical Framework(GF)负责不同自然语言的语言正确性和变体。

英文摘要

Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In a traditional proof system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

2502.18795 2026-06-16 cs.CL 版本更新

Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs

什么都行?语言模型中(不)可能语言学习的跨语言研究

Xiulin Yang, Tatsuya Aoyama, Yuekun Yao, Ethan Gotlieb Wilcox

发表机构 * Georgetown University(乔治城大学) Saarland University(萨尔兰大学)

AI总结 通过训练语言模型学习不可能和类型学上未证实的语言,跨12种语言实验发现GPT-2小模型能部分区分自然语言与不可能语言,但弱于人类归纳偏置。

Comments ACL 2025

详情
AI中文摘要

语言模型(LMs)能否为人类语言学习提供见解?反对这一观点的常见论点是,由于LMs的架构和训练范式与人类截然不同,它们可以像学习自然语言一样轻松地学习任意输入。我们通过训练LMs建模不可能和类型学上未证实的语言来检验这一说法。与以往仅关注英语的工作不同,我们使用两个新构建的平行语料库,对来自4个语系的12种语言进行了实验。结果表明,虽然GPT-2小模型在很大程度上能够区分已证实语言与其不可能对应物,但并未在所有已证实语言和所有不可能语言之间实现完美分离。我们进一步通过基于Greenberg的普遍性20调整词序,测试GPT-2小模型是否区分类型学上已证实和未证实的具有不同NP顺序的语言。我们发现,模型的困惑度分数不能区分已证实与未证实的词序,但其在泛化测试上的表现却能做到。这些发现表明,LMs表现出一些类似人类的归纳偏置,尽管这些偏置比人类学习者中的要弱。

英文摘要

Do language models (LMs) offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LMs can learn arbitrary inputs as easily as natural languages. We test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 languages from 4 language families with two newly constructed parallel corpora. Our results show that while GPT-2 small can largely distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg's Universal 20. We find that the model's perplexity scores do not distinguish attested vs. unattested word orders, while its performance on the generalization test does. These findings suggest that LMs exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.

7. 多模态语言处理 12 篇

2606.15714 2026-06-16 cs.CL cs.RO 新提交

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

超越英语:揭示视觉-语言-动作模型中的多语言差距

Hanyang Chen, Hongliang Li, Jiarui Cao, Yang Li, Yang Jiang, Haonan Wen, Kaiyu Huang, Shengnan Guo, Huaiyu Wan

发表机构 * Beijing Jiaotong University(北京交通大学)

AI总结 本研究首次系统探究VLA模型的多语言指令跟随能力,发现英语训练模型在其他语言上性能显著下降,并提出多语言主成分对齐方法缩小差距。

详情
AI中文摘要

视觉-语言-动作模型最近展示了从大规模多模态数据学习通用机器人策略的能力。然而,大多数现有的VLA系统主要使用英语指令进行训练和评估,使得它们理解和执行其他语言指令的能力在很大程度上未被探索。虽然底层的大语言模型通常具备多语言能力,但这些多语言能力在训练过程中是否能迁移到VLA尚不清楚。在这项工作中,我们首次对VLA模型中的多语言指令跟随进行了系统研究。我们首先通过扩展现有基准测试并翻译其指令来构建多语言指令。利用这些指令,我们在模拟环境中评估了几个代表性的VLA模型在一系列任务上的表现。我们的实验揭示了一个显著的多语言差距:主要用英语指令训练的模型在评估其他语言时表现出显著的性能下降,即使底层语言骨干是多语言的。我们提供了若干发现和分析来理解多语言差距。跨语言迁移行为分析表明,性能下降与指令理解和动作执行都相关。表示分析表明,多语言指令引起的表示偏移可能导致了多语言差距。受这些发现的启发,我们进一步探索了提高VLA多语言性能的策略。我们提出了一种简单而有效的多语言微调方法——多语言主成分对齐,该方法利用主成分分析获取主成分子空间并对齐投影后的多语言表示,有效缩小了多语言性能差距。

英文摘要

Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.

2606.15910 2026-06-16 cs.CL 新提交

Calibrated Triage, Not Autonomy: Confidence Estimation for Medical Vision-Language Models

校准的分诊,而非自主:医学视觉-语言模型的置信度估计

Reza Khanmohammadi, Kundan Thind, Mohammad M. Ghassemi

发表机构 * Michigan State University(密歇根州立大学)

AI总结 针对医学视觉-语言模型在回答时可能忽略图像而依赖语言先验的问题,提出使用置信度估计进行校准分诊,通过评估七种置信度估计器,发现高置信度区域是区分可用估计器的关键,最佳探针可将错误率从41-45%降至1-4%,但无估计器在所有领域和模型中一致最优。

详情
AI中文摘要

视觉-语言模型可以流畅且自信地回答关于医学图像的问题,但几乎不使用图像,而是依赖语言先验。在医学中,这是最严重的失败,因为答案看起来可信而实际不可信,唯一的保护是足够可靠的置信度分数,以告知系统何时应该弃权。我们提出一个部署问题而非准确性问题:模型可以安全地单独处理多少成像工作,以及哪种置信度信号使其成为可能。我们在五个开放权重的LVLM和三个涵盖广泛临床成像、放射学和病理学的医学视觉问答数据集上评估了七种置信度估计器,每个探针仅在自然图像上训练且未经适应应用。重新表述为有界选择性预测(仅在置信度超过阈值时自动化案例,其余推迟),比较结果是警示性的。标准指标是糟糕的指南:辨别力几乎无法区分方法,而廉价自我报告的弱校准可以通过域外温度缩放廉价地去除,而不改变可部署的产量。区分可用估计器的是临床医生所依赖的高置信度区域:最弱的基线在其错误的41%到45%上自信地错误,而最佳探针为1%到4%,并且没有估计器在领域或模型上可靠地最佳。安全交接在两个层面控制:基础模型能力设定上限,因此校准良好的分数在20%错误容忍度下可以恢复大约三分之一的放射学案例,但几乎无法恢复病理学案例;然后置信度层决定可以达到该上限的多少。今天可用的角色是校准的分诊,而非自主:自动化校准分数标记为安全的案例,其余路由给临床医生。我们发布所有输出、正确性判断和置信度分数,以及代码。

英文摘要

A vision-language model can answer a question about a medical image fluently and confidently while barely using the image, leaning instead on language priors. In medicine this is the failure that matters most, because the answer looks trustworthy and is not, and the only protection is a confidence score reliable enough to tell the system when to abstain. We ask a deployment question rather than an accuracy one: how much imaging work a model can safely handle alone, and which confidence signal makes that possible. We evaluate seven confidence estimators across five open-weight LVLMs and three medical visual-question-answering datasets spanning broad clinical imaging, radiology, and pathology, with every probe trained only on natural images and applied without adaptation. Recast as bounded selective prediction (automate a case only when confidence clears a threshold, defer the rest), the comparison is cautionary. The standard metrics are poor guides: discrimination barely separates the methods, and the weak calibration of a cheap self-report is cheaply removed by off-domain temperature scaling without changing deployable yield. What distinguishes a usable estimator is the high-confidence region a clinician acts on: the weakest baselines are confidently wrong on 41 to 45 percent of their errors against 1 to 4 percent for the best probe, and no estimator is reliably best across domains or models. Safe handoff is governed at two levels: base-model competence sets a ceiling, so a well-calibrated score recovers roughly a third of radiology cases at a 20 percent error tolerance but almost none of pathology; the confidence layer then decides how much of that ceiling is reachable. The usable role today is calibrated triage, not autonomy: automate the cases a calibrated score marks safe, route the rest to a clinician. We release all outputs, correctness judgments, and confidence scores, with code.
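
The bounded selective prediction setup in this abstract (automate a case only when confidence clears a threshold, defer the rest) can be sketched as follows. This is a minimal illustration with made-up data; the function names are hypothetical, and the paper's estimators, thresholds, and exact yield metric are not reproduced here.

```python
def selective_report(confidences, correct, threshold):
    """Return (coverage, error_rate) over cases with confidence >= threshold.

    coverage   = share of all cases handled without deferral
    error_rate = share of the automated cases that were answered wrongly
    """
    automated = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(automated) / len(confidences) if confidences else 0.0
    errors = sum(1 for ok in automated if not ok)
    error_rate = errors / len(automated) if automated else 0.0
    return coverage, error_rate


def max_coverage_under_tolerance(confidences, correct, tolerance):
    """Largest coverage reachable while keeping error_rate <= tolerance."""
    best = 0.0
    for threshold in sorted(set(confidences)):
        coverage, error_rate = selective_report(confidences, correct, threshold)
        if error_rate <= tolerance and coverage > best:
            best = coverage
    return best


# Toy data, for illustration only.
conf = [0.9, 0.8, 0.6, 0.4]
ok = [True, True, False, True]
yield_at_20 = max_coverage_under_tolerance(conf, ok, tolerance=0.20)
```

On this toy data only the two highest-confidence cases can be automated at a 20 percent error tolerance, so half the workload is deferred; a confidently wrong case near the top of the ranking would shrink that share further.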

2606.16494 2026-06-16 cs.CL cs.AI cs.CV 新提交

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

迷失在末尾:多模态检索增强问答中的首因偏差

Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) The Ohio State University(俄亥俄州立大学)

AI总结 研究多模态知识型视觉问答中检索上下文的位置依赖,发现不同于纯文本的U形效应,出现首因偏差(开头优于末尾),并通过消融实验定位原因为指令调优阅读器的提示槽0。

Comments 15 pages, 9 figures. Under review at EMNLP 2026

详情
AI中文摘要

基于知识的视觉问答(KB-VQA)通过将阅读器条件化于从维基百科规模知识库检索的段落,使视觉-语言系统能够回答超出其参数知识的问题。在纯文本长上下文LLM中,检索上下文的使用遵循Liu等人(2024)的U形“迷失在中间”效应:上下文开头和结尾的信息被使用,中间部分被忽略。这种效应是否会迁移到部署的多模态KB-VQA中尚不清楚。为填补这一空白,我们设计了首个针对多模态KB-VQA中阅读器侧位置依赖的受控探针:一种黄金位置协议,其中只有黄金段落的提示槽在问题内变化。我们在三个开源7B/8B VLM阅读器和两个KB-VQA基准上运行,k最大为20。形状从U形翻转为首因:在每个阅读器-基准组合上,黄金在开头比黄金在结尾高出16到26个点,我们称这种效应为“迷失在末尾”。三项针对性消融实验缩小了原因:纯文本对照显示多模态设置将已存在的文本模式首因放大了2.2到4.5倍,图像位置和干扰物洗牌消融共同将根源定位到指令调优阅读器的提示槽0。在冻结的阅读器上,三种检索侧修复(MMR、oracle重排序、基于排名的重排序)均未缩小差距(无可区分的改进)。我们的发现表明,recall@k是部署KB-VQA的错误指标,缩小差距需要阅读器侧干预;我们发布该协议作为评估此类干预的受控工具。

英文摘要

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.
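
The gold-position protocol described above varies only the prompt slot of the gold passage while the distractors stay fixed. A minimal sketch of that manipulation, assuming passages are plain strings and accuracy is already measured per slot; the helper names and the numbers are illustrative, not taken from the paper.

```python
def place_gold(distractors, gold, slot):
    """Return the passage order with `gold` inserted at index `slot`.

    The distractors keep their relative order, so the gold slot is the
    only thing that changes between conditions for a given question.
    """
    if not 0 <= slot <= len(distractors):
        raise ValueError("slot must lie between 0 and len(distractors)")
    return distractors[:slot] + [gold] + distractors[slot:]


def primacy_gap(accuracy_by_slot):
    """Accuracy with gold in the first slot minus accuracy in the last slot."""
    return accuracy_by_slot[0] - accuracy_by_slot[-1]


distractors = ["d1", "d2", "d3"]
gold_first = place_gold(distractors, "GOLD", 0)
gold_last = place_gold(distractors, "GOLD", len(distractors))
gap = primacy_gap([0.62, 0.50, 0.45, 0.40])  # made-up accuracies
```

A positive gap corresponds to the primacy pattern the abstract reports; a U-shaped curve would instead show comparable accuracy at both ends and a dip in the middle slots.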

2606.16843 2026-06-16 cs.CL 新提交

Data-Driven Decoding of Russell's Circumplex Model of Affect

基于数据驱动的Russell情感环状模型解码

Amdjed Belaref, Samir Sadok, Zineb Noumir, Renaud Seguier

发表机构 * Alten CentraleSupélec IETR UMR CNRS 6164(中央理工-高等电力学院 IETR CNRS 6164 联合研究单位) Inria at Univ. Grenoble Alpes, CNRS, LJK(法国国家信息与自动化研究所,格勒诺布尔阿尔卑斯大学,CNRS,LJK)

AI总结 本文研究Transformer嵌入是否恢复Russell环状模型的几何规律,通过文本和语音模型实验,发现多模态融合完美对齐情感排序,零样本下细粒度情感词接近人类映射坐标。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

情感计算日益依赖深度学习来表示情感,然而潜在空间通常是不透明的高维黑箱。本文研究Transformer的嵌入是否恢复Russell环状模型的几何规律。我们统一了两个互补实验,检验以下假设:在文本和语音上训练模型后,其潜在空间编码了与效价-唤醒一致的拓扑结构,并再现了类似人类的邻域关系。具体而言,我们评估了基于Transformer的文本(RoBERTa)和语音(wav2vec 2.0)编码器以及多模态Transformer融合架构提取的深度表示,使用了MSP-Podcast等自然数据集和受控的LLM生成刺激。我们的分析表明,文本和音频的多模态融合与Russell的主要情感排序实现了完美的拓扑对齐。此外,在零样本设置中,使用通用文本嵌入,投影的细粒度情感术语接近其已建立的人类映射坐标。我们的贡献是一个新颖的数据驱动框架,用于验证情感模型,证明Russell环状结构内在地编码于这些模态的嵌入中,而不仅仅是人类标注的产物,从而弥合了心理学理论与表示学习之间的差距。

英文摘要

Affective computing increasingly relies on deep learning to represent emotions, yet latent spaces often remain opaque, high-dimensional black boxes. This paper investigates whether Transformers' embeddings recover the geometric regularities of Russell's circumplex model. We unify two complementary experiments testing the hypothesis that, after training models on text and speech, their resulting latent spaces encode a topology consistent with valence-arousal and reproduce human-like neighborhood relations. Specifically, we evaluate deep representations extracted from Transformer-based text (RoBERTa) and speech (wav2vec 2.0) encoders, along with a multimodal Transformer fusion architecture, across naturalistic datasets like MSP-Podcast and controlled LLM-generated stimuli. Our analysis reveals that multimodal fusion of text and audio yields perfect topological alignment with Russell's primary emotion ordering. Furthermore, in a zero-shot setting using generic text embeddings, projected fine-grained emotion terms fall close to their established human-mapped coordinates. Our contribution is a novel, data-driven framework for validating emotion models, demonstrating that Russell's circumplex structure is intrinsically encoded in the embeddings of these modalities rather than being solely an artifact of human labeling, thereby bridging the gap between psychological theory and representation learning.

2606.16158 2026-06-16 cs.CV cs.CL 交叉投稿

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

必要时聚焦:用于无训练视觉定位的自适应路由与协作定位

Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei

发表机构 * East China University of Science and Technology(华东理工大学) Tsinghua University(清华大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学)

AI总结 提出LazyMCoT动态框架,通过自适应路由评估不确定性,对简单查询跳过处理,对困难样本利用协作定位模块进行两阶段精炼,在提升推理精度的同时降低平均推理延迟。

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)在跨模态推理方面表现出色,但它们通常难以感知复杂高分辨率图像中的细粒度细节。最近的无训练方法通过图像缩放和局部裁剪来解决这一问题。然而,不加区分地应用这些操作会导致简单查询的计算冗余,并且可能因截断必要的全局上下文或引入无关的背景噪声而降低准确性。为此,我们提出了LazyMCoT,一个动态且无需训练的框架,能够根据样本难度自适应地分配视觉定位工作。该框架具有自适应路由机制,通过单次前向传递的首词统计量来评估预测不确定性。这有效地绕过了置信度高的案例,同时通过保形校准确保困难样本的召回。对于这些具有挑战性的案例,协作定位模块通过两阶段精炼过程,将模型固有的跨模态注意力与外部视觉专家相结合。该精炼过程生成精确的局部显示,以恢复小目标或被遮挡的目标。在多个基准上的大量实验表明,LazyMCoT通过同时提高推理精度和降低平均推理延迟,与基于训练的方法相媲美。我们的代码可在https://github.com/TencentBAC/LazyMCoT获取。

英文摘要

While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at https://github.com/TencentBAC/LazyMCoT.

2606.16295 2026-06-16 cs.CV cs.CL 交叉投稿

VisualClaw: A Real-Time, Personalized Agent for the Physical World

VisualClaw:面向物理世界的实时个性化智能体

Haoqin Tu, Jianwen Chen, Zijun Wang, Siwei Han, Juncheng Wu, Hardy Chen, Haonian Ji, Kaiwen Xiong, Jiaqi Liu, Peng Xia, Jieru Mei, Hongliang Fei, Jason Eshraghian, Zeyu Zheng, Yuyin Zhou, Huaxiu Yao, Cihang Xie

发表机构 * UC Santa Cruz(加州大学圣克鲁兹分校) UNC-Chapel Hill(北卡罗来纳大学教堂山分校) Google(谷歌) UC Berkeley(加州大学伯克利分校)

AI总结 提出VisualClaw,一种自进化多模态智能体,通过混合编码和技能进化机制降低部署成本并提升准确性,在多个视频QA基准上实现平均98%的API成本削减和最高15.80%的准确率提升。

Comments H. T. and J. C. contribute to this project equally

详情
AI中文摘要

视觉语言模型正作为复杂多模态任务的通用接口。然而,部署仍面临三个差距:VLMs在处理密集视频帧和长提示时通常产生高延迟和成本,智能体框架在部署后保持静态,标准视频QA基准不测试智能体是否能在工具使用工作区内使用视觉证据。我们提出VisualClaw,一个围绕两个原则构建的自进化多模态智能体。首先,混合编码通过级联门过滤信息较少的流式帧,并通过热/冷top-k注入压缩文本技能库,从而降低部署成本。其次,技能进化让智能体从失败中学习:检索的记忆作为直接拼接上下文或引导证据条件化进化器,产生技能库更新以帮助未来问题。在4个视频QA基准上使用2个VLM,VisualClaw相比全帧上传平均降低每问题API成本98%,相比离线均匀8帧基线降低25.9%,同时在大多数设置中提升准确率,例如在EgoSchema上使用Gemini 3 Flash平均提升3.85%,峰值提升15.80%。为解决基准测试方面的差距,我们整理了VisualClawArena,一个通过严格五阶段流程构建的200场景多模态智能体基准;模型必须使用视频证据、文档、动态更新和工作区内的可执行检查。在VisualClawArena上,相同的框架配合计算机使用智能体后端,相比无进化基线,Codex (GPT-5.5)的宏观准确率提升2.9%,Claude Code (Sonnet 4.6)提升3.2%,相比均匀采样基线成本降低9.5%。这些特性使VisualClaw自然适用于边缘应用,其中级联将1小时流式会话从约3,600次API上传减少到仅5-20次调用,自进化使其成为完美的个性化助手。

英文摘要

Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% relative to the offline uniform 8-frame baseline, while boosting accuracy in most settings, e.g., by an average of 3.85% and a peak of 15.80% on EgoSchema with Gemini 3 Flash. To address the benchmark gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by 2.9% for Codex (GPT-5.5) and 3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

2602.22391 2026-06-16 cs.CL 版本更新

Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework

检测孟加拉语模因中的仇恨和煽动性内容:一个新的多模态数据集和共注意力框架

Rakib Ullah, Mominul Islam, Md Sanjid Hossain, Md Ismail Hossain

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 针对孟加拉语模因中仇恨和煽动性内容检测的研究空白,构建了首个区分煽动性内容与直接仇恨言论的数据集Bn-HIB,并提出多模态共注意力融合模型MCFM,通过联合分析视觉和文本特征实现更准确分类。

Comments Added public link to dataset and fixed typo in abstract

详情
AI中文摘要

互联网模因已成为社交媒体上的一种主要表达形式,包括在孟加拉语社区中。虽然模因通常具有幽默性,但也可能被利用来传播针对个人和群体的攻击性、有害和煽动性内容。由于其讽刺性、微妙性和文化特异性,检测这类内容异常困难。对于孟加拉语等低资源语言,这一问题更为突出,因为现有研究主要关注高资源语言。为填补这一关键研究空白,我们引入了Bn-HIB(孟加拉语仇恨煽动良性)数据集,包含3,247个手动标注的孟加拉语模因,分为良性、仇恨或煽动性三类。值得注意的是,Bn-HIB是首个在孟加拉语模因中区分煽动性内容与直接仇恨言论的数据集。此外,我们提出了MCFM(多模态共注意力融合模型),一种简单而有效的架构,可共同分析模因的视觉和文本元素。MCFM采用共注意力机制来识别并融合来自每种模态的最关键特征,从而实现更准确的分类。我们的实验表明,MCFM在Bn-HIB数据集上显著优于多个最先进模型,证明了其在此精细任务中的有效性。为促进可重复性和未来研究,Bn-HIB数据集已通过Mendeley Data公开发布。警告:本文包含可能令部分观众不安的内容,请观众自行判断。

英文摘要

Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that jointly analyses both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. To facilitate reproducibility and future research, the Bn-HIB dataset has been made publicly available through Mendeley Data. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.

2606.13578 2026-06-16 cs.CL cs.AI cs.LG cs.MM cs.RO 版本更新

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

LabVLA:在科学实验室中落地视觉-语言-动作模型

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

发表机构 * Zhejiang University(浙江大学) Shanghai AI Laboratory(上海人工智能实验室) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 针对科学实验室中机器人执行协议面临的数据和实体瓶颈,提出模拟数据引擎RoboGenesis和两阶段训练策略LabVLA,在LabUtopia基准上取得最高平均成功率。

Comments Work in progress. Project website at https://zjunlp.github.io/LabVLA/

详情
AI中文摘要

科学实验室越来越依赖AI系统来推理实验,但物理实验操作仍超出其能力范围。AI可以帮助阅读文献、生成假设和规划协议,但实验台前的协议执行仍需人类操作员。视觉-语言-动作(VLA)模型为书面协议与机器人执行之间提供了一种可能的接口,但现有策略主要在家庭和桌面演示上训练,很少遇到科学实验室中的仪器、透明液体或固定协议工作流。弥补这一差距需要实验室特定的监督和统一的学习框架,以适应执行实验协议所使用的不同机器人实体。因此,我们将数据和实体视为与模型设计并列的核心瓶颈。为解决数据方面的问题,我们构建了RoboGenesis,这是一个基于模拟的工作流和数据引擎,能够从原子技能组合配置的实验室工作流,验证和过滤 rollout,并跨支持的机器人配置文件导出结构化演示。在策略方面,我们提出了LabVLA,采用两阶段训练方案:首先进行FAST动作标记预训练,使Qwen3-VL-4B-Instruct骨干网络在学习任何连续控制之前具备动作意识;然后进行流匹配后训练,在知识隔离下附加一个DiT动作专家。在LabUtopia基准上,LabVLA在分布内和分布外设置下均达到了所有评估基线中最高的平均成功率。

英文摘要

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

2308.06035 2026-06-16 cs.AI cs.CL 版本更新

Attention, not scale, drives human-AI alignment in multimodal language prediction

注意力,而非规模,驱动多模态语言预测中的人机对齐

Viktor Kewenig, Andrew Lampinen, Samuel A. Nastase, Christopher Edwards, Quitterie Lacome D'Elascombe, Akilles Rechardt, Jeremy I Skipper, Gabriella Vigliocco

发表机构 * Psychology and Language Science, Experimental Psychology, University College London, London, UK(心理学与语言科学、实验心理学,伦敦大学学院,伦敦,英国) Google Deepmind, Mountain View, US(谷歌DeepMind,山景城,美国) Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA(普林斯顿神经科学研究所,普林斯顿大学,普林斯顿,新泽西州,美国) Computer Science Department, Exeter University(计算机科学系,埃克塞特大学)

AI总结 本研究通过比较五种视觉-语言模型与600名人类在视觉世界范式中的表现,发现添加视觉上下文显著提升模型与人类在预测评分上的一致性,且注意力机制而非模型规模是主要驱动因素。

Comments 39 pages, 6 Figures, published in NPJ Artificial Intelligence

详情
AI中文摘要

人类通常利用视觉上下文来预测即将出现的词语。目前视觉-语言模型在多大程度上产生类似行为尚不清楚。在这里,我们将五个最先进的预训练系统与600名人类参与者并排放置在基于网络的视觉世界范式中。在100个六秒电影片段中,模型和参与者接收纯文本或同步视频和文本,并判断指定目标词接下来出现的可能性;全程记录人类眼动。添加视觉上下文在所有架构中均增加了模型与人类在可预测性评分上的一致性(平均Delta r = 0.18),且参数大小无影响。当视觉上下文信息丰富时,Transformer注意力显著提高了一致性。两个Transformer模型的注意力图与人类注视相对应,当场景包含信息性线索时,解释了高达70%的参与者间方差。值得注意的是,跨模态注意力可靠地追踪了语义线索上的预期性人类注视。这些结果表明,当前基于Transformer的视觉-语言模型可以在语言预测期间近似利用视觉上下文的人类行为——并且对信息性线索的选择性注意力,而非纯粹的模型规模,是这种对齐的主要驱动因素。

英文摘要

Humans routinely draw on visual context to predict upcoming words. To what extent current vision-language models produce comparable behaviour is unclear. Here we placed five state-of-the-art pretrained systems side-by-side with 600 human participants in a web-based Visual-World Paradigm. On each of 100 six-second movie clips, models and participants received either text only or synchronised video and text and judged how likely a specified target word was to appear next; human eye movements were tracked throughout. Adding visual context increased model-human alignment in predictability ratings across all architectures (average Delta r = 0.18) with no impact of parameter size. When visual context was informative, transformer attention significantly increased alignment. Attention maps from two transformer models corresponded with human gaze, explaining up to 70% of the inter-participant variance when the scene contained informative cues. Notably, cross-modal attention reliably tracked anticipatory human fixations on semantic cues. These results suggest that current transformer-based vision-language models can approximate human behaviour exploiting visual context during language prediction - and that selective attention to informative cues, not sheer model scale, is the principal driver of this alignment.

2510.01444 2026-06-16 cs.AI cs.CL cs.LG 版本更新

Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning

双不确定性引导的多模态推理策略学习

Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu

发表机构 * Tencent Hunyuan(腾讯文汇) University of Maryland(马里兰大学) University of North Carolina(北卡罗来纳大学)

AI总结 提出DUPL方法,通过量化感知不确定性和输出不确定性来引导策略更新,在多个多模态推理基准上显著提升模型准确率,优于现有方法。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)已经提升了多模态大语言模型的推理能力。然而,现有方法通常将视觉输入视为确定性的,忽略了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源于复杂推理还是模糊感知,从而无法有针对性地分配探索或学习信号。为了解决这一问题,我们引入了DUPL,一种用于多模态RLVR的双不确定性引导策略学习方法,该方法量化并利用感知不确定性(通过对称KL散度)和输出不确定性(通过策略熵)来指导策略更新。通过建立不确定性驱动的反馈循环并采用动态分支优先级机制,DUPL重新校准策略优势,将学习重点放在具有高感知或决策模糊性的状态上,从而实现超越被动数据增强的有效目标探索。在涵盖数学和通用领域的多个多模态推理基准上,DUPL取得了显著提升。它将Qwen2.5-VL的准确率提升了高达12.3%(3B)和7.9%(7B),将Qwen3-VL-Instruct的准确率提升了高达10.7%(4B)和12.4%(8B),持续优于GRPO,同时无缝泛化到其他算法(DAPO,平均+6.5%)和架构(LLaVA-OneVision-1.5,平均+4.7%)。这些结果表明,DUPL是一种有效且可泛化的多模态RLVR方法。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Evaluated on diverse multimodal reasoning benchmarks spanning mathematical and general domains, DUPL achieves solid gains. It improves Qwen2.5-VL accuracy by up to 12.3% (3B) and 7.9% (7B), and Qwen3-VL-Instruct by up to 10.7% (4B) and 12.4% (8B), consistently outperforming GRPO, while seamlessly generalizing to alternative algorithms (DAPO, +6.5% avg) and architectures (LLaVA-OneVision-1.5, +4.7% avg). These results demonstrate that DUPL is an effective and generalizable approach for multimodal RLVR.
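The two uncertainty measures named in the abstract are standard quantities, and a minimal sketch may help make them concrete. The abstract does not say which pair of distributions the symmetric KL divergence compares, so the sketch assumes (hypothetically) next-token distributions for the same prompt under a clean and a perturbed image. The sum form KL(p||q) + KL(q||p) is used here; some authors halve it. All names and logits are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p: List[float], q: List[float]) -> float:
    """Symmetric KL as KL(p||q) + KL(q||p) (sum form, not halved)."""
    return kl(p, q) + kl(q, p)


def entropy(p: List[float], eps: float = 1e-12) -> float:
    """Shannon entropy of a distribution, in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)


# Hypothetical logits: same prompt, clean image vs. perturbed image (assumption).
p_clean = softmax([2.0, 0.5, 0.1])
p_perturbed = softmax([0.3, 1.8, 0.2])

perceptual_uncertainty = symmetric_kl(p_clean, p_perturbed)  # large if the view change flips the prediction
output_uncertainty = entropy(p_clean)                        # large if the policy itself is undecided
```

How these two scalars are combined to recalibrate the policy advantage is specific to the paper and is not reproduced here.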

2602.00344 2026-06-16 cs.CV cs.AI cs.CL 版本更新

When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

当RAG有害:诊断和缓解检索增强LVLMs中的注意力分散

Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh, Yao Nie, Xiaoxiao Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文发现检索增强生成(RAG)在LVLMs中导致注意力分散(AD)问题,即检索文本抑制视觉注意力并偏离问题相关区域,提出MAD-RAG方法通过双问题公式和注意力混合来解耦视觉定位与上下文整合,在三个基准上提升性能并纠正大部分失败案例。

Comments 19 pages, 13 figures

详情
AI中文摘要

虽然检索增强生成(RAG)是增强大型视觉语言模型(LVLMs)在基于知识的VQA任务上的主导范式之一,但最近的工作将RAG失败归因于对检索上下文的注意力不足,并提出减少分配给图像令牌的注意力。在这项工作中,我们识别了先前研究忽略的一个不同失败模式:注意力分散(AD)。当检索上下文足够(高度相关或包含正确答案)时,检索文本全局抑制视觉注意力,并且图像令牌上的注意力从问题相关区域转移。这导致模型在原本无需检索文本就能正确回答的问题上失败。为了缓解这个问题,我们提出了MAD-RAG,一种无需训练的干预方法,通过双问题公式解耦视觉定位与上下文整合,并结合注意力混合以保留图像条件证据。在OK-VQA、E-VQA和InfoSeek上的大量实验表明,MAD-RAG在不同模型家族中始终优于现有基线,相对于原始RAG基线分别取得了高达4.76%、9.20%和6.18%的绝对增益。值得注意的是,MAD-RAG纠正了高达74.68%的失败案例,且计算开销可忽略不计。

英文摘要

While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

2605.10157 2026-06-16 cs.CV cs.CL 版本更新

MolSight: Molecular Property Prediction with Images

MolSight: 基于图像的分子属性预测

Aaditya Baranwal, Akshaj Gupta, Yogesh S Rawat, Shruti Vyas

发表机构 * University of Central Florida(中央佛罗里达大学) Birla Institute of Technology and Science(比拉理工学院和科学学院)

AI总结 MolSight首次系统研究基于视觉的分子属性预测,通过10种视觉架构和7种预训练策略,在10个下游任务中展示性能,提出化学引导课程提升效果,以更低的FLOPs实现优异结果。

详情
AI中文摘要

每一种被合成过的分子均可绘制为2D骨架图,但现代属性预测更关注分子图、3D构象或十亿参数级语言模型。我们提出MolSight,首次系统研究基于视觉的分子属性预测。使用10种视觉架构、7种预训练策略和200万张分子图像,在10个下游任务中评估性能,涵盖物理性质回归、药物发现分类和量子化学预测。为应对预训练分子结构复杂度差异,提出化学引导课程:五种结构复杂度描述符将语料库分为五个难度递增的层级,持续优于非课程基线。证明单个渲染的键线式(bond-line)图像经视觉编码器处理即可实现有竞争力的分子属性预测,即仅凭视觉获得化学洞察。最佳课程训练配置在10个基准中的5个取得最佳结果,在全部10个基准上均位列前两名,FLOPs比最接近的多模态竞争者低80倍。

英文摘要

Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present MolSight, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and 2M molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a chemistry-informed curriculum: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e., chemical insight from sight alone. The best curriculum-trained configuration achieves the top result on 5 of 10 benchmarks and top two on all 10, at 80× lower FLOPs than the nearest multi-modal competitor.

8. 语音语言联合与音频文本 13 篇

2606.15059 2026-06-16 cs.CL 新提交

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

长语音同声翻译的实用评估方法

Yulin Xue, Siqi Ouyang, Lei Li

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对长语音同声翻译评估困难的问题,提出一种基于ASR、强制对齐和句子嵌入对齐的实用方法,实现句子级延迟与质量度量,揭示现有系统在长语音上延迟累积严重。

Comments Accepted to IWSLT 2026 Scientific Track

详情
AI中文摘要

同声语音翻译(SimulS2ST)实现了实时跨语言通信,但现有评估主要关注短语音或预分割语音,而非长语音连续输入。先前的方法难以复现,且其假设不适用于端到端系统。我们提出一种针对长语音SimulS2ST的实用评估方法。给定源语音、预分割的源文本和参考翻译,我们对生成的目标语音运行自动语音识别(ASR)和强制对齐以恢复词级时间戳,然后应用基于句子嵌入的对齐器将目标文本与其对应的源句子匹配。这使得能够计算句子级的延迟和质量指标,包括YAAL和xCOMET,这些指标随后聚合为最终的系统级分数。在代表性SimulS2ST系统上的实验表明,该方法在实践中有效,并揭示了当前系统在长语音上遭受显著的延迟累积。

英文摘要

Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.

2606.15266 2026-06-16 cs.CL 新提交

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

评估与保留英汉语音翻译中的词汇重音

Yuchen Song, Xi Chen, Mingze Li, Satoshi Nakamura

发表机构 * The Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)) Shenzhen Loop Area Institute, China(深圳环域研究所)

AI总结 针对英汉语音翻译中词汇重音跨语言传递不足的问题,构建重音标注数据集和普通话重音检测器,提出跨语言重音评估指标,并微调CosyVoice3构建重音感知S2ST系统,实验表明该系统在重音翻译能力上显著优于现有系统。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

语音到语音翻译(S2ST)系统在语义准确性和语音自然度方面取得了显著进展。然而,词汇重音(强调和说话者意图的关键线索)的跨语言传递仍然严重缺乏探索,加之缺乏针对汉语等声调语言的可靠自动评估指标。我们通过构建一个重音标注的中文数据集和一个基于XLS-R的普通话重音检测器,研究了英汉S2ST重音传递。结合英语EmphAssess系统,我们提出了一种新的跨语言重音评估客观指标。此外,我们微调了CosyVoice3以构建一个重音感知的S2ST系统。实验表明,我们提出的S2ST架构在重音翻译能力上显著优于现有系统,同时保持了有竞争力的翻译质量。此外,我们的评估指标与人类主观判断具有强相关性。

英文摘要

Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains heavily underexplored, compounded by a lack of reliable automatic evaluation metrics for tonal languages like Chinese. We investigate English-to-Chinese S2ST stress transfer by constructing a stress-annotated Chinese dataset and an XLS-R-based Mandarin stress detector. Integrating this with the English EmphAssess system, we propose a novel objective metric for cross-lingual stress evaluation. Furthermore, we fine-tune CosyVoice3 to build a stress-aware S2ST system. Experiments demonstrate that our proposed S2ST architecture significantly outperforms existing systems in stress translation capability while maintaining competitive translation quality. Furthermore, our evaluation metric exhibits a strong correlation with human subjective judgments.

2606.15984 2026-06-16 cs.CL 新提交

ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

ROMPAR:罗马尼亚口音语音识别的形态补全与人口统计去偏

Andrei-Marius Avram, Aureliu-Valentin Antonie, Ştefan-Bogdan Badea, Andrei Florea, Robert-Nicolae Zaharoiu, Dumitru-Clementin Cercel

发表机构 * National University of Science and Technology POLITEHNICA Bucharest(布加勒斯特理工大学)

AI总结 提出ROMPAR语料库和多任务对抗训练框架,通过指数衰减机制和LLM引导解码,实现人口统计不变性和形态补全,WER显著降低,形态重建F1达96.6%。

详情
AI中文摘要

由于人口统计偏差、方言变异以及分割过程中话语截断等技术伪影,议会程序的自动转录面临重大挑战。本文介绍了罗马尼亚议会语音语料库(ROMPAR)数据集,这是一个17.80小时的罗马尼亚和摩尔多瓦议会语音语料库,具有双重标注的真实数据和重构词片段的显式标签。为了构建鲁棒的ASR系统,我们提出了一个多任务对抗训练框架,强制实现跨年龄、性别和方言的人口统计不变性。我们通过引入对抗系数的指数衰减机制来解决生成架构中对抗目标固有的不稳定性。此外,我们实现了一种具有位置依赖权重的LLM引导解码策略,以促进截断终端词的形态补全。我们的结果表明,所提出的框架显著降低了WER,并在形态重建中达到了96.6%的F1分数。

英文摘要

Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROManian PARliamentary Speech Corpus (ROMPAR) dataset, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech, featuring double-annotated ground truth and explicit labels for reconstructed word fragments. To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. We address the inherent instability of adversarial objectives in generative architectures by introducing an exponential decay mechanism for the adversarial coefficients. Furthermore, we implement an LLM-guided decoding strategy with position-dependent weighting to facilitate morphological completion of truncated terminal words. Our results demonstrate that the proposed framework significantly reduces WER and achieves an F1-score of 96.6% in morphological reconstruction.
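The exponential decay of the adversarial coefficients mentioned above can be sketched as a simple schedule. This is a hedged illustration only: the abstract gives neither the functional form nor the constants, so the form lambda_0 * exp(-rate * step) and the parameter names `lambda0` and `decay_rate` are assumptions, not the authors' settings.

```python
import math


def adversarial_coefficient(step: int, lambda0: float = 1.0, decay_rate: float = 1e-3) -> float:
    """Weight of the adversarial (demographic) objective at a training step.

    Assumed form: lambda0 * exp(-decay_rate * step). The weight starts at
    lambda0 and shrinks smoothly, so the adversarial term has less
    influence late in training.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return lambda0 * math.exp(-decay_rate * step)


# Hypothetical schedule sampled at a few training steps.
schedule = [adversarial_coefficient(s) for s in (0, 500, 1000, 5000)]
```

In a multi-task setup one coefficient of this kind would typically be kept per adversarial head (age, gender, dialect), though whether the paper shares or separates them is not stated in the abstract.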

2606.16019 2026-06-16 cs.CL cs.LG cs.SD 新提交

Scaling Human and G2P Supervision for Robust Phonetic Transcription

扩展人类与G2P监督以实现鲁棒语音转录

Alexander Metzger, Aruna Srivastava, Ruslan Mukhamedvaleev

发表机构 * Koel Labs LLC

AI总结 研究自动音标转写中人类标注与G2P监督的扩展规律,发现当人类标注少于20-30小时时G2P有效,超过后无益甚至降低鲁棒性,而ASR预训练可显著提升性能。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

专家音标标注成本高昂,尤其对于非标准方言和非典型语音。一种常见替代方法是使用字素到音素(G2P)模型从文本转录中大规模自动生成音标标签。我们研究了自动音标转写性能如何随英语中人类和G2P监督的扩展而变化。使用一个涵盖母语、非母语和卒中后语音的精心策划的80小时基准测试,我们确定了一个监督质量阈值:只有当人类标注少于20-30小时时,G2P监督才有帮助。超过此阈值,它不提供显著益处,并可能降低跨方言鲁棒性。在此阈值之后有效的是ASR预训练,我们使用它实现了比先前系统加权音素特征错误率降低2.3倍,在非母语和失语症语音上取得了强劲提升。这些结果表明,数量驱动的G2P扩展可能对鲁棒泛化产生递减收益。

英文摘要

Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We study how automatic phonetic transcription performance scales with human and G2P supervision in English. Using a curated 80-hour benchmark spanning native, non-native and post-stroke speech, we identify a supervision quality threshold: G2P supervision helps only when fewer than 20-30 hours of human annotation are available. Beyond this threshold, it provides no significant benefit and can reduce cross-dialect robustness. What is effective after this threshold is ASR pretraining, which we use to achieve a 2.3x reduction in weighted phone feature error rate over prior systems, with strong gains on non-native and aphasic speech. These results suggest that quantity-driven G2P scaling may yield diminishing returns for robust generalization.

2606.16472 2026-06-16 cs.CL 新提交

From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

从意识到遵循:通过上下文感知解码弥合口语对话系统中的上下文鸿沟

Che Hyun Lee, Heeseung Kim, Sungroh Yoon

发表机构 * ECE, Seoul National University(首尔大学电气与计算机工程系) IPAI, Seoul National University(首尔大学IPAI研究所) Department of AI, University of Seoul(首尔市立大学人工智能系)

AI总结 提出音频适配的上下文感知解码方法,通过对比有无关键上下文的输出分布,放大多模态上下文信号,解决口语对话系统中上下文遵循问题。

Comments Interspeech 2026 Main Track

详情
AI中文摘要

尽管端到端口语对话系统取得了成功,但在多轮对话中保持严格的上下文遵循仍然是一个挑战。先前的工作将这些失败归因于模型忘记对话历史,但我们强调了一个同样关键但被忽视的瓶颈:潜在上下文意识与主动遵循之间的鸿沟。尽管模型内部能识别相关的历史话语,但强大的参数先验在解码时常会掩盖这些信号。为弥合这一鸿沟,我们提出了一种音频适配的上下文感知解码方法。通过利用内部注意力机制隔离关键历史轮次,我们的方法在推理时对比有无此关键上下文的输出分布,直接放大多模态上下文信号。在Audio MultiChallenge基准上的评估表明,在语义记忆和自我一致性子任务上取得了显著改进,成功实现了严格、忠实于上下文的遵循。

英文摘要

Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.
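The contrast between output distributions with and without the key context can be sketched as follows. This uses one common contrastive form from text-only context-aware decoding, (1 + alpha) * log p(with context) - alpha * log p(without context), as an assumption; the audio-adapted variant in the paper, its attention-based selection of key rounds, and its value of `alpha` are not given in the abstract. All names and logits are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(
    logits_with_ctx: List[float],
    logits_without_ctx: List[float],
    alpha: float = 0.5,
) -> List[float]:
    """Score tokens by boosting what the key context adds over the prior.

    Assumed form: (1 + alpha) * log p_with - alpha * log p_without.
    alpha = 0 recovers ordinary decoding with the full context.
    """
    lp_with = log_softmax(logits_with_ctx)
    lp_without = log_softmax(logits_without_ctx)
    return [(1 + alpha) * a - alpha * b for a, b in zip(lp_with, lp_without)]


def greedy_pick(scores: List[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 3-token vocabulary. Token 1 is favoured by the parametric prior;
# the key context raises token 0, but not enough to win under plain decoding.
with_ctx = [1.0, 1.2, 0.0]
without_ctx = [0.0, 2.0, 0.0]

plain_choice = greedy_pick(log_softmax(with_ctx))                  # follows the prior
cad_choice = greedy_pick(contrastive_scores(with_ctx, without_ctx))  # follows the context
```

In this toy case plain greedy decoding still selects the prior-favoured token, while the contrastive score selects the token whose probability the context increased, which is the awareness-versus-adherence gap the abstract describes.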

2606.16568 2026-06-16 cs.CL cs.AI 新提交

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

快速判断何时,谨慎决定谁:基于扩散增强的双过程多轮对话

Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

发表机构 * Deakin University(迪肯大学) Griffith University(格里菲斯大学)

AI总结 针对多说话人对话中的轮次转换问题,提出音频两阶段流水线,先快速检测轮次边界,再轻量验证决定是否转移并预测下一说话人,扩散增强进一步改善检测性能。

详情
AI中文摘要

可靠的轮次转换对于口语对话系统至关重要。然而,现有方法大多针对双说话人交互设计,难以处理包含重叠和快速说话人切换的现实多说话人音频。我们在VoxConverse数据集上研究多说话人轮次转换,并提出一个纯音频的两阶段流水线,将何时触发轮次边界与是否实际转移话语权分开。一个快速触发器扫描音频并提出候选的轮次结束时间,而一个轻量验证器仅在这些时间运行,以决定Hold或Shift,并支持下一说话人预测。我们报告了完整多说话人设置下的结果,以及为便于比较而采用的受控双人(dyadic)top-2投影下的结果。我们还研究了基于扩散的、保留标签的背景音频混合作为数据增强策略。结果显示,与基线相比,转移检测有所改善,扩散增强进一步提升了性能。

英文摘要

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

2606.16807 2026-06-16 cs.CL 新提交

Connecting Speech to Words through Images

通过图像连接语音与文字

Gabriel Pirlogeanu, Dan Oneata, Horia Cucu, Herman Kamper

AI总结 提出一种基于视觉的方法,利用图像和语音描述构建口语词汇表,无需文本监督,在口语词检索和关键词检测中优于神经基线。

Comments Accepted at EUSIPCO 2026 - 5 pages, 3 figures, 2 tables

详情
AI中文摘要

在没有明确文本监督的情况下,我们如何学习书面单词与其口语对应词之间的映射?我们提出了一种基于视觉的方法,仅使用图像及其口语描述来构建口语词汇表。首先,图像字幕系统用于构建代表图像中显著视觉概念的书面词汇表。对于每个单词,我们找到其图像字幕包含该单词的话语。然后,我们使用无监督词发现技术对齐这些话语,以定位目标单词的实例。结果是口语单词片段与书面单词相关联——所有这些都在没有任何文本监督的情况下完成。在口语单词检索和关键词检测实验中,所提出的方法在更具可解释性的同时,优于强大的神经基线。这些结果证明了该方法在英语中的可行性,并激励了未来在缺乏转录的低资源语言上的工作。

英文摘要

How can we learn the mapping between written words and their spoken counterparts in the absence of explicit textual supervision? We present a visually grounded method for building a vocabulary of spoken words using only images and their spoken descriptions. First, image captioning systems are used to build a vocabulary of written words representing salient visual concepts in the images. For each word, we then find utterances whose image captions contain that word. Then we use an unsupervised word discovery technique to align these utterances to locate instances of the target word. The result is spoken word segments that are linked to written words -- all accomplished without any text supervision. In spoken word retrieval and keyword spotting experiments, the proposed approach outperforms a strong neural baseline while being more interpretable. These results demonstrate the feasibility of the approach in English and motivate future work on low-resource languages without transcripts.

2606.14820 2026-06-16 cs.SD cs.AI cs.CL eess.AS 交叉投稿

Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models

频谱-时间干扰混淆空间音频基础模型中的相位编码

Yuxuan Chen, Haoyuan Yu, Peize He

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Jilin University(吉林大学) Hunan University(湖南大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出基于双耳掩蔽级差的心理声学基准,评估空间自监督音频模型对微秒级耳间相位精细结构的编码能力,发现通用双耳SSL模型依赖频谱-时间干扰纹理而非真实相位计算。

Comments Accepted to INTERSPEECH 2026; 6 pages, 3 figures

详情
AI中文摘要

最近的空间自监督音频模型在定位任务上取得了高性能,引发了对它们编码微秒级耳间相位精细结构能力的疑问。我们提出了一个基于双耳掩蔽级差的心理声学基准来评估这一点。使用均衡抵消基线和GCC-PHAT阳性对照,我们评估了九个冻结的音频模型,涵盖双耳SSL、单耳SSL和神经音频编解码器。四个单耳阴性对照产生零BMLD,确认了双耳特异性。两个通用双耳SSL模型表现出最小的相位敏感性,而专用双耳空间SSL模型实现了与分析基线相当的BMLD。渐进式物理消融实验表明,通用双耳SSL模型依赖于频谱-时间干扰纹理而非跨通道相位计算。语音中的高检测率反映了对宽带包络而非真实相位编码的混淆依赖。

英文摘要

Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming binaural specificity. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.

2606.14922 2026-06-16 cs.SD cs.AI cs.CL eess.AS 交叉投稿

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

情感语音合成中学习潜在表示的实证研究

Vinh Dang Quang, Huy Ngo Quang

发表机构 * Aimesoft JSC

AI总结 本文针对VLSP 2022情感语音合成任务,通过将说话人嵌入和韵律瓶颈集成到FastSpeech 2中,实现了单说话人情感语音生成及跨说话人风格迁移。

Comments 4 pages

详情
AI中文摘要

在过去的几年中,由于深度学习,语音合成领域取得了巨大进步。越来越多的基于深度学习的TTS系统被开发出来,使得生成具有高可懂度和自然度的语音成为可能。同时,控制表现力仍然是一个大问题,以不同风格或方式生成语音最近受到了社区的广泛关注。本文旨在为VLSP 2022的情感语音合成(ESS)任务提供我们的解决方案,该任务允许从给定的输入文本生成具有所需情感表达的自然人声。通过将说话人嵌入、韵律瓶颈集成到FastSpeech 2中,我们的系统有望生成单个说话人的情感语音(子任务1),并将另一个说话人的说话风格迁移到具有中性非表达性数据的目标说话人,同时保留目标说话人的身份(子任务2)。

英文摘要

Over the last couple of years, the field of speech synthesis has improved dramatically thanks to deep learning. More and more deep learning-based TTS systems have been developed, making it possible to produce voices with high intelligibility and naturalness. Meanwhile, controlling expressiveness remains a major challenge, and generating speech in different styles or manners has recently received a lot of attention from the community. This paper presents our solutions to the emotional speech synthesis (ESS) task at VLSP 2022, which involves generating humanlike, natural-sounding voice from a given input text with a desired emotional expression. By integrating a speaker embedding and a prosody bottleneck into FastSpeech 2, our systems show promise in generating emotional speech of a single speaker (Sub-task 1) and in transferring speaking styles from another speaker to a target speaker with neutral, non-expressive data while retaining the target speaker's identity (Sub-task 2).

2501.17615 2026-06-16 cs.CL cs.SD eess.AS 版本更新

Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

跨语言嵌入聚类用于低资源多语言语音识别中的分层Softmax

Zhengdong Yang, Qianying Liu, Sheng Li, Fei Cheng, Chenhui Chu

AI总结 提出一种基于跨语言嵌入聚类构建分层Softmax解码器的方法,通过共享相似令牌表示提升低资源多语言语音识别精度。

Comments Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

详情
Journal ref
in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 4226-4238, 2025
AI中文摘要

我们提出了一种新颖的方法,聚焦于自动语音识别(ASR)的解码阶段,以增强多语言性能,特别是对于低资源语言。该方法利用跨语言嵌入聚类方法构建分层Softmax(H-Softmax)解码器,使得不同语言中的相似令牌能够共享相似的解码器表示。它解决了先前基于Huffman的H-Softmax方法的局限性,该方法在令牌相似性评估中依赖浅层特征。通过在15种语言的下采样数据集上的实验,我们证明了该方法在提高低资源多语言ASR准确性方面的有效性。

英文摘要

We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
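The idea of a hierarchical softmax decoder can be illustrated with a two-level factorization, p(token) = p(cluster) * p(token | cluster). The sketch below is a simplified Python illustration, not the paper's decoder: it assumes the clusters are already given (the paper derives them from cross-lingual embedding clustering), uses only two levels, and the token names and scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def two_level_softmax(cluster_scores, token_scores_by_cluster):
    """Return p(token) = p(cluster) * p(token | cluster) for every token.

    cluster_scores: one score per cluster.
    token_scores_by_cluster: list of dicts mapping token -> score, one dict
    per cluster; each token is assumed to belong to exactly one cluster.
    """
    cluster_probs = softmax(cluster_scores)
    token_probs = {}
    for p_cluster, token_scores in zip(cluster_probs, token_scores_by_cluster):
        tokens = list(token_scores)
        within = softmax([token_scores[t] for t in tokens])
        for token, p_within in zip(tokens, within):
            token_probs[token] = p_cluster * p_within
    return token_probs


# Hypothetical clusters that group similar tokens from different languages.
clusters = [
    {"ka_lang1": 1.2, "ka_lang2": 1.0},
    {"mi_lang1": 0.3, "mi_lang2": 0.1, "mi_lang3": -0.4},
]
probs = two_level_softmax([0.8, -0.2], clusters)
total = sum(probs.values())
```

Because similar tokens share a cluster, they also share the cluster-level decision, which is one way tokens from high-resource languages could lend signal to low-resource ones.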

2506.16738 2026-06-16 cs.CL cs.AI cs.SD eess.AS 版本更新

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

LM-SPT:面向语音标记化的LM对齐语义蒸馏

Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

发表机构 * Department of Artificial Intelligence, Korea University(韩国大学人工智能系) Multi-modal Model Training, Kakao Corp(Kakao公司多模态模型训练部)

AI总结 提出LM-SPT方法,通过语义语音重合成蒸馏,在不降低帧率的情况下生成与语言模型更对齐的离散语音标记,在ASR和TTS任务中优于现有方法。

详情
AI中文摘要

随着语音语言模型(SLM)的快速发展,离散语音标记已成为语音和文本之间的核心接口,实现了跨模态的统一建模。最近的语音标记化方法旨在从低级声学中分离语义信息,以更好地与语言模型(LM)对齐。特别是,以前的方法使用自监督学习(SSL)教师模型(如HuBERT)提取语义表示,然后将其蒸馏到语义量化器中,以抑制声学冗余并捕获与内容相关的潜在结构。然而,这些标记器通常以相对较高的帧率运行,产生的标记序列明显长于其文本对应物,阻碍了与预训练LM的无缝集成。尽管最近的方法尝试通过对SSL特征应用均匀平均池化来降低标记率,但这可能会过度平滑包含内容的区域并稀释结构信息,从而可能限制LM对齐。为了解决这个问题,我们提出了LM-SPT,一种基于语义语音重合成蒸馏的LM对齐语音标记化方法。LM-SPT不是通过池化直接匹配教师和学生特征,而是仅从语义标记重合成语音,并使用冻结的、LM对齐的语音编码器最小化从原始波形和重合成波形提取的表示之间的差异。这种间接监督避免了严格的时间对齐,并鼓励在降低帧率下与LM更语义对齐的专用语义单元。实验结果表明,在自动语音识别和文本到语音任务中,即使在不损害编解码器级别的语音重建保真度的情况下,所提出的LM-SPT在应用于SLM时也始终优于先前的语义增强语音标记器。

英文摘要

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models (LMs). In particular, previous methods use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, these tokenizers often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs. Although recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, this can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment. To address this, we propose LM-SPT, an LM-aligned speech tokenization method based on semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only and minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder. This indirect supervision avoids rigid temporal alignment and encourages dedicated semantic units that are more semantically aligned with LMs under reduced frame rates. Experimental results show that the proposed LM-SPT consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level.

2510.07096 2026-06-16 cs.CL cs.SD eess.AS 版本更新

Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework

建模讽刺语音:语音合成框架中的语义和韵律线索

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

发表机构 * Speech Technology Lab, University of Groningen, Campus Fryslân, The Netherlands(格罗宁根大学弗里赛兰校区语音技术实验室,荷兰) Center for Language and Cognition, University of Groningen, The Netherlands(格罗宁根大学语言与认知中心,荷兰)

AI总结 提出一个计算框架,通过整合语义和韵律线索建模讽刺,使用微调LLaMA 3模型提取语义线索,从讽刺语音数据库提取韵律线索,语音合成测试表明两者结合能增强讽刺感知。

Comments Accepted to CogSci 2026

详情
AI中文摘要

讽刺是一种语用现象,说话者传达与字面内容不同的含义,依赖于语义和韵律表达之间的相互作用。然而,这些线索如何共同促进讽刺的识别仍知之甚少。我们提出了一个计算框架,将讽刺建模为语义解释和韵律实现的整合。语义线索来自微调的LLaMA 3模型,该模型捕捉讽刺意图的话语层面标记,而韵律线索通过从讽刺语音数据库中提取的语义对齐话语获得,提供讽刺表达的韵律范例。使用语音合成测试平台,感知评估表明语义和韵律线索增强了感知到的讽刺,组合系统在保持高主观讽刺评分的同时实现了最佳下游F1。这些发现强调了语义和韵律在语用解释中的互补作用,并说明了建模如何揭示讽刺交流背后的机制。

英文摘要

Sarcasm is a pragmatic phenomenon in which speakers convey meanings that diverge from literal content, relying on an interaction between semantics and prosodic expression. However, how these cues jointly contribute to the recognition of sarcasm remains poorly understood. We propose a computational framework that models sarcasm as the integration of semantic interpretation and prosodic realization. Semantic cues are derived from a LLaMA 3 model fine-tuned to capture discourse-level markers of sarcastic intent, while prosodic cues are extracted through semantically aligned utterances drawn from a database of sarcastic speech, providing prosodic exemplars of sarcastic delivery. Using a speech synthesis testbed, perceptual evaluations show that semantic and prosodic cues enhance perceived sarcasm, with the combined system achieving the best downstream F1 while maintaining high subjective sarcasm ratings. These findings highlight the complementary roles of semantics and prosody in pragmatic interpretation and illustrate how modeling can shed light on the mechanisms underlying sarcastic communication.

2603.05299 2026-06-16 cs.LG cs.AI cs.CL cs.SD 版本更新

WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

WavSLM: 通过WavLM蒸馏的单流语音语言建模

Luca Della Libera, Cem Subakan, Mirco Ravanelli

发表机构 * Concordia University(康科迪亚大学) Mila-Quebec AI Institute(Mila — 魁北克人工智能研究所) Université Laval(拉瓦尔大学)

AI总结 提出WavSLM,通过量化蒸馏WavLM自监督表示到单一码本并优化自回归下一块预测,实现无文本监督的单流语音语言建模,在一致性和生成任务上表现竞争。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

大型语言模型表明,简单的自回归训练可以产生可扩展且连贯的生成,但由于语义和声学信息的纠缠,将这一范式扩展到语音仍然具有挑战性。大多数现有的语音语言模型依赖于文本监督、分层令牌流或复杂的混合架构,偏离了在文本中已被证明有效的单流生成预训练范式。在这项工作中,我们引入了WavSLM,一种通过将自监督WavLM表示量化和蒸馏到单一码本中,并优化自回归下一块预测目标来训练的语音语言模型。WavSLM在单个令牌流中联合建模语义和声学信息,无需文本监督或文本预训练。尽管其简单性,它在一致性基准和语音生成方面取得了有竞争力的性能,同时使用更少的参数、更少的训练数据,并支持流式推理。

英文摘要

Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters and less training data and while supporting streaming inference.

9. 评测、数据集与基准 35 篇

2606.14867 2026-06-16 cs.CL cs.AI cs.LG 新提交

Evaluating the Robustness of Proof Autoformalization in Lean 4

评估 Lean 4 中证明自动形式化的鲁棒性

Zhengtao Gui, Sheng Yang, Zhouxing Shi

发表机构 * University of California, Irvine(加州大学尔湾分校) University of California, Riverside(加州大学河滨分校)

AI总结 研究证明自动形式化模型在全局和局部扰动下的鲁棒性,发现现有模型对全局扰动敏感且多数无法忠实反映局部扰动。

Comments Preprint

详情
AI中文摘要

证明自动形式化旨在将用自然语言编写的数学非正式证明翻译成形式语言(如 Lean 4)中的形式证明。已有几项工作开发了基于 LLM 的证明自动形式化模型。然而,现有评估通常侧重于翻译来自精选数据集的规范非正式证明。我们认为,一个鲁棒的证明自动形式化器必须即使对于偏离这些理想化形式的非正式证明也能保持忠实,并提出了首个关于证明自动形式化模型鲁棒性的研究。我们制定了两类扰动并评估每种扰动下的鲁棒性:全局扰动以不同风格改写非正式证明,在此情况下形式化应保持一致;局部扰动改变一个值、符号或证明步骤,可能是反事实的方式,鲁棒的形式化应忠实地反映扰动,而不是自行恢复为原始形式或推断出不同的形式。我们在 miniF2F 和 MATH-500 上构建了包含两种扰动的基准,并自动衡量证明自动形式化在全局扰动下正确性的稳定程度,以及其输出在局部扰动下的忠实程度。我们评估了七个最新模型,所有模型都对全局扰动敏感,且大多数在局部扰动下无法保持忠实。代码和数据可通过 https://github.com/ucr-rai/robust-proof-autoformalization 获取。

英文摘要

Proof autoformalization aims to translate an informal mathematical proof written in natural language into a formal proof in a formal language such as Lean 4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via https://github.com/ucr-rai/robust-proof-autoformalization.

2606.15037 2026-06-16 cs.CL cs.CV 新提交

ReportQA: QA-Based Radiology Report Evaluation

ReportQA: 基于问答的放射学报告评估

Yiming Shi, Shaoshuai Yang, Xi Chen, Haolin Li, Hengyu Zhang, Che Jiang, Kaiwen Wang, Xun Zhu, Dong Xie, Fei Wang, Dejing Dou, Miao Li, Ji Wu

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) College of AI, Tsinghua University(清华大学人工智能学院) Beijing National Research Center for Information Science and Technology(北京信息科学与技术国家研究中心) Beijing Electronic Digital & Intelligence(北京电子数字与智能)

AI总结 提出ReportQA框架,利用知识树和LLM从报告中提取结构化信息生成QA对,以问答准确率作为评估指标,比现有指标更符合放射科医生判断。

详情
AI中文摘要

放射学报告评估对于推进自动报告生成至关重要。自然语言生成指标具有有限的临床相关性。临床效能(CE)指标评估重要的医学发现,但主要关注存在性且仅覆盖有限的实体集。由于严重依赖人工标注,CE指标难以扩展临床实体或属性。在临床实践中,放射学报告作为信息传递的媒介。临床医生使用它们执行下游诊断任务,而无需直接检查图像。基于这一见解,我们提出了ReportQA,一个临床相关且灵活的放射学报告评估框架,支持对放射学报告生成系统进行详细的定量分析。我们首先收集涵盖多种成像模态和解剖区域的数据集。然后,在放射科医生的指导下构建临床实体和属性的知识树,并使用大型语言模型(LLM)从原始报告中提取结构化信息。接下来,我们从预定义模板生成QA对,并通过自过滤和基于报告的过滤进行质量控制。在评估期间,将报告视为上下文,LLM作为评判模型来回答QA对。基于得到的QA准确率,我们引入了QAScore指标。与现有指标相比,QAScore显示出与放射科医生判断更好的对齐。在多个最先进的视觉-语言模型上的实验表明,当前基于报告的推理范式难以学习细粒度的临床表示,并表现出强烈的负先验偏差。相比之下,问题驱动的推理提供了一种更有效的替代方案。为了可重复性和可扩展性,我们发布了知识树、结构化报告和QA对,以及用于QA构建和评估的流水线代码。

英文摘要

Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Due to heavy reliance on manual annotations, it is difficult for CE metrics to extend clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework, supporting detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.

2606.15144 2026-06-16 cs.CL cs.AI 新提交

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

PACUTE: 面向菲律宾语的音韵、词缀和字符级词元理解

Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa

发表机构 * AI Singapore(AI新加坡) Nanyang Technological University(南洋理工大学) UK AI Security Institute(英国人工智能安全研究所) Ateneo de Manila University(马尼拉雅典耀大学) University of Birmingham(伯明翰大学)

AI总结 提出PACUTE基准,包含4600个任务,通过六层诊断框架评估大语言模型在菲律宾语中的形态理解,发现开放权重模型在语素分解上接近随机,前沿模型在组合任务上远低于字符级上限。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

大型语言模型(LLMs)将文本处理为子词词元序列,这掩盖了构成词形成的字符级和形态结构。对于具有非连接形态的语言,这种限制最为严重,标准分词器系统性地使词元边界与语素边界错位。我们引入PACUTE,一个包含4600个任务的诊断基准,旨在评估菲律宾语中的形态理解,菲律宾语以能产的中缀、重叠和变音符号驱动的词汇区分(通常不在书面文本中出现)为特征。PACUTE包括一个六层组合诊断框架,用于定位形态理解在何处崩溃。评估开放权重LLMs和前沿商业模型,我们发现开放权重模型在语素分解上无论规模大小都接近随机。前沿模型表现更好,通常在包含匹配评分下能恢复单个词缀,但在语素变换和音节划分的组合任务上仍远低于其字符级上限。这些结果表明,能产的形态组合(而非仅字符访问)是菲律宾语词汇结构理解的持续瓶颈。

英文摘要

Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

2606.15191 2026-06-16 cs.CL 新提交

AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

AmchiBias:基于英语和孔卡尼语的最小对数据集测量果阿身份群体的刻板偏见

Michelle Barbosa, Sebastian Padó, Franziska Weeber

发表机构 * Institute for Natural Language Processing, University of Stuttgart(斯图加特大学自然语言处理研究所)

AI总结 提出AmchiBias基准,通过313个最小对评估多语言编码器对果阿身份群体的刻板偏见,发现模型在孔卡尼语上表现接近随机,英语查询反映泛印度偏见而非本地文化知识。

Comments The 1st Workshop on Stereotypes Across Cultures in Language Technologies

详情
AI中文摘要

社会文化刻板偏见是NLP系统开发和部署中的重要考虑因素。然而,尽管存在丰富的次国家级社会文化结构,偏见通常仅在国家层面被考虑。我们提出AmchiBias,这是首个针对具有独特历史多元文化背景的印度果阿邦测量社会文化刻板偏见的基准。它涵盖各种果阿身份群体,包括英语和天城文孔卡尼语中八个社会人口维度的313个最小对。然后,我们在此基准上评估五个多语言编码器模型中的刻板偏见。我们发现模型在孔卡尼语上的得分接近随机,反映了通用多语言模型的语言能力不足以及印度语言模型缺乏果阿文化能力。当用英语查询时,具有更强印度语言覆盖的模型对泛印度群体表现出比超本地果阿群体更高的偏见。这表明英语信号反映了泛印度预训练关联,而非真正的果阿文化知识。我们的发现突显了低资源多语言NLP评估中超本地社区身份的关键空白。

英文摘要

Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is, however, often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa, with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. Queried in English, models with stronger Indian language coverage show higher bias for pan-Indian groups than for hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.

2606.15610 2026-06-16 cs.CL astro-ph.IM cs.AI cs.LG 新提交

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

LLM 裁判具有暗电流:LLM 作为裁判评估的心理测量数据表

Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, Naohiko Matsuda

发表机构 * Chubu University(中部大学) Mitsubishi Heavy Industries, Ltd., Research & Innovation Center(三菱重工业株式会社研究创新中心)

AI总结 提出裁判数据表协议,通过真空输入、表面变异、位置偏好等指标测量 LLM 裁判的暗电流和偏差,揭示其测量特性。

Comments 22 pages, 4 figures

详情
AI中文摘要

LLM 作为裁判的系统现在常规用于开放式模型评估,其中人类偏好标注成本高、速度慢且难以复现。然而,这些裁判通常被报告为标量准确率、胜率或一致性指标。我们认为,裁判应被报告为测量仪器。我们引入了一个裁判数据表协议,该协议测量在真实真空输入下的暗电流、对相同质量表面变化的稳定交叉敏感性、位置虚假偏好、在受控质量阶梯上的目标敏感性,以及由平局指令引发的标准或操作点。方向-稳定性分解揭示,明显的 Delta0 偏好可能是稳定的表面响应或伪装的位置偏差。在一个三裁判开放权重案例研究中,Llama-3.1-8B 显示出高暗电流和呈现冲突的 Delta0 行为,Qwen2.5-14B 是真空清洁且对目标敏感,但混合了稳定和位置过度判别,而 Qwen2.5-32B 是真空清洁,具有低稳定交叉敏感性和低位置虚假偏好。严格的平局标准消除了 Qwen32B 的 Delta0 虚假偏好,但将边缘 Delta1 目标信号吸收为平局,同时保留了 Delta5 敏感性。结果表明,提示移动的是标准,而不是分辨率。我们并不声称激发这项工作的下游机制假设已得到确认;贡献是在做出下游声明之前测量测量仪器的计量协议。

英文摘要

LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent Delta0 preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.
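One way to read the distinction between a stable surface response and disguised position bias is to query the judge twice per pair, once in each presentation order, and check whether it follows the item or the slot. The Python sketch below illustrates that reading only; it is not the paper's Judge Datasheet protocol, and the labels, function names, and input encoding are hypothetical.

```python
from collections import Counter


def classify_pair(choice_ab, choice_ba):
    """Classify one item pair judged in both presentation orders.

    choice_ab: 'first', 'second', or 'tie' when item A is shown first.
    choice_ba: 'first', 'second', or 'tie' when item B is shown first.
    """
    if choice_ab == "tie" and choice_ba == "tie":
        return "tie"
    if "tie" in (choice_ab, choice_ba):
        return "mixed"
    picked_ab = "A" if choice_ab == "first" else "B"
    picked_ba = "B" if choice_ba == "first" else "A"
    # Same item in both orders: the preference follows content or surface form.
    # Different items in both orders: the judge followed the slot instead.
    return "stable" if picked_ab == picked_ba else "positional"


def decompose(pairs):
    """Return the share of each outcome over a list of (choice_ab, choice_ba)."""
    if not pairs:
        raise ValueError("need at least one judged pair")
    counts = Counter(classify_pair(ab, ba) for ab, ba in pairs)
    labels = ("stable", "positional", "mixed", "tie")
    return {label: counts[label] / len(pairs) for label in labels}


# Hypothetical judgments on four equal-quality pairs.
judgments = [
    ("first", "second"),  # picks A both times
    ("first", "first"),   # picks whatever is shown first
    ("tie", "tie"),
    ("second", "tie"),
]
shares = decompose(judgments)
```

On equal-quality pairs, a high "stable" share would point to sensitivity to surface variation, while a high "positional" share would point to slot preference.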

2606.15643 2026-06-16 cs.CL 新提交

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

扩展项目反应理论以实现高效且有意义的多语言评估

Gili Lior, Tzviel Frostig, Gabriel Stanovsky, Matan Eyal

发表机构 * Google Research(谷歌研究) The Hebrew University of Jerusalem(耶路撒冷希伯来大学) PhaseV Trials

AI总结 提出Multilingual-IRT框架,通过引入每语言难度偏差、分离内容与语言效应的区分度及每语言能力残差,解决多语言基准测试中的线性扩展、翻译错误和文化特定知识混淆问题,在MMLU-Pro-X上实现更优的预测和错误检测。

详情
AI中文摘要

多语言基准测试对于评估跨语言的大语言模型(LLMs)至关重要,但它们存在三个问题:详尽评估随语言数量线性增长,自动翻译引入的错误在大规模下容易被忽略,以及某些项目混淆了通用知识和文化特定知识。我们通过一个统一的统计框架Multilingual-IRT来解决这三个问题,该框架扩展了项目反应理论,引入了每语言难度偏差、分离内容与语言效应的区分度以及每语言能力残差。在MMLU-Pro-X的29种语言上对25个LLM拟合Multilingual-IRT,我们表明其拟合参数支持三种实际应用:预测未观察到的(项目、LLM、语言)实例,其二元交叉熵比最强的基于准确率的基线低11-16%;发现分布在所有28种非英语语言中的候选翻译错误,而基于准确率的基线将检测集中在少数语言上;以及恢复基于准确率的基线遗漏的文化特定项目。

英文摘要

Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory with per-language difficulty deviations, split discriminability separating content from language effects, and per-language ability residuals. Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: predicting unobserved (item, LLM, language) instances with 11-16% lower binary cross-entropy than the strongest accuracy-based baseline; surfacing candidate translation errors distributed across all 28 non-English languages, whereas accuracy-based baselines concentrate detections in a few languages; and recovering culture-specific items that accuracy-based baselines miss.
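A simplified response function shows how per-language terms can enter an IRT model. The Python sketch below assumes a standard two-parameter logistic form with an added per-language difficulty deviation and per-language ability residual. It uses a single discrimination parameter rather than the split discriminability the abstract mentions, and every name and number is hypothetical rather than taken from the paper.

```python
import math


def p_correct(ability, ability_residual, difficulty, difficulty_deviation,
              discrimination=1.0):
    """Probability that a model answers an item correctly in one language.

    ability: language-independent ability of the model.
    ability_residual: per-language adjustment to that ability.
    difficulty: language-independent difficulty of the item.
    difficulty_deviation: per-language shift, e.g. from a translation error.
    """
    logit = discrimination * (
        (ability + ability_residual) - (difficulty + difficulty_deviation)
    )
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(p, y):
    """Loss for a predicted probability p against an observed outcome y."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# The same item and model, in English and in a hypothetical language where
# the item is harder and the model slightly weaker.
p_english = p_correct(0.5, 0.0, 0.2, 0.0)
p_other = p_correct(0.5, -0.1, 0.2, 0.8)
loss_if_wrong = binary_cross_entropy(p_other, 0)
```

Under this reading, an item with an unusually large difficulty deviation in one language would be a candidate for a translation error, and held-out (item, LLM, language) cells can be scored with the cross-entropy above.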

2606.15949 2026-06-16 cs.CL 新提交

FinBalance: A Multi-Document Accounting Reconciliation Benchmark

FinBalance:多文档会计对账基准

Sasank Tumpati, Devansh Agarwal, Ayush Kedia, Arjun Neekhra, Murari Mandal, Krishna Garg, Yash Sinha, Suman Gupta, Dhruv Kumar

发表机构 * BITS Pilani(比拉理工学院皮拉尼校区) KIIT Bhubaneswar(KIIT布巴内斯瓦尔) University of Oxford(牛津大学)

AI总结 提出FinBalance基准,通过多行业源文档构建会计对账任务,评估LLM在生成资产负债表和检测不一致性上的表现,发现模型在文档绑定和一致性聚合上存在显著差距。

Comments 18 pages, 12 figures. Code and data: https://github.com/Devansh1105/finbalance

详情
AI中文摘要

现有的金融NLP基准主要评估已准备好的工件,如申报文件、表格或提取的值。真正的会计工作更早开始:源文档必须被对账到引用的日记账分录中,汇总到资产负债表,并检查矛盾。我们引入了FinBalance,一个多文档会计对账基准,由来自八个行业、三种期间类型和五个难度级别的源文档包构建。人工编写的业务场景、会计政策、税务/外汇处理、文档模式、干扰项和不一致性模板由确定性生成器组合,其分类账产生日记账分录、资产负债表和23个不一致性代码标签。在710条记录的评估分割上,六个当代LLM的精确最终资产负债表准确率最高仅为46%。四个模型在BS_exact(模型报告的资产负债表)和BS_recon(通过重放其分录到我们的分类账获得的资产负债表)之间显示出26-41个百分点的差距。模型通常恢复数值上合理的分录,但未能将其绑定到支持文档并一致地聚合。引用压力提示几乎不改变文档链接错误,而分类账反馈消融显著改善了报告的资产负债表并暴露了不一致性检测的权衡。专家财务评审人员验证了基准设计和标签。

英文摘要

Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a balance sheet, and checked for contradictions. We introduce FinBalance, a multi-document accounting reconciliation benchmark built from source-document bundles across eight industries, three period types, and five difficulty levels. Human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates are composed by a deterministic generator whose ledger produces journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, six contemporary LLMs reach at most 46% exact final-balance-sheet accuracy. Four models show a 26-41 pp gap between BS_exact, the model's reported balance sheet, and BS_recon, the balance sheet obtained by replaying its entries through our ledger. Models often recover numerically plausible entries but fail to bind them to supporting documents and aggregate them consistently. Citation-pressure prompting barely changes document-linking errors, while ledger-feedback ablations substantially improve reported balance sheets and expose inconsistency-detection trade-offs. Expert finance reviewers validate the benchmark design and labels.
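The gap between a reported balance sheet and one obtained by replaying journal entries can be illustrated with a toy ledger. The Python sketch below is a hypothetical illustration, not the benchmark's ledger: the chart of accounts, the entry format, and the function names are assumptions, and income and expense accounts are left out for brevity.

```python
# Hypothetical chart of accounts mapping each account to its balance-sheet side.
ACCOUNT_TYPES = {
    "Cash": "asset",
    "Inventory": "asset",
    "Accounts Payable": "liability",
    "Share Capital": "equity",
}


def replay(entries):
    """Replay double-entry journal entries and return balance-sheet totals."""
    totals = {"asset": 0, "liability": 0, "equity": 0}
    for entry in entries:
        debit_sum = sum(amount for _, amount in entry["debits"])
        credit_sum = sum(amount for _, amount in entry["credits"])
        if debit_sum != credit_sum:
            raise ValueError(f"unbalanced entry: {entry}")
        for account, amount in entry["debits"]:
            kind = ACCOUNT_TYPES[account]
            # Debits increase assets and decrease liabilities and equity.
            totals[kind] += amount if kind == "asset" else -amount
        for account, amount in entry["credits"]:
            kind = ACCOUNT_TYPES[account]
            # Credits decrease assets and increase liabilities and equity.
            totals[kind] += -amount if kind == "asset" else amount
    return totals


def matches(reported, replayed):
    """True when the reported totals equal the replayed totals exactly."""
    return all(reported.get(kind) == value for kind, value in replayed.items())


entries = [
    {"debits": [("Cash", 1000)], "credits": [("Share Capital", 1000)]},
    {"debits": [("Inventory", 300)], "credits": [("Accounts Payable", 300)]},
    {"debits": [("Accounts Payable", 100)], "credits": [("Cash", 100)]},
]
replayed = replay(entries)

# A made-up reported balance sheet that overstates assets by 100.
reported = {"asset": 1300, "liability": 200, "equity": 1000}
agrees = matches(reported, replayed)
```

Here the entries themselves are sound and the replayed totals satisfy assets = liabilities + equity, yet the reported sheet disagrees with them, which is the kind of aggregation failure the two metrics are designed to separate.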

2606.15974 2026-06-16 cs.CL 新提交

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

面向对话摘要的大规模多维度LLM实证研究

Weixiao Zhou, Gengyao Li, Xianfu Cheng, Junnan Zhu, Feifei Zhai, Zhoujun Li

发表机构 * CCSE, Beihang University(北京航空航天大学CCSE) MAIS, CASIA(中国科学院自动化研究所MAIS) Fanyu AI Laboratory(帆禹AI实验室)

AI总结 提出OmniCSEval基准,包含1800个跨六场景的对话,评估28个LLM在对话摘要中的完整性、简洁性和忠实性,揭示推理能力与模型规模的影响。

Comments 21 pages, 18 figures

详情
AI中文摘要

尽管LLMs在对话摘要方面取得了显著进展,但其评估仍受限于场景不足、输入长度和样本量有限。此外,现有基准往往忽略前沿推理系统和高效小模型,或缺乏细粒度、多维度的评估。为弥补这些不足,我们提出OmniCSEval,一个统一基准,包含1800个跨六个真实场景的多样化对话,上下文长度从128到32k tokens。为进行细粒度评估,我们采用双向事实核查框架,结合关键事实匹配评估完整性和简洁性,以及摘要事实验证评估忠实性。为确保可靠评估,我们建立了人机协作的关键事实提取流程和多LLM共识验证器用于摘要事实分解。利用该框架,我们评估了28个LLM,按推理能力和模型规模分为四个不同类别。我们的大规模实证研究揭示了当前LLMs在跨场景挑战、推理与规模的影响以及推理模型的效率与适应性方面的关键见解。我们还为实际部署中的系统选择提供了指导。

英文摘要

Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning systems and efficient small models, or lack fine-grained, multi-dimensional assessments. To bridge these gaps, we propose OmniCSEval, a unified benchmark comprising 1,800 diverse conversations across six real-world scenarios, featuring context lengths ranging from 128 to 32k tokens. For fine-grained evaluation, we employ a bidirectional fact-checking framework that integrates key fact matching to assess completeness and conciseness, alongside summary fact verification to evaluate faithfulness. To ensure reliable assessment, we establish a human-LLM collaborative pipeline for key fact extraction and a multi-LLM consensus verifier for summary fact decomposition. Leveraging this framework, we evaluate 28 LLMs across four distinct categories grouped by reasoning capability and model scale. Our extensive empirical study reveals critical insights regarding the cross-scenario challenges current LLMs continue to face, the impacts of reasoning and scale, and the efficiency and adaptability of reasoning models. We also provide guidance for system selection in real-world deployments.

2606.16151 2026-06-16 cs.CL 新提交

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

GRACE:基于上下文忠实推理的步骤级基准

Hoang Pham, Dong Le, Anh Tuan Luu

发表机构 * Nanyang Technological University(南洋理工大学) VinUniversity

AI总结 提出GRACE,首个带人工标注的步骤级忠实性基准,通过数据驱动错误分类法评估上下文推理中的链式思维步骤,并证明步骤级忠实信号可提升下游准确性和推理可靠性。

详情
AI中文摘要

许多推理任务要求模型基于输入上下文进行推理,从文档问答到基于规则的演绎。链式思维提示产生的轨迹看似透明,但单个步骤可能悄然偏离源证据,即使最终答案正确。现有方法在响应级别检测幻觉,但无法识别链中失败的位置或类型。我们引入GRACE,这是首个带人工标注的步骤级忠实性基准,具有数据驱动的错误分类法,用于基于上下文的文本推理。GRACE涵盖来自4个源数据集的10个模型的CoT轨迹,每个步骤都标注了忠实性、错误类别和自然语言解释。通过无监督聚类自底向上发现的数据驱动分类法将失败分为两个轨道:GRACE-Inference(演绎错误)和GRACE-Grounding(事实基础错误),每个轨道包含四个类别。评估集是人工标注的,且设计上具有挑战性。我们的实验揭示了当前模型存在巨大的改进空间。此外,将步骤级忠实性信号集成到强化学习管道中可提高下游准确性和推理可靠性。

英文摘要

Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

2606.16211 2026-06-16 cs.CL 新提交

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

编织多源证据进行生物医学推理:BioMedHop基准与BioWeave框架

Xingyu Tan, Shiyuan Liu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang

发表机构 * University of New South Wales(新南威尔士大学) CSIRO(澳大利亚联邦科学与工业研究组织) University of Technology Sydney(悉尼科技大学)

AI总结 提出BioMedHop基准和BioWeave框架,用于评估和实现生物医学多源证据推理,BioWeave在基准上优于基线方法10.5%,并提升小模型性能。

详情
AI中文摘要

生物医学问答(QA)日益需要对交互实体进行推理,其中支持证据分散在生物医学知识图谱、文献文档和网络可访问资源中。然而,现有的生物医学QA基准主要关注考试式知识、文献理解或短程多跳推理,而源条件图推理和证据拓扑构建尚未充分探索。为填补这一空白,我们引入了BioMedHop,一个多源图基基准,用于评估结构化证据拓扑上的生物医学推理。BioMedHop包含10,045个实例,涵盖知识图谱、文档、网络和混合证据设置,包括共享邻居匹配、交集推理、基于路径的推理和计数,并提供选项式、开放式和数值计数形式。为支持该基准,我们进一步提出了BioWeave,一个源感知推理框架,该框架检索生物医学知识图谱路径,从文档和网络来源收集支持线索,将其组装成统一的证据图,并通过实体级证据支持验证答案。综合实验表明,在BioMedHop上,BioWeave在比较方法中实现了最佳整体性能,在总体平均值上比强混合基线ToG-2高出10.5%。此外,BioWeave一致地改进了不同的大语言模型骨干,并使较小的模型(如Qwen3-4B)能够达到与GPT-4-Turbo相当的推理性能。

英文摘要

Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However, existing biomedical QA benchmarks mainly focus on exam-style knowledge, literature comprehension, or short-range multi-hop inference, leaving source-conditioned graph reasoning and evidence topology construction underexplored. To fill this gap, we introduce BioMedHop, a multi-source graph-grounded benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances across KG, document, web, and hybrid evidence settings, covering shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with option-based, open-ended, and numeric count renderings. To support this benchmark, we further propose BioWeave, a source-aware reasoning framework that retrieves biomedical KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that BioWeave achieves the best overall performance among compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% in the overall average. Moreover, BioWeave consistently improves different LLM backbones and enables smaller models, such as Qwen3-4B, to achieve reasoning performance comparable to GPT-4-Turbo.

2606.16351 2026-06-16 cs.CL 新提交

TMASC: Transmasculine Attitude and Speech Corpus

TMASC:跨男性态度与语音语料库

Sidney Wong

发表机构 * Centre for Sustainability Research, University of Otago(奥塔哥大学可持续发展研究中心) Te Pūnaha Matatini Centre of Research Excellence for Complex Systems(Te Pūnaha Matatini复杂系统卓越研究中心)

AI总结 介绍一个包含196名跨男性个体的多模态语料库,包括问卷和66份录音,用于支持跨男性个体研究。

Comments Accepted to Interspeech 2026 Main Track

详情
AI中文摘要

我们介绍了跨男性态度与语音语料库(TMASC),这是一个包含196名跨男性个体的多模态语料库,包括问卷回答和66份录音。问卷包含探索跨男性个体声音健康的问题。录音包括咳嗽和清嗓样本、一段阅读文章以及额外的特定会话问题。本文概述了该语料库的开发过程和数据收集程序。为了说明该语料库的实用性,我们展示了三个案例研究,演示了如何使用这个众包多模态语料库来支持跨男性个体。这些案例包括感知和声学数据的整合、群体层面特征的识别以及声学测量的校准。

英文摘要

We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the vocal health of transmasculine individuals. The audio recordings include cough and throat-clearing samples, a reading passage, and additional session-specific questions. This paper outlines the development of this corpus and the data collection procedures. To illustrate the utility of this corpus, we present three case studies demonstrating how this crowd-sourced multimodal corpus can be used to support transmasculine individuals. These include the integration of perceptual and acoustic data, the identification of group-level characteristics, and the calibration of acoustic measurements.

2606.16368 2026-06-16 cs.CL cs.LG 新提交

Evaluating LLM Personalization via Semantic Constraint Verification

通过语义约束验证评估LLM个性化

Xuran Li, Guanqin Zhang, Imran Razzak, Hakim Hacid, Eleanna Kafeza, Hao Xue, Flora D. Salim

发表机构 * University of New South Wales(新南威尔士大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) The Technology Innovation Institute(技术创新研究所) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出NLICV框架,利用自然语言推理模型将句子映射到真值条件集,验证个性化约束,将LLM行为分为四类,与人类标注高度一致,并大幅降低延迟和成本。

详情
AI中文摘要

当前大型语言模型(LLM)个性化的评估范式严重依赖于脆弱的表面匹配指标或计算成本高昂的LLM作为评判者的协议,两者都缺乏可解释性。为了解决这些局限性,我们引入了自然语言推理约束验证(NLICV),这是一个可扩展的、语义不变的框架,它将句子含义映射到真值条件集,通过自然语言推理(NLI)模型验证个性化约束。超越二元评分,NLICV将LLM行为分为四种不同模式:个性化、泛化、谄媚和失败。大量实验表明,NLICV与人工标注高度一致,同时大幅降低了与LLM评判者相关的延迟和令牌成本(高达2100倍推理加速)。最后,通过基于消融的程序,NLICV精确定位驱动约束验证的准确句子,为其评估提供忠实、可理解的证据。

英文摘要

Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inference Constraint Verification (NLICV), a scalable, semantically invariant framework that maps sentence meanings to truth-condition sets to verify personalization constraints via a Natural Language Inference (NLI) model. Moving beyond binary scoring, NLICV categorizes LLM behaviors into four distinct modes: personalization, generalization, sycophancy, and failure. Extensive experiments demonstrate that NLICV aligns closely with human annotations while drastically reducing the latency and token costs associated with LLM judges (up to 2100× inference speedup). Finally, through an ablation-based procedure, NLICV pinpoints the exact sentences driving the constraint verification, yielding faithful, understandable evidence for its evaluations.

2606.16560 2026-06-16 cs.CL 新提交

The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage

BD-LSC数据集:促进俚语与标准用法中词汇语义变化检测模型的基准测试

Afnan Aloraini, Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro

发表机构 * The University of Manchester(曼彻斯特大学) Qassim University(卡西姆大学)

AI总结 针对词汇语义双向变化及俚语与标准用法混合的挑战,构建BD-LSC和ST-WSD两个基准数据集,评估多种方法,发现稀有俚语义项仍是核心难题。

详情
AI中文摘要

自动语义变化检测旨在识别词义随时间的变化,为语言和社会变迁提供见解。尽管计算词汇语义变化(LSC)近期取得进展,现有基准和方法难以捕捉双向语义变化,特别是词汇同时获得和失去义项的情况。对于兼具俚语和标准用法的词汇,这一问题尤为棘手。为填补这些空白,我们引入两个互补的基准数据集。双向词汇语义变化(BD-LSC)数据集捕捉三个时间段的义项获得、义项丢失和稳定性,支持复杂语义轨迹的研究。SlangTrack词义消歧(ST-WSD)数据集为结合俚语和标准用法的词汇提供细粒度的实例级义项标注,支持WSD和语义变化检测模型的系统基准测试。利用这些基准,我们系统评估了不同方法论家族的模型:使用上下文嵌入的无监督聚类、监督机器学习、基于Transformer的模型以及最先进的大语言模型。在评估的系统中,少样本GPT-4o模型在精确义项匹配(ESM)和多标签准确率上取得了最强的综合性能;然而,所有系统的Macro-F1分数接近0.5,表明稀有俚语义项仍然困难,我们将其确定为核心开放挑战。

英文摘要

Automatic semantic change detection aims to identify how word meanings shift over time, offering insights into both linguistic and societal change. Despite recent progress in computational lexical semantic change (LSC), existing benchmarks and methods struggle to capture bi-directional semantic change, particularly cases where words simultaneously gain and lose senses. This problem is especially challenging for words that have both slang and standard meanings. To address these gaps, we introduce two complementary benchmark datasets. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, sense loss, and stability across three time periods, enabling the study of complex semantic trajectories. The SlangTrack Word Sense Disambiguation (ST-WSD) dataset provides fine-grained, instance-level sense annotations for words combining slang and standard usages, supporting systematic benchmarking of WSD and semantic change detection models. Using these benchmarks, we systematically evaluate models across different methodological families: unsupervised clustering using contextualised embeddings, supervised machine learning, transformer-based models, and state-of-the-art large language models. Among the evaluated systems, the few-shot GPT-4o model achieved the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy; however, Macro-F1 scores near 0.5 across all systems show that rare slang senses remain difficult, which we identify as the central open challenge.

2606.16659 2026-06-16 cs.CL 新提交

FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection

FraudSMSWalker: 用于短信到网页欺诈检测的智能体大语言模型基准测试

Y. H. Zhou, Z. M. Ma, Y. J. Zhou, Y. T. Li, H. X. Xiang, Y. M. Cheng, T. L. Chen, K. J. Zhang, Z. H. Nan, J. H. Ni, Z. Wu, Q. Y. Pan, S. Zhang, S. Cheng, M. Y. Luo

发表机构 * Baimaohui(白猫汇) PPSUC(中国人民公安大学)

AI总结 提出FraudSMSWalker基准,通过屏蔽URL的短信-网页对评估智能体大语言模型在跨渠道欺诈检测中的证据推理能力,发现模型能检测可疑线索但难以保持良性召回。

详情
AI中文摘要

短信欺诈日益跨渠道:一条消息引导用户访问网页,最终风险取决于短信声明与页面内容及请求用户操作的一致性。然而,现有评估要么专注于仅消息的钓鱼短信分类,要么暴露URL和域名线索,使模型能够依赖声誉捷径。为弥补这一空白,我们引入了FraudSMSWalker,一个用于URL屏蔽的短信到网页欺诈判断的受控基准。FraudSMSWalker包含699条双语链,包括332个欺诈和367个良性案例,涵盖十个服务场景。模型可见输入包括短信上下文和经过处理的网页证据,而原始URL、主机、域名、IP、重定向和声誉元数据被隐藏。该基准进一步包含硬良性案例,其页面包含登录、支付、验证或账户管理元素,这些元素在服务上下文中看似合理,但也出现在诈骗流程中。我们在屏蔽浏览器代理协议下评估了九个网络代理,并进行了URL可见性消融实验。结果表明,当前代理可以检测可疑线索,但难以保持良性召回,并且经常产生观察证据弱支持的正面预测。这些发现将FraudSMSWalker定位为一个基准,用于衡量当直接声誉捷径被抑制时,网络代理能否做出既准确又有证据基础的欺诈判断。相关代码和数据集可通过匿名链接获取:https://anonymous.4open.science/w/FraudMessageWalker-Bench

英文摘要

SMS fraud is increasingly cross-channel: a message directs the user to a webpage, and the final risk depends on how the SMS claim aligns with the page content and requested user action. However, existing evaluations either focus on message-only smishing classification or expose URL and domain cues that allow models to rely on reputation shortcuts. To address this gap, we introduce FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. FraudSMSWalker contains 699 bilingual chains, including 332 fraudulent and 367 benign cases, across ten service scenarios. The model-visible input consists of the SMS context and sanitized webpage evidence, while raw URLs, hosts, domains, IPs, redirects, and reputation metadata are withheld. The benchmark further includes hard benign cases whose pages contain login, payment, verification, or account-management elements that are plausible under the service context but also appear in scam flows. We evaluate nine web agents under masked browser-agent protocols and conduct URL-visibility ablations. The results show that current agents can detect suspicious cues, but struggle to preserve benign recall and often produce positive predictions that are weakly supported by the observed evidence. These findings position FraudSMSWalker as a benchmark for measuring whether web agents can make fraud judgments that remain both accurate and evidence-grounded when direct reputation shortcuts are suppressed. The associated code and dataset are accessible at the anonymous link: https://anonymous.4open.science/w/FraudMessageWalker-Bench

2606.16753 2026-06-16 cs.CL cs.AI cs.LG 新提交

P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs

P3B3:用于测量大语言模型中欧洲和巴西葡萄牙语变体偏差的多轮对话基准

Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

发表机构 * NOVA University of Lisbon(新里斯本大学) NOVA LINCS(NOVA LINCS实验室)

AI总结 提出P3B3基准,通过专家策划的对话提示和评估框架,测量大语言模型在葡萄牙语变体(欧洲vs巴西)上的偏差和可控性,发现多数模型偏向巴西葡萄牙语。

Comments Accepted at MeLLM Workshop at ACL 2026

详情
AI中文摘要

随着大语言模型(LLMs)融入日常交流,捕捉区域语言变异对于可靠和公平的语言使用至关重要。在葡萄牙语中,欧洲(pt-PT)和巴西(pt-BR)变体仍然代表性不均,pt-BR在数据量上占主导地位,而LLM对葡萄牙语变体的偏好尚未得到充分探索。为弥补这一空白,我们引入了P3B3,一个由专家策划的语言变体无关的对话提示基准,以及一个用于测量变体偏差和可控性的评估框架。在多个模型上的实验表明,大多数LLM表现出对pt-BR的强烈偏差,且不同模型的可控性存在差异。这些结果凸显了需要在语言变体之间实现更平衡的多语言表示。

英文摘要

As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese variants remains underexplored. To address this gap, we introduce P3B3, an expert-curated language variety agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.

2606.16890 2026-06-16 cs.CL cs.AI 新提交

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

组合推理深度预测临床AI失败:与电子健康记录问答中Transformer组合性限制一致的实证证据

Sanjay Basu

发表机构 * University of California San Francisco(加州大学旧金山分校) Waymark

AI总结 本研究引入推理步数(hop count)作为预测大型语言模型在电子健康记录问答中失败的理论驱动指标,发现准确率随步数增加单调下降,且扩展思考未能显著改善,提示组合推理深度是跨架构的失败预测因子。

Comments 20 pages, 5 figures. Code: https://github.com/sanjaybasu/compositional-depth-clinical-ehr

详情
AI中文摘要

聚合准确率基准掩盖了大型语言模型在电子健康记录(EHR)问答中失败的系统性结构:需要更多推理步骤的问题会产生不成比例的更多错误。受Transformer组合性限制的理论结果启发,我们引入一个预先指定的跳数分类法——从EHR回答临床问题所需的不同推理步骤的数量——作为模型失败的原则性预测因子。我们标注了313个由临床医生生成的MedAlign EHR问答对,涵盖四个跳数级别,并在模型内消融(claude-sonnet-4-6,零样本 vs. 扩展思考)和跨架构复制(gpt-4o和gpt-5.4-2026-03-05,零样本)中评估了301个问题。所有三个模型,跨越两个提供商和两个OpenAI代(GPT-4和GPT-5),均显示准确率随跳数单调下降:Claude Sonnet零样本从30.6%(跳数=1)降至17.6%(跳数=4)(Cochran-Armitage z=-2.30,p=0.011;每跳OR 0.72,95% CI [0.56,0.92],p=0.008);GPT-4o复现了这一点(37.8%降至14.7%;OR 0.58 [0.45,0.75],p<0.001);gpt-5.4-2026-03-05证实了这一点(37.8%降至23.5%;OR 0.80 [0.66,0.98],p=0.027)。一项预先指定的上下文充分性审计显示,较高跳数的问题并未因EHR截断而受到不同不利影响(跳数2-4的可回答性为93-95%,而跳数1为79%),因此下降反映了组合推理难度。扩展思考在三个推理条件下并未显著平缓准确率-深度曲线,且思考令牌使用量与跳数呈正相关(r=0.31,p<0.0001),与预测的O(k)计算需求一致。因此,跳数是一个理论驱动、跨架构的大型语言模型在EHR问答中错误的预测因子,对临床AI的部署风险分层具有直接意义。

英文摘要

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.
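
The per-hop odds ratios quoted above come from the authors' fitted models. As a rough sanity check only, the sketch below converts the reported hop=1 and hop=4 accuracies into odds and takes the geometric-mean odds ratio over the three hop increments. The function names are hypothetical, and this endpoint arithmetic is an assumption made for illustration, not the regression used in the paper.

```python
def odds(p: float) -> float:
    """Convert a probability (here, an accuracy) into odds."""
    return p / (1.0 - p)


def endpoint_per_hop_or(acc_first: float, acc_last: float, hop_gap: int) -> float:
    """Geometric-mean odds ratio per hop between two accuracy endpoints.

    This uses only the two endpoints, so it ignores the intermediate hop
    levels that a fitted logistic regression would take into account.
    """
    return (odds(acc_last) / odds(acc_first)) ** (1.0 / hop_gap)


# (accuracy at hop=1, accuracy at hop=4), copied from the abstract.
REPORTED_ENDPOINTS = {
    "claude-sonnet-4-6": (0.306, 0.176),
    "gpt-4o": (0.378, 0.147),
    "gpt-5.4-2026-03-05": (0.378, 0.235),
}


def summarize(endpoints: dict) -> dict:
    """Return the endpoint-based per-hop odds ratio for each model."""
    return {
        model: endpoint_per_hop_or(first, last, hop_gap=3)
        for model, (first, last) in endpoints.items()
    }


if __name__ == "__main__":
    for model, value in summarize(REPORTED_ENDPOINTS).items():
        print(f"{model}: {value:.3f}")
```

The endpoint-based values need not equal the fitted odds ratios (0.72, 0.58, 0.80), but each lands inside the corresponding reported 95% confidence interval, which is the kind of consistency one would expect if the decline is roughly log-linear in hop count.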

2606.16897 2026-06-16 cs.CL 新提交

Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures

对比差异CKA揭示跨语言模型架构的概念特定结构对齐

Xueping Gao

发表机构 * Alibaba Cloud(阿里云)

AI总结 提出对比差异CKA(CKA_Delta)方法,发现不同LLM架构在概念表示上存在几何收敛与功能可迁移性分离的现象,能有效区分概念特定相似性与通用相似性。

详情
AI中文摘要

不同LLM架构是否以结构兼容的方式编码高层概念?我们系统性地刻画了几何-功能普遍性分离:在多个概念领域和架构家族中,适度的几何收敛与近乎完美的功能迁移共存。通过使用对比差异CKA(CKA_Delta),一种无需训练的诊断方法,在每样本对比差异上计算核对齐,我们从通用相似性中分离出概念特定的收敛,在标准CKA无法区分的场景下实现了显著区分。这种分离在我们测试的所有六个概念领域(五个领域几何区分p≤0.017,安全性作为收敛功能趋势p=0.08)中重复出现,包括两个非指令概念(代码vs自然语言、推理vs记忆),这些概念在没有系统提示的情况下得到验证;一个70B-70B对提供了观察性说明,即普遍性可能随规模增强,需要更多≥70B模型进行验证。我们将CKA_Delta定位为实用的状态(regime)分类器和架构异常检测器(Gemma:d=1.08,AUC=0.79),而非绝对的迁移准确性预测器,为跨架构概念监控提供了一种无需训练的诊断方法。

英文摘要

Do different LLM architectures encode high-level concepts in structurally compatible ways? We systematically characterize a geometric-functional universality dissociation: across multiple concept domains and architectural families, moderate geometric convergence coexists with near-perfect functional transfer. Using contrastive-difference CKA (CKA_Delta), a training-free diagnostic that computes kernel alignment on per-sample contrastive differences, we isolate concept-specific convergence from generic similarity -- achieving significant discrimination where standard CKA cannot. The dissociation replicates across all six concept domains we test (five with p <= 0.017 geometric discrimination and safety as a converging-functional trend, p = 0.08), including two non-instruction concepts (code-vs-NL, reasoning-vs-recall) validated without system prompts; a single 70B--70B pair provides an observational note that universality may strengthen with scale, requiring replication with additional >=70B models. We position CKA_Delta as a practical regime classifier and architectural outlier detector (Gemma: d = 1.08, AUC = 0.79) rather than an absolute transfer-accuracy predictor, providing a training-free diagnostic for cross-architecture concept monitoring.
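
To make "kernel alignment on per-sample contrastive differences" concrete, here is a minimal sketch. It assumes a linear kernel and assumes that the contrastive difference is the activation for a concept-positive input minus the activation for its paired concept-negative input; the abstract specifies neither, and all function names are hypothetical. Pure Python is used so the example has no dependencies; real activations would normally be handled with an array library.

```python
import math


def gram(rows):
    """Linear-kernel Gram matrix K[i][j] = <rows[i], rows[j]>."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def center(k):
    """Double-center a Gram matrix (equivalent to mean-centering features)."""
    n = len(k)
    row_means = [sum(row) / n for row in k]
    col_means = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_means) / n
    return [
        [k[i][j] - row_means[i] - col_means[j] + total for j in range(n)]
        for i in range(n)
    ]


def frob_inner(a, b):
    """Frobenius inner product of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between two lists of per-sample vectors (same sample order).

    Assumes neither input is constant across samples; otherwise the
    denominator is zero.
    """
    kc, lc = center(gram(x)), center(gram(y))
    return frob_inner(kc, lc) / math.sqrt(frob_inner(kc, kc) * frob_inner(lc, lc))


def contrastive_difference_cka(pos_a, neg_a, pos_b, neg_b):
    """CKA on per-sample (positive - negative) differences from models A and B."""
    diff_a = [[p - q for p, q in zip(pr, nr)] for pr, nr in zip(pos_a, neg_a)]
    diff_b = [[p - q for p, q in zip(pr, nr)] for pr, nr in zip(pos_b, neg_b)]
    return linear_cka(diff_a, diff_b)
```

One reason to take differences first is that any component shared by the positive and negative activations of a sample cancels, so the score reflects the direction that separates the concept rather than similarity the two models would show on any input.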

2606.16910 2026-06-16 cs.CL cs.AI 新提交

IMPACTeen: Intentions, Manipulation, Persuasion, Annotations, and Consequences in Teen Communication Dataset

IMPACTeen:青少年沟通数据集中的意图、操纵、说服、标注与后果

Aleksander Szczęsny, Wiktoria Mieleszczenko-Kowszewicz, Maciej Markiewicz, Beata Bajcar, Tomasz Adamczyk, Jolanta Babiak, Grzegorz Chodak, Przemysław Kazienko

发表机构 * Wrocław University of Science and Technology(弗罗茨瓦夫理工大学)

AI总结 构建IMPACTeen数据集,包含1021个青少年社交影响场景文本,从五个视角标注,支持社交影响检测、标注者分歧及跨语言建模研究。

详情
AI中文摘要

IMPACTeen是一个文本社交影响场景数据集,涵盖青少年语境下的人际、媒体和数字环境。它包含1,021个文本、5,100条独立标注记录以及社交影响技术的黄金标签,每个文本从五个不同视角(青少年、家长、心理学家、沟通专家和教师)进行标注。该资源通过受限的大语言模型生成构建,随后经过两步人工编辑和验证阶段,以确保青少年语境的真实性。多维标注涵盖了影响存在性、技术、意图、后果、抵抗、反应和标注置信度。该数据集支持社交影响检测、标注者分歧、跨语言建模以及语言模型的训练和评估。数据集以波兰语创建,并附有相应的英文版本。

英文摘要

IMPACTeen is a dataset of textual social influence scenarios spanning interpersonal, media-based, and digital settings in an adolescent context. It contains 1,021 texts, 5,100 individual annotation records, and gold labels for social influence techniques, with each text annotated from five distinct perspectives: teenagers, parents, psychologists, communication experts, and teachers. The resource was constructed through constrained LLM generation, followed by a two-step human editing and validation phase aimed at ensuring youth-context realism. A multi-dimensional annotation covered influence presence, techniques, intentions, consequences, resistance, reactions, and annotation confidence. The dataset supports research on social influence detection, annotator disagreement, cross-lingual modeling, and the training and evaluation of language models. The dataset was created in Polish and is accompanied by a corresponding English version.

2606.15300 2026-06-16 cs.AI cs.CL 交叉投稿

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

CODA-BENCH:代码智能体能否处理数据密集型任务?

Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du

发表机构 * Renmin University of China(中国人民大学)

AI总结 提出CODA-BENCH基准,在数据密集型环境中联合评估代码与数据智能,包含1009个任务,平均每个环境980个文件,揭示当前智能体在数据发现与代码执行整合上的不足。

Comments Accepted at ICML 2026. 37 pages, 11 figures. Project page: https://coda-bench.github.io/ Code: https://github.com/ruc-datalab/CoDA-Bench Data: https://huggingface.co/datasets/RUC-DataLab/CoDA-Bench

详情
AI中文摘要

高级智能体正日益展现出作为自主工程师的潜力,这催生了对能够捕捉真实世界开发复杂性的评估基准的需求。此类环境通常涉及复杂代码和大规模数据(即文件系统)。然而,现有基准通常孤立地评估代码中心或数据中心能力,与真实开发场景存在明显差距。在本文中,我们通过引入CODA-BENCH来弥合这一差距,这是首个在数据密集型环境中联合评估代码与数据智能的基准。我们基于Kaggle生态系统(包含数百个数据集)构建了一个数据密集型Linux沙箱,其中智能体必须主动探索复杂的文件层次结构以识别相关资源,并为数据驱动的分析任务生成代码。CODA-BENCH包含跨越31个社区的1009个任务,每个任务环境平均包含980个文件,模拟了真实的数据规模和噪声。对高级智能体的评估显示,即使是最优系统也难以有效整合数据发现与代码执行,成功率仅为61.1%。这些结果凸显了当前智能体在数据密集型任务中的能力差距,并为未来研究指明了有希望的方向。

英文摘要

Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, with each task environment containing an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

2606.15696 2026-06-16 cs.AI cs.CL cs.LG 交叉投稿

Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?

LLMs 能否可靠识别失语症语篇中的正确信息单元?

Jason M Pittman, Yesenia Medina-Santos, Anton Phillips, Brielle C. Stark

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 研究评估指令微调大语言模型在零样本和少样本提示下对失语症语篇进行词级正确信息单元分类的性能,发现少样本提示可提升效果但一致性仍不足。

Comments 5 tables, 4 figures

详情
AI中文摘要

正确信息单元(CIUs)是失语症语篇评估的核心,因为它们量化了交际信息性而非仅语言形式。然而,CIU评分耗时且需要训练有素的评分者。本研究考察了指令微调的大语言模型(LLMs)是否能够可靠地从失语症语篇转录中进行词元级CIU分类。使用Cat Rescue刺激引发的16个图片描述转录根据Nicholas和Brookshire(1993)的标准进行CIU状态标注。样本涵盖四个严重程度层:对照组、轻度、中度和重度失语症。在零样本和两种少样本提示条件下,对四个公开可用的指令微调LLMs进行了基准测试,使用五个分层随机种子。通过准确率、精确率、召回率、F1和Cohen's kappa与人类共识标签进行性能评估。零样本提示在所有模型中均不足。相比之下,少样本提示带来了显著提升,并为三个可行模型产生了有竞争力的性能。Llama-3.1-8B、Qwen2.5-7B和Mistral-7B的平均少样本F1分数范围为0.776至0.817,固定全局和逐块局部示例选择之间无显著差异。Phi-3-mini不稳定且未产生可靠性能。可行模型显示出高召回率但较低的精确率,表明系统性地过度将词元分类为CIU。性能也随语篇严重程度变化,在更严重的失语症中结果最弱。少样本LLM提示可以在无需基于梯度的任务训练的情况下支持自动CIU识别,但与人类标注的一致性仍不足以完全自主使用。这些发现支持基于LLM的CIU评分作为语篇评估系统中一个有前景的人机协同组件。

英文摘要

Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.
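
The evaluation metrics named above can be computed from token-level binary labels as in the following sketch. The label sequences are invented for illustration (they are not data from the study), and the function name is hypothetical; the example is built so that the prediction over-labels tokens as CIUs, which is the pattern the abstract describes as high recall with lower precision.

```python
def binary_agreement_metrics(gold, pred):
    """Accuracy, precision, recall, F1, and Cohen's kappa for 0/1 labels.

    gold: human consensus labels (1 = token is a CIU, 0 = not a CIU)
    pred: model labels for the same tokens, in the same order
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Cohen's kappa corrects observed agreement for chance agreement,
    # where chance is estimated from each rater's marginal label rates.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical token labels: the model marks two non-CIU tokens as CIUs.
GOLD = [1, 1, 1, 0, 0, 1, 0, 1]
PRED = [1, 1, 1, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    for name, value in binary_agreement_metrics(GOLD, PRED).items():
        print(f"{name}: {value:.3f}")
```

In this toy case F1 stays fairly high while kappa is much lower, which illustrates why a study like this one reports chance-corrected agreement alongside F1 before judging whether automated scoring can replace a trained rater.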

2606.16206 2026-06-16 cs.AI cs.CL cs.CY cs.HC 交叉投稿

Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

衡量LLM导师是教学还是解题:教育影响的诊断方法

Junyi Yao, Zihao Zheng, Baichuan Li

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学) Department of Operations Research and Engineering Management, Southern Methodist University(南卫理公会大学运筹学与工程管理系)

AI总结 针对LLM作为教育导师时解题能力不等于教学支持的问题,提出基于解题导向与教学导向基准性能差距的诊断方法,通过MathTutorBench分析表明两者仅部分对齐,建议分开报告评分并明确保护学生能动性的标准。

详情
AI中文摘要

大型语言模型越来越多地被提议作为教育导师,但更强的任务解决能力并不一定意味着更强的学习支持。受近期呼吁在实践中衡量NLP系统社会影响的启发,我们研究公开的LLM辅导基准是否能够区分支持学习的行为与单纯的答案生成。我们提出了一种轻量级诊断方法,基于解题导向和教学导向基准性能之间的差距。利用公开的MathTutorBench排行榜结果,我们表明这些维度仅部分对齐:在八个公开报告的模型中,解题和教学综合得分之间的相关性为0.421,并且当评估从解题转向教学时,几个模型的排名发生了显著变化。然后,我们分析了公开的TutorBench样本,并表明与能动性相关的行为明确编码在基准评分标准中,尤其是在主动学习环境中,奖励引导性问题、校准提示和非揭露性脚手架。这些发现共同表明,教育影响评估不应将任务成功视为学习支持的充分代理。我们认为,公开的辅导基准可以通过分别报告解题导向和教学导向得分,并使披露敏感、保护学生能动性的标准更加明确,从而更好地支持积极影响评估。

英文摘要

Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.
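
The diagnostic described above compares two scores per model and looks at how rankings move between them. The sketch below shows one way to compute both quantities. The scores are hypothetical placeholders rather than MathTutorBench results, Pearson correlation is an assumption (the abstract does not say which correlation was used), and the function names are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_positions(scores):
    """Map each model to its rank, where 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_shifts(solving, pedagogy):
    """Solving rank minus pedagogy rank; positive means better at pedagogy."""
    solving_rank = rank_positions(solving)
    pedagogy_rank = rank_positions(pedagogy)
    return {model: solving_rank[model] - pedagogy_rank[model] for model in solving}


# Hypothetical composite scores for four unnamed models.
SOLVING = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70, "model_d": 0.60}
PEDAGOGY = {"model_a": 0.55, "model_b": 0.70, "model_c": 0.50, "model_d": 0.65}

if __name__ == "__main__":
    models = list(SOLVING)
    r = pearson([SOLVING[m] for m in models], [PEDAGOGY[m] for m in models])
    print(f"correlation: {r:.3f}")
    print(rank_shifts(SOLVING, PEDAGOGY))
```

Reporting the two scores separately, as the authors recommend, is what makes this kind of comparison possible; a single blended score would hide a model that solves well but ranks lower on pedagogy.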

2606.16748 2026-06-16 cs.LG cs.CL 交叉投稿

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

MyPCBench: 个人智能计算机使用代理的基准测试

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出MyPCBench基准,在模拟真实桌面环境(含17个Web应用)中测试个人计算机使用代理,发现最佳模型Claude Opus 4.6仅解决55.4%任务,失败集中在多应用和长轨迹任务。

详情
AI中文摘要

当前的计算机使用代理基准测试在非个人化环境中评估模型。这导致评估与部署之间存在差距,因为个人助理预计将在用户的整个数字生活中工作,包括其上下文、历史数据和已登录账户。这种差距在Web任务中最为明显,因为实时Web评估无法测试需要登录或个人信息的网站,而真正的个人助理必须驱动这类网站。我们引入了MyPCBench,它在Linux桌面上测试计算机使用代理作为个人助理,该桌面填充了17个模拟的真实世界Web应用程序和一个完整的桌面堆栈,所有这些都为一个典型角色——来自《办公室》的Michael Scott——进行了种子化。我们在此环境中定义了184个任务,每个任务都受到来自OpenClaw社区的真实请求的启发,并使用统一的计算机+bash工具界面基准测试了六个闭源和开源模型。我们发现,最佳模型Claude Opus 4.6完全解决了55.4%的任务,是唯一超过50%的模型。模型失败集中在跨越多个应用程序的任务和长轨迹上,其中个性化对助理的压力最大。我们在https://mypcbench.com上发布了环境、任务集和代理工具包。

英文摘要

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. We find that the best model, Claude Opus 4.6, fully solves 55.4% of the tasks, the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.

2306.11252 2026-06-16 cs.CL cs.LG 版本更新

HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

HK-LegiCoST: 利用非逐字转录进行语音翻译

Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur

发表机构 * Center for Language and Speech Processing(语言与语音处理中心) Human Language Technology Center of Excellence(人类语言技术卓越中心) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出HK-LegiCoST语料库,包含600+小时粤语-英语三路平行数据,解决非逐字转录的句子级对齐挑战,在粤语语音翻译上取得竞争性基线并跨语料库验证。

详情
AI中文摘要

我们介绍了HK-LegiCoST,一个新的粤语-英语三路平行语料库,包含600+小时的粤语音频、其标准繁体中文转录和英文翻译,并在句子级别进行切分和对齐。我们描述了语料库准备中的显著挑战:切分、长音频记录的对齐,以及与非逐字转录的句子级对齐。当源语言的口语和书面形式存在显著差异时,此类转录使语料库适用于语音翻译研究。由于其大规模,我们能够在HK-LegiCoST上展示具有竞争力的语音翻译基线,并将其扩展到FLEURS粤语子集上具有前景的跨语料库结果。这些结果为语音识别和翻译研究提供了见解,特别是对于因各种因素(包括方言和口语)而常见非逐字或“噪声”转录的语言。

英文摘要

We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoken and written forms of the source language. Due to its large size, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results deliver insights into speech recognition and translation research in languages for which non-verbatim or "noisy" transcription is common due to various factors, including vernacular and dialectal speech.

2412.21036 2026-06-16 cs.CL 版本更新

GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

GePBench:评估多模态大语言模型的基础几何感知能力

Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出GePBench基准,系统评估多模态大语言模型的几何形状识别与空间关系感知能力,发现现有模型存在显著缺陷,而基于该基准训练可提升下游任务性能。

详情
AI中文摘要

几何形状在物理世界和人类认知中都扮演着重要角色。尽管多模态大语言模型(MLLMs)在视觉理解方面取得了显著进展,但它们识别几何形状及其空间关系的能力(我们称之为“几何感知”)尚未得到明确和系统的探索。为填补这一空白,我们引入了GePBench,这是一个专门设计用于评估MLLMs几何感知能力的新型基准。我们的广泛评估表明,即使是当前最先进的MLLMs在几何感知任务中也表现出显著缺陷。此外,我们展示了使用GePBench数据训练的模型在广泛的下游任务上取得了显著改进,突显了几何感知在实现高级多模态应用中的关键作用。我们的代码和数据集可在https://github.com/Changhao-Xiang/GePBench获取。

英文摘要

Geometric shapes play important roles in both the physical world and human cognition. While multimodal large language models (MLLMs) have made significant advancements in visual understanding, their abilities to recognize geometric shapes and their spatial relationships, which we term geometric perception, have not been explicitly and systematically explored. To address this gap, we introduce GePBench, a novel benchmark specifically designed to assess the geometric perception capabilities of MLLMs. Our extensive evaluations reveal that even the current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate considerable improvements on a wide range of downstream tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets are available at https://github.com/Changhao-Xiang/GePBench.

2506.21613 2026-06-16 cs.CL cs.SD eess.AS 版本更新

ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

ChildGuard:针对儿童仇恨言论的专用数据集

Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem

发表机构 * Macquarie University(麦考瑞大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) DSEU(德里技能与创业大学) Department of Information and Computer Science, KFUPM(法赫德国王石油与矿产大学信息与计算机科学系) Stanford University(斯坦福大学)

AI总结 针对社交媒体上针对儿童的仇恨言论问题,构建了大规模英文数据集ChildGuard,包含351,877条标注实例,覆盖三个年龄段,并评估了多种模型性能。

Comments Updated Version

详情
AI中文摘要

心理健康行业越来越关注社交媒体上针对儿童的仇恨言论,因为接触此类内容可能在关键发育阶段导致不良心理结果。当前的仇恨言论数据集和检测系统对儿童应用的支持有限,因为它们主要针对成人设计,缺乏针对儿童仇恨言论的年龄特定特征的专门表示。为了解决这一差距,我们引入了ChildGuard,一个大规模英文数据集,用于针对儿童的仇恨言论,包含从X(原Twitter)、Reddit和YouTube收集的351,877条标注实例。该数据集覆盖三个年龄组:幼儿(11岁以下)、前青少年(11-12岁)和青少年(13-17岁)。ChildGuard包含两个子集:上下文子集(157K)和词汇子集(194K)。使用最新的基于Transformer的模型和LLM进行评估,最佳Macro-F1达到82.07%,在幼儿、上下文、隐式仇恨和跨子集设置下分别降至79.41%、79.24%、76.04%和74.88%。

英文摘要

The mental health industry faces growing concerns regarding hate speech directed at children on social media, as exposure to such content can contribute to adverse psychological outcomes during critical stages of development. Current hate speech datasets and detection systems provide limited support for child-focused applications because they are primarily designed for adults and lack dedicated representations of age-specific characteristics associated with hate speech directed at children. To address this gap, we introduce ChildGuard, a large-scale English dataset for child-targeted hate speech containing 351,877 annotated instances collected from X (formerly Twitter), Reddit, and YouTube. The dataset covers three age groups: younger children (under 11), pre-teens (11-12), and teens (13-17). ChildGuard contains two subsets: a contextual subset (157K) and a lexical subset (194K). Evaluation using recent transformer-based models and LLMs achieves a best Macro-F1 of 82.07%, decreasing to 79.41%, 79.24%, 76.04%, and 74.88% on the younger-children, contextual, implicit-hate, and cross-subset settings, respectively.
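Macro-F1, the headline metric in this abstract, averages the per-class F1 scores with equal weight, so a rare class counts as much as a frequent one. The following is a minimal sketch in plain Python. The label names and toy predictions are invented for illustration and are not drawn from ChildGuard.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    f1_per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)


# Hypothetical toy labels: the minority class "hate" weighs as much as "none".
y_true = ["hate", "hate", "none", "none", "none", "none"]
y_pred = ["hate", "none", "none", "none", "none", "hate"]
score = macro_f1(y_true, y_pred)
print(f"macro-F1 = {score:.3f}")
```

On this toy input, plain accuracy is 4/6 while macro-F1 is lower, because the minority class is only half right. This is why macro-averaging is a common choice for imbalanced hate-speech data.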

2508.01401 2026-06-16 cs.CL cs.AI 版本更新

MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs

MedSynth: 真实、合成的医疗对话-笔记对

Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Nadine A. Friedrich, Maria P Mogollon, Alexander Hernandez-Tirado, Guillermo Lopez Garcia, Cyril Rakovski, Frank Rudzicz

发表机构 * Dalhousie University(达尔豪斯大学) Vector Institute(向量研究所) Shahrood University of Technology(沙赫鲁德理工大学) Chapman University(查普曼大学) Cedars-Sinai Medical Center(西达赛奈医疗中心)

AI总结 为解决医生文书负担,提出MedSynth合成数据集,包含超1万对对话-笔记,覆盖2000+ICD-10编码,显著提升Dial-2-Note和Note-2-Dial任务性能。

Comments 7 pages excluding references and appendices

详情
AI中文摘要

医生花费大量时间记录临床就诊,这一负担导致了职业倦怠。为了解决这个问题,强大的医疗文档自动化工具至关重要。我们引入了MedSynth——一个新颖的合成医疗对话和笔记数据集,旨在推进对话到笔记(Dial-2-Note)和笔记到对话(Note-2-Dial)任务。基于对疾病分布的广泛分析,该数据集包含超过10,000个对话-笔记对,覆盖2000多个ICD-10编码。我们证明,该数据集显著提升了模型从对话生成医疗笔记以及从医疗笔记生成对话的性能。在开放获取、符合隐私要求且多样化的训练数据稀缺的领域,该数据集提供了宝贵的资源。代码可在https://github.com/ahmadrezarm/MedSynth/tree/main获取,数据集可在https://huggingface.co/datasets/Ahmad0067/MedSynth获取。

英文摘要

Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth -- a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, this dataset includes over 10,000 dialogue-note pairs covering over 2000 ICD-10 codes. We demonstrate that our dataset markedly enhances the performance of models in generating medical notes from dialogues, and dialogues from medical notes. The dataset provides a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at https://github.com/ahmadrezarm/MedSynth/tree/main and the dataset is available at https://huggingface.co/datasets/Ahmad0067/MedSynth.

2509.22808 2026-06-16 cs.CL 版本更新

ArFake: A Robust Framework for Multi-Dialect Arabic Speech Spoofing Detection Benchmark

ArFake: 多方言阿拉伯语语音欺骗检测基准的鲁棒框架

Mohamed Elsetohy, Alhassan Ehab, Ali Mekky, Besher Hassan, Shady Shehata

发表机构 * MBZUAI, UAE(穆罕默德·本·扎耶德人工智能大学,阿联酋) Queen’s University, Canada(女王大学,加拿大) University of Waterloo, Canada(滑铁卢大学,加拿大)

AI总结 提出首个多方言阿拉伯语欺骗语音数据集,通过多模型评估和人类评分构建基准,发现FishSpeech在卡萨布兰卡语料库上生成最逼真的合成语音。

详情
AI中文摘要

随着生成式文本到语音模型的兴起,区分真实语音和合成语音变得具有挑战性,尤其是对于研究较少的阿拉伯语。大多数欺骗检测工作集中在英语上,为阿拉伯语及其多种方言留下了显著空白。在这项工作中,我们引入了第一个多方言阿拉伯语欺骗语音数据集。为了评估各模型合成音频的检测难度、确定哪个模型能产生最具挑战性的样本,并据此指导最终数据集的构建(合并多个模型的音频,或只选用表现最佳的模型),我们设计了一个评估流程:使用两种方法训练分类器,即结合分类器头的现代嵌入方法和应用于MFCC特征的经典机器学习算法,此外还训练了RawNet2架构。该流程进一步结合了基于人类评分的平均意见得分计算,以及通过自动语音识别模型处理原始和合成数据集以测量词错误率。我们的结果表明,FishSpeech在卡萨布兰卡语料库上的阿拉伯语语音克隆方面优于其他TTS模型,产生更逼真和具有挑战性的合成语音样本。然而,依赖单一TTS进行数据集创建可能会限制泛化能力。

英文摘要

With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate how difficult each model's synthesized audio is to detect, determine which model produces the most challenging samples, and thereby guide the construction of our final dataset (either by merging audio from multiple models or by selecting the best-performing model), we ran an evaluation pipeline that trained classifiers using two approaches (modern embedding-based methods combined with classifier heads, and classical machine learning algorithms applied to MFCC features) as well as the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.
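The Word Error Rate used in the pipeline is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch assuming plain whitespace tokenization and no text normalization. Real Arabic evaluation would likely need diacritic and orthographic normalization, which is not modelled here. The English sentences are placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words (rolling one-row DP table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder sentences: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

Comparing the WER of an ASR model on original versus synthesized audio gives a rough intelligibility check on the TTS output. WER can exceed 1.0 when the hypothesis contains many inserted words.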

2510.06143 2026-06-16 cs.CL 版本更新

RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

RoSE: 循环合成数据评估,无需人工测试集选择LLM生成器

Jan Cegin, Branislav Pecher, Ivan Srba, Jakub Simko

发表机构 * Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所)

AI总结 提出RoSE方法,通过训练小模型在候选生成器输出上,并在其他LLM合成样本上评估,以选择最佳LLM生成器,无需人工测试集,在多个语言和任务中优于其他内在启发式方法。

Comments 16 pages; EACL 2026 Main

详情
AI中文摘要

LLM是合成数据的强大生成器,这些数据用于训练较小的特定模型。这对于低资源语言尤其有价值,因为人工标注数据稀缺,但LLM仍能生成高质量文本。然而,不同LLM的输出对训练的实用性不同。选择最佳LLM作为生成器具有挑战性,因为外部评估需要昂贵的人工标注(通常低资源语言不可用),而内在指标与下游性能相关性差。我们引入了循环合成数据评估(RoSE),这是一种无需人工测试集即可选择最佳LLM生成器的代理指标。RoSE在候选生成器(LLM)的输出上训练一个小模型,然后在所有其他候选LLM生成的合成样本上评估该模型。最终的RoSE分数是该小模型的平均性能。在六个LLM、十一种语言和三个任务(情感、主题、意图)上,RoSE比任何其他内在启发式方法更频繁地识别出最优生成器。RoSE优于内在启发式方法,并且与最优生成器基线的差距在0.76个百分点以内。该结果是通过在所选生成器的输出上训练小模型(最优与代理指标选择)并在人工标注测试数据上评估,以下游性能衡量。此外,RoSE是唯一与人工测试数据性能呈正相关的指标。

英文摘要

LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round-robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on synthetic examples generated by all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any intrinsic heuristic. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator's outputs (optimal vs. proxy metric selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.
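The round-robin procedure described in this abstract can be sketched as a short loop: train on one generator's data, score on every other generator's data, and average. The code below is an illustration under stated assumptions, not the authors' implementation. The word-vote classifier stands in for the "small model", and the generator names and examples are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def train_small_model(examples):
    """Toy stand-in for the small model: each known word votes for its labels."""
    word_votes = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_votes[word][label] += 1
    default = label_counts.most_common(1)[0][0]

    def predict(text):
        votes = Counter()
        for word in text.lower().split():
            votes.update(word_votes.get(word, {}))
        return votes.most_common(1)[0][0] if votes else default

    return predict


def accuracy(predict, examples):
    return mean(1.0 if predict(text) == label else 0.0 for text, label in examples)


def rose_scores(synthetic, train_fn=train_small_model, eval_fn=accuracy):
    """For each generator: train on its data, evaluate on all OTHER generators' data."""
    scores = {}
    for name, data in synthetic.items():
        model = train_fn(data)
        scores[name] = mean(
            eval_fn(model, other_data)
            for other, other_data in synthetic.items()
            if other != name
        )
    return scores


# Hypothetical synthetic sentiment data from three candidate generators.
synthetic = {
    "llm_a": [("great movie", "pos"), ("awful plot", "neg"),
              ("great acting", "pos"), ("awful pacing", "neg")],
    "llm_b": [("great film", "pos"), ("awful film", "neg")],
    "llm_c": [("nice nice", "pos"), ("bad bad", "neg")],
}
scores = rose_scores(synthetic)
best = max(scores, key=scores.get)
print(scores, "-> selected generator:", best)
```

No human-labelled test set appears anywhere in the loop. A generator whose data transfers well to the others' outputs scores high, while an idiosyncratic one (here `llm_c`) scores low.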

2510.12306 2026-06-16 cs.CL 版本更新

A large-scale pipeline for LLM-assisted corpus annotation: variation and change in the English consider construction

基于大语言模型的大规模语料标注流水线:英语consider构式的变异与变化

Cameron Morin, Matti Marttinen Larsson

发表机构 * Université Paris-Cité(巴黎西岱大学) University of Gothenburg(哥德堡大学)

AI总结 提出一种利用大语言模型自动标注大规模语料的四阶段流水线,在历史英语语料库中标注14万条consider构式,准确率超98%,揭示未记录的体裁特异性变化轨迹。

详情
AI中文摘要

随着自然语言语料库以前所未有的速度扩展,人工标注仍然是语料库语言学工作中的重大方法论瓶颈。我们通过提出一种可扩展的流水线来应对这一挑战,该流水线利用大语言模型(LLMs)自动进行大规模语料的语法标注。与以往的监督式和迭代式方法不同,我们的方法采用四阶段工作流程:提示工程、事前评估、自动批量处理和事后验证。我们通过一个关于英语评价性consider构式(consider X as/to be/Ø Y)变异的历时案例研究,展示了该流水线的易用性和有效性。我们通过OpenAI API在不到60小时内从历史美国英语语料库(COHA)中标注了143,933条'consider'索引行,在两个复杂的标注程序上实现了98%以上的准确率。对44,527个评价性构式的真阳性实例拟合的贝叶斯多项GAM揭示了先前未记录的体裁特异性变化轨迹,使我们能够提出关于语域正式性与形态句法简化和增强的竞争压力之间关系的新假设。我们的结果表明,LLMs可以在最少人工干预下大规模执行一系列数据准备任务,解锁了以前实际难以触及的实质性研究问题,尽管实施过程中需要注意成本、许可和其他伦理考虑。

英文摘要

As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English evaluative consider construction (consider X as/to be/Ø Y). We annotate 143,933 'consider' concordance lines from the Corpus of Historical American English (COHA) via the OpenAI API in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. A Bayesian multinomial GAM fitted to 44,527 true positives of the evaluative construction reveals previously undocumented genre-specific trajectories of change, enabling us to advance new hypotheses about the relationship between register formality and competing pressures of morphosyntactic reduction and enhancement. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, unlocking substantive research questions previously beyond practical reach, though implementation requires attention to costs, licensing, and other ethical considerations.

2602.06015 2026-06-16 cs.CL 版本更新

A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

大型语言模型在PTSD严重程度评估中的系统评估:上下文知识与建模策略的作用

Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz

发表机构 * Department of Computer Science(计算机科学系) Stony Brook University(石溪大学) College of Connected Computing(连接计算学院) Vanderbilt University(范德比大学) Lund University(隆德大学) University of Minnesota(明尼苏达大学) Stony Brook World Trade Center Wellness Program(石溪世界贸易中心健康计划) Renaissance School of Medicine at Stony Brook University(石溪大学复兴医学院) University of Texas at Dallas(德克萨斯大学达拉斯分校) Department of Psychiatry(精神病学系)

AI总结 本研究系统评估了11种大型语言模型在PTSD严重程度评估中的表现,发现提供详细定义和上下文可提升准确性,增加推理努力可改善估计,并验证了集成策略的有效性。

Comments 24 pages, 5 figures, 5 tables

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被以零样本(生成式)方式用于评估心理健康状况,然而我们对影响其准确性的因素了解有限。在本研究中,我们使用来自1,437名个体的自然语言叙述和自我报告的PTSD严重程度评分的临床数据集,全面评估了11种最先进的LLMs的性能。为了理解影响模型评估准确性的因素,我们系统地变化了(i)提示给模型的上下文知识,如子量表定义、分布摘要和访谈问题,以及(ii)建模策略,包括零样本与少样本、推理努力量、模型大小、结构化子量表与直接标量预测、输出重新缩放和九种集成方法。我们的发现表明:(a)当提供详细的构念定义和叙述上下文时,LLMs最为准确,甚至超过人类评分者与自我报告评分的一致性;(b)增加推理努力导致更好的估计准确性;(c)开放权重模型(Llama, DeepSeek)的性能在超过700亿参数后趋于平稳,而封闭权重模型(gpt-o3-mini, gpt-5)随着新一代的推出而改进;(d)当将监督模型与零样本LLMs集成时,达到最佳性能。除了与自我报告的一致性外,LLMs的估计能够区分PTSD严重程度与抑郁、焦虑和酒精使用,并前瞻性地预测未来的精神医疗支出。总之,这些结果表明上下文知识和建模策略显著影响基于LLMs的PTSD严重程度评估的准确性和临床实用性。

英文摘要

Large language models (LLMs) are increasingly being used in a zero-shot (generative) fashion to assess mental health conditions, yet we have limited knowledge of which factors affect their accuracy. In this study, we use a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting the models' assessment accuracy, we systematically varied (i) contextual knowledge prompted to the models, such as subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies, including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative, even exceeding human raters' agreement with self-reported scores; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, DeepSeek) plateaus beyond 70B parameters while closed-weight (gpt-o3-mini, gpt-5) alternatives improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Beyond agreement with self-reports, LLMs' estimates discriminated PTSD severity from depression, anxiety, and alcohol use, and prospectively predicted future mental healthcare expenditure. Together, these results suggest that contextual knowledge and modeling strategies meaningfully affect the accuracy and clinical utility of LLM-based assessments of PTSD severity.

2603.07539 2026-06-16 cs.CL 版本更新

MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

MAWARITH:面向大语言模型的法律继承推理数据集与基准

Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib, Heba Sbahi, Mohammed Ghaly, Emad Mohamed

发表机构 * Hamad Bin Khalifa University, Qatar(卡塔尔哈马德·本·哈利法大学) Nazarbayev University, Kazakhstan(哈萨克斯坦纳扎尔巴耶夫大学)

AI总结 提出MAWARITH数据集,包含12,500个阿拉伯语继承案例,支持端到端的伊斯兰继承推理,并设计MIR-E多阶段评估指标,实验显示商业模型达90%而开源模型低于50%。

详情
AI中文摘要

伊斯兰继承法对大语言模型具有挑战性,因为解决继承案件需要复杂、结构化、多步骤的推理以及正确应用法学规则来计算继承人的份额。我们引入了MAWARITH,一个大规模标注数据集,包含12,500个阿拉伯语继承案例,用于训练和评估模型的完整推理链:(i) 识别合格继承人,(ii) 应用阻断(ḥajb)和分配规则,以及(iii) 计算精确继承份额。据我们所知,MAWARITH是首个专为端到端伊斯兰继承推理设计的阿拉伯语语料库和基准。与之前将继承案件解决限制为多项选择题的数据集不同,MAWARITH支持完整的推理链,并提供基于经典法学来源和既定继承规则的逐步解决方案及理由,以及精确的份额计算。这使得模型能够学习如何生成详细的、逐步响应用户查询的答案,反映现实世界的伊斯兰继承案例。为了超越最终答案准确性来评估模型,我们提出了MIR-E(Mawarith继承推理评估),一个加权多阶段指标,对关键推理阶段进行评分,并捕捉整个流水线中的错误传播。我们在零样本设置下评估了六个大语言模型。一个商业模型达到了约90%,而所有评估的开源模型均低于50%。我们的错误分析识别了重复出现的失败模式,包括场景误解、继承人识别错误、份额分配错误以及关键继承规则(如'awl和radd)的缺失或错误应用。MAWARITH数据集在https://gitlab.com/nlpresearcher/mawarith公开可用。

英文摘要

Islamic inheritance law is challenging for large language models because solving inheritance cases requires complex, structured, multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases for training and evaluating models on the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (ḥajb) and allocation rules, and (iii) computing exact inheritance shares. To the best of our knowledge, MAWARITH is the first Arabic corpus and benchmark designed for end-to-end Islamic inheritance reasoning. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions with justifications grounded in classical juristic sources and established inheritance rules, as well as exact share calculations. This enables models to learn how to generate detailed, step-by-step responses to user queries that reflect real-world Islamic inheritance cases. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate six large language models in a zero-shot setting. A commercial model achieves about 90%, whereas all evaluated open-source models remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at https://gitlab.com/nlpresearcher/mawarith.
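To make the exact-share step concrete, here is a small worked example of the 'awl adjustment named in this abstract: when the prescribed fractions add up to more than the whole estate, every share is reduced proportionally. The case used is the textbook one of a husband and two full sisters with no descendants (prescribed shares 1/2 and 2/3, which sum to 7/6). It is not taken from the MAWARITH dataset. The function name `apply_awl` is hypothetical, and heir eligibility, blocking, and radd are deliberately not modelled.

```python
from fractions import Fraction


def apply_awl(prescribed):
    """Scale prescribed shares down proportionally if they exceed the estate.

    Simplified illustration only: assumes the heirs and their prescribed
    fractions have already been determined correctly.
    """
    total = sum(prescribed.values(), Fraction(0))
    if total <= 1:
        return dict(prescribed)  # no over-subscription, nothing to adjust
    return {heir: share / total for heir, share in prescribed.items()}


# Textbook case: shares are 1/2 + 2/3 = 7/6 of the estate, so 'awl applies.
prescribed = {
    "husband": Fraction(1, 2),
    "two full sisters (jointly)": Fraction(2, 3),
}
adjusted = apply_awl(prescribed)
for heir, share in adjusted.items():
    print(f"{heir}: {prescribed[heir]} -> {share}")
```

Using `Fraction` rather than floats keeps every share exact, which matters when a benchmark checks exact share calculations rather than rounded values.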

2605.18421 2026-06-16 cs.CL cs.AI cs.LG 版本更新

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

EvoMemBench: 从自演化视角评估智能体记忆

Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Createlink Technology(创-link科技) Beijing University of Posts and Telecommunications(北京邮电大学) Beijing Institute of Technology(北京理工大学)

AI总结 本文提出EvoMemBench,从自演化视角评估智能体记忆,通过内存范围和内容两个维度构建统一基准,比较15种内存方法并发现当前内存系统尚未达到通用解决方案,长上下文基线仍具竞争力,内存在上下文不足或任务困难时效果显著,检索方法在知识密集型任务中表现优异,而程序和长期记忆方法在任务结构匹配时更有效。

详情
AI中文摘要

近期针对大语言模型(LLM)智能体的基准测试主要评估推理、规划和执行能力。然而,记忆对于智能体同样至关重要,因为它使智能体能够随时间存储、更新和检索信息。这种能力仍被低估,主要是因为现有基准测试未能提供系统评估记忆机制的方法。本文从自演化视角研究智能体记忆,引入EvoMemBench,一个沿内存范围(回合内 vs. 跨回合)和内存内容(知识导向 vs. 执行导向)两个轴线组织的统一基准。我们在标准化协议下比较了15种代表性内存方法与强大的长上下文基线。结果表明,当前内存系统仍远未达到通用解决方案:长上下文基线仍具有高度竞争力,内存在当前上下文不足或任务困难时效果最显著,且没有单一的内存形式能一致适用于所有设置。基于检索的方法在知识密集型任务中仍表现强劲,而程序和长期记忆方法在存储的经验与任务结构匹配时,对执行导向任务更有效。我们希望EvoMemBench能促进未来更有效的LLM智能体内存系统研究。我们的代码可在https://github.com/DSAIL-Memory/EvoMemBench获取。

英文摘要

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

2509.22888 2026-06-16 cs.AI cs.CL 版本更新

JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

JE-IRT: 通过联合嵌入项目反应理论审视LLM能力的几何视角

Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang

发表机构 * Independent Researcher(独立研究者) University of Cincinnati(辛辛那提大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出JE-IRT几何框架,将LLM和问题嵌入共享空间,通过方向编码语义、范数编码难度,揭示主题专长和分布外行为,支持新模型高效扩展,并发现与人类分类部分对齐的内部结构。

Comments 35 pages, 17 figures, 9 tables, accepted to TMLR

详情
AI中文摘要

标准LLM评估实践将多样能力压缩为单一分数,掩盖了其固有的多维性质。我们提出JE-IRT,一种几何项目反应框架,将LLM和问题嵌入共享空间。对于问题嵌入,方向编码语义,范数编码难度,而每个问题的正确性由模型和问题嵌入之间的几何交互决定。这种几何结构用主题专长取代了LLM的全局排名,并允许相关问题之间的平滑变化。基于此框架,我们的实验结果表明,分布外行为可以通过方向对齐来解释,且更大的范数一致地指示更难的问题。此外,JE-IRT自然支持泛化:一旦空间被学习,新LLM通过拟合单个嵌入即可添加。学习到的空间进一步揭示了仅部分与人类定义的主题类别对齐的LLM内部分类。我们还表明,嵌入空间的简单线性探针恢复了跨主题的能力方向,例如一个算术轴,在看似遥远的主题(如病毒学和全球事实)中突出定量要求高的问题。因此,JE-IRT建立了一个统一且可解释的几何视角,将LLM能力与问题结构联系起来,为模型评估和泛化提供了独特视角。

英文摘要

Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. We also show that simple linear probes of the embedding space recover cross-subject ability directions, such as an arithmetic axis that highlights quantitatively demanding questions in seemingly distant subjects like virology and global facts. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.
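The geometry described in this abstract can be illustrated with a toy scoring rule. The abstract does not give the actual JE-IRT formula, so the form below is purely an assumption chosen for illustration: project the model embedding onto the question's unit direction to get a topic-specific ability, subtract the question norm as difficulty, and pass the result through a logistic function. All embeddings are invented.

```python
import math


def p_correct(model_emb, question_emb):
    """Assumed scoring rule (NOT the paper's formula):
    sigmoid(<model, question direction> - question norm)."""
    norm = math.sqrt(sum(x * x for x in question_emb))
    if norm == 0.0:
        raise ValueError("question embedding must be non-zero")
    direction = [x / norm for x in question_emb]                 # semantics
    ability = sum(m * d for m, d in zip(model_emb, direction))   # skill on this topic
    return 1.0 / (1.0 + math.exp(-(ability - norm)))             # norm = difficulty


# Hypothetical 2-D space: axis 0 ~ "arithmetic", axis 1 ~ "history".
model = [3.0, 0.5]                     # strong on axis 0, weak on axis 1
easy_arith = [1.0, 0.0]                # same direction, small norm
hard_arith = [4.0, 0.0]                # same direction, large norm
easy_history = [0.0, 1.0]              # different direction, small norm

p_easy = p_correct(model, easy_arith)
p_hard = p_correct(model, hard_arith)
p_other = p_correct(model, easy_history)
print(f"easy arithmetic: {p_easy:.3f}, hard arithmetic: {p_hard:.3f}, "
      f"easy history: {p_other:.3f}")
```

Under this assumed rule a model has no single global rank: its success probability depends on how well its embedding aligns with each question's direction. Adding a new model only requires fitting one vector while the question embeddings stay fixed.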

2606.07226 2026-06-16 cs.LG cs.AI cs.CL 版本更新

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * Nanjing University(南京大学) Shanghai Innovation Institute(上海创新研究院) East China Normal University(华东师范大学)

AI总结 提出DEFINED框架,通过层次化八维指标体系、预训练语言模型和混合粒度训练策略,在辩论场景中实现数据高效的细粒度创造力自动评估,优于现有方法。

Comments Accepted by KDD 2026

详情
AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战,目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景,辩论反映了创造力的多个维度,涵盖发散思维和收敛思维。此外,辩论是一个数据丰富的领域,拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景,因此仍然依赖昂贵的人工评估。为此,本文提出DEFINED,一种数据高效的计算框架,用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力,采用预训练自回归语言模型,并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分,并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略,能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度,我们纳入了一项针对辩论新手参与者的实证研究,利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中,评分模型实现了准确且稳定的评分,优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.09669 2026-06-16 cs.AI cs.CL 版本更新

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

SpatialWorld: 真实世界任务中多模态智能体交互式空间推理的基准测试

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong

发表机构 * Tsinghua University(清华大学) Chongqing University(重庆大学) Peking University(北京大学) ZenoMind AI Xi’an Jiaotong University(西安交通大学) Beijing Institute of Technology(北京理工大学) Southeast University(东南大学) Shanghai Jiao Tong University(上海交通大学) Joy Future Academy The University of Hong Kong(香港大学)

AI总结 提出SpatialWorld基准,集成8种异构模拟后端,通过760个人工标注任务评估多模态智能体在视觉部分可观测环境中的交互式空间理解,发现最强模型GPT-5任务成功率仅17.4%。

详情
AI中文摘要

空间推理是多模态大语言模型(MLLMs)感知和操作物理世界的基础能力。然而,现有基准主要依赖被动评估(如静态VQA)或特定模拟器流程,未能评估通用的交互式空间理解。我们引入SpatialWorld,一个专门为评估多模态智能体在复杂真实世界任务中的交互式空间理解而设计的统一基准。在共享的、模拟器无关的协议下集成八个异构模拟后端,SpatialWorld包含跨多个领域(如家庭日常、旅行、社交协作)的760个人工标注任务。智能体必须在仅视觉的部分可观测性下解决问题,主动收集自我中心的视觉证据,并通过MLLMs原生的统一文本动作接口表达决策。为了可靠评估,每个任务包含一个人工验证的初始状态、一条参考轨迹和一个终端状态验证器。评估15个先进智能体揭示,稳健的空间任务解决仍然具有挑战性:最强模型GPT-5平均任务成功率(TSR)仅为17.4%,而领先的开源模型Qwen-3.5达到14.1%。进一步分析暴露了任务成功与执行效率之间的明显不匹配,以及显著的领域特定性能差异。这些在主动探索和长程规划中的瓶颈使SpatialWorld成为未来空间智能体的严格测试平台。

英文摘要

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

10. 安全、隐私、公平与可解释NLP 39 篇

2606.15307 2026-06-16 cs.CL cs.AI 新提交

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

利用思维链监督的强化学习进行仇恨和宣传模因的可解释检测

Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) Qatar University(卡塔尔大学)

AI总结 提出基于强化学习的后训练方法,结合任务特定奖励和组相对策略优化(GRPO),提升思考型多模态大语言模型在仇恨和宣传模因检测中的分类性能和解释质量。

详情
AI中文摘要

仇恨和宣传模因利用图像与文本之间的相互作用来传达有害意图,而这两种模态单独都无法揭示这种意图。尽管基于思考的多模态大语言模型(MLLMs)在视觉-语言理解方面取得了进展,但它们在模因内容审核中的应用仍未得到充分探索。我们提出了一种基于强化学习的后训练方法,通过任务特定奖励和组相对策略优化(GRPO)来提高思考型MLLMs的分类性能和基于参考的解释质量。具体来说,我们(i)对现成的MLLMs在英语和阿拉伯语基准上的仇恨和宣传模因理解进行了系统的实证研究,(ii)通过蒸馏和多LLM细粒度宣传标注,用弱监督的思维链(CoT)理由扩展了现有的模因数据集,(iii)引入了一个基于GRPO的目标函数,带有思考长度正则化,联合优化分类准确性和解释质量,以及(iv)研究基于共识伪标签的无标签模因的自监督GRPO。在Hateful Memes和ArMeme基准上的实验表明,我们的方法在FHM准确率(从79.9%提高到82.0%,提升高达2.1%)和ArMeme宏F1(从0.536提高到0.612,提升高达7.6个百分点,附带解释;与原始ArMeme基准相比提升6.1个百分点)上优于先前报告的结果,同时生成自然语言解释。在ArMeme上,序列分类基线在原始准确率方面仍然更强,而我们的方法提供了更平衡的每类性能以及解释。我们公开发布了代码、数据扩展和评估资源。

英文摘要

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.
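The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same input, which removes the need for a learned value function. The following is a minimal sketch of that group-relative advantage, paired with a composite reward in the spirit of the abstract (classification correctness, explanation quality, and a thinking-length penalty). The reward weights, the token budget, and the function names are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def reward(pred_label, gold_label, explanation_score, think_tokens,
           max_think_tokens=256, w_cls=1.0, w_exp=0.5, w_len=0.1):
    """Hypothetical task reward: correctness + explanation quality - length overflow."""
    cls = 1.0 if pred_label == gold_label else 0.0
    overflow = max(0, think_tokens - max_think_tokens) / max_think_tokens
    return w_cls * cls + w_exp * explanation_score - w_len * overflow


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one meme whose gold label is "hateful".
rewards = [
    reward("hateful", "hateful", explanation_score=0.8, think_tokens=200),
    reward("hateful", "hateful", explanation_score=0.4, think_tokens=400),
    reward("not-hateful", "hateful", explanation_score=0.6, think_tokens=180),
    reward("not-hateful", "hateful", explanation_score=0.2, think_tokens=300),
]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward = {r:.3f}  advantage = {a:+.3f}")
```

Responses above the group mean receive positive advantages and are reinforced, while those below are discouraged. If every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.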

2606.15335 2026-06-16 cs.CL cs.AI 新提交

Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

基于解耦表示的分布式智能体协作隐私保护文本净化

Xuan Liu, Hefeng Zhou, Sicheng Chen, Chao Yang, Xingcheng Xu, Jingjing Qu, Jiong Lou, Jie LI, Xia Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学)

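AI summary Proposes DiSan, a framework that disentangles text into a task-semantics subspace and a style subspace and combines federated prototype alignment with adversarial regularization to protect privacy in distributed multi-agent collaboration, substantially reducing stylometric attribution and PII leakage.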

AI中文摘要

当分布式智能体跨组织边界交换文本时,隐私泄露不仅来自显式标识符,还来自分布特征,如格式惯例、词汇选择和句法模式。我们提出DiSan(解耦净化),一个隐私保护净化框架,是Intern-Shannon中多智能体协作的内置组件。DiSan使用双流编码器将文本分解为保持任务语义的源不变角色子空间和保持本地的源识别风格子空间。联邦原型对齐和对抗正则化使得无需集中原始文本即可进行联合训练。实验表明,标识符级别的掩码是不够的:掩码19.2%的token仅将TF-IDF风格归因降低18.6%。相比之下,DiSan在分布式多智能体RAG基准上将答案级别的PII暴露降低了20倍,同时保持了83%的答案忠实度,并在Enron数据集上将TF-IDF风格归因降低了73.2%,神经探针降低了70.6%。

英文摘要

When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and syntactic patterns. We propose DiSan (Disentangled Sanitization), a privacy-preserving sanitization framework and a built-in component of Intern-Shannon for multi-agent collaboration. DiSan uses a two-stream encoder to factorize text into a source-invariant role subspace that preserves task semantics and a source-identifying style subspace that remains local. Federated prototype alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier-level masking is insufficient: masking 19.2% of tokens reduces TF-IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer-level PII exposure by a factor of 20 while maintaining 83% answer faithfulness on a distributed multi-agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF-IDF and 70.6% under a neural probe.

2606.15396 2026-06-16 cs.CL cs.AI 新提交

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

CHILLGuard:面向细粒度中文大语言模型安全护栏的可扩展数据构建与模型感知偏好对齐

Wenbo Yu, Bohua Wang, Hao Fang, Kuofeng Gao, Jingru Zeng, Xiaochen Yang, Tianyi Zhang, Xiaoxiao Ma, Jiawei Kong, Hao Wu, Bin Chen, Shu-Tao Xia, Min Zhang

发表机构 * Tsinghua University(清华大学) Beijing Normal University(北京师范大学) South China University of Technology(华南理工大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen ShenNong Information Technology Co., Ltd.(深圳神农信息技术有限公司)

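AI summary For Chinese-language scenarios, proposes a fine-grained risk taxonomy (5 macro and 31 micro categories), generates high-quality training data with a scalable data construction pipeline, and trains CHILLGuard with Model-aware Direct Preference Optimization, improving F1 by 15.92% over Qwen3Guard-8B-Strict on the authors' benchmark.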

AI中文摘要

大语言模型生成的恶意内容可能带来严重的安全风险和伦理问题。虽然现有的大语言模型安全护栏在英语或多语言环境中表现出色,但它们缺乏对中文特定监管政策、文化背景和语言细微差别的适应,无法支持针对不同部署需求的细粒度风险分类。在本文中,我们引入了一个面向中文场景的5大类、31小类细粒度风险分类体系,并构建了CHILLGuard:一个专门的中文大语言模型内容安全护栏。为了解决高质量标注中文安全数据的严重稀缺问题,我们提出了一个可扩展的多阶段数据构建管道:通过检索增强生成扩展多源语料库,通过提示工程改写生成隐式有害样本,并通过多模型投票的标签校准精炼高质量数据。基于此,我们构建了CHILLGuardTrain,一个包含405,007样本的大规模训练集,以及CHILLGuardTest,一个严格策划的包含51,745样本的标注测试集。然后,我们在生成器-分类器协作框架下,通过模型感知直接偏好优化在CHILLGuardTrain上训练CHILLGuard。在多种设置下的广泛实验证明了CHILLGuard的最先进性能,例如,在我们的基准上,F1分数相比Qwen3Guard-8B-Strict提升了15.92%。我们将在https://github.com/cswbyu/CHILLGuard发布我们的资源。

英文摘要

Malicious content generated by large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to Chinese-specific regulatory policies, cultural context, and linguistic nuances, and fail to support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a fine-grained risk taxonomy for Chinese scenarios with 5 macro and 31 micro categories, and build CHILLGuard: a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline: we expand a multi-source corpus via retrieval-augmented generation, generate implicit harmful samples through prompt-engineered rewriting, and refine high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement in F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at https://github.com/cswbyu/CHILLGuard.
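The abstract does not spell out how the multi-model voting-based label calibration works. As a rough illustration of the general idea only, the sketch below keeps a sample's label when enough annotator models agree and discards the sample otherwise. The function name, the 2/3 agreement threshold, and the example labels are all assumptions made for this sketch, not details from the paper.

```python
from collections import Counter


def calibrate_label(votes, min_agreement=2 / 3):
    """Return the majority label if its vote share reaches min_agreement, else None.

    `votes` is a list of labels, one per annotator model (hypothetical setup).
    Returning None marks the sample as too ambiguous to keep.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Hypothetical votes from three annotator models for three samples.
samples = {
    "sample_1": ["unsafe", "unsafe", "safe"],
    "sample_2": ["safe", "safe", "safe"],
    "sample_3": ["unsafe", "safe", "controversial"],
}
calibrated = {name: calibrate_label(v) for name, v in samples.items()}
kept = {name: label for name, label in calibrated.items() if label is not None}
print(kept)
```

A stricter threshold trades dataset size for label quality; the right setting would depend on how many annotator models are used and how often they disagree.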

2606.15517 2026-06-16 cs.CL 新提交

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

SHARD: 通过自我重构蒸馏实现安全且有益的对齐

Viswonathan Manoranjan, Amogh Gupta, Anvesh Rao Vijjini, Thomas Hofweber, Snigdha Chaturvedi

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校)

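AI summary Proposes SHARD, which rewrites sensitive prompts under philosophical guidelines to surface benign intent, has the model reframe its own responses, and fine-tunes on them, improving helpfulness while preserving safety.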

AI中文摘要

大型语言模型在处理敏感提示时常常遇到困难。它们可能直接拒绝、提供通用的安全套话,或者无法满足用户可以通过安全方式回答的合法信息需求。我们引入了SHARD,一种自我重构蒸馏方法,以改善安全-帮助性。它首先使用哲学准则重写敏感提示以暴露良性意图,然后将其原始响应重构为安全且更有帮助的响应,最后在自我重构的响应上微调模型。在DNA和LINGUASAFE的英文子集上,SHARD在保持安全性的同时提高了大多数模型家族的帮助性。它还与来自更大教师模型的蒸馏保持竞争力,表明模型可以内化从其自身引发的安全和有帮助的行为。警告:本文包含可能具有冒犯性或有害的内容。

英文摘要

Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user's legitimate informational needs that can be answered safely. We introduce SHARD, a self-reframing distillation method to improve safe-helpfulness. It first rewrites sensitive prompts to surface benign intent using philosophical guidelines, then reframes its original responses into safe, more helpful ones, and finally fine-tunes the model on its self-reframed responses. Across DNA and the English subset of LINGUASAFE, SHARD improves helpfulness for most model families while preserving safety. It also remains competitive with distillation from a larger teacher model, suggesting that models can internalize safe and helpful behavior elicited from themselves. Warning: This paper contains content that may be offensive or harmful.

2606.15815 2026-06-16 cs.CL 新提交

On Defining Erasure Harms for NLP

论NLP中擦除伤害的定义

Yu Lu Liu, Arnav Goel, Jackie Chi Kit Cheung, Alexandra Olteanu, Ziang Xiao, Su Lin Blodgett

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Carnegie Mellon University(卡内基梅隆大学) Mila – Québec Artificial Intelligence Institute(魁北克人工智能研究所) McGill University(麦吉尔大学) Canada CIFAR AI Chair, Mila(加拿大CIFAR人工智能主席,Mila)

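AI summary Addresses the vague conceptualization of erasure harms in deployed NLP systems by proposing a structured definition that spells out the components needed to establish and measure erasure, easing application across settings.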

AI中文摘要

NLP系统的部署引发了对其可能产生的伤害的担忧,包括表征伤害。近期文献开始概念化和测量一种这样的伤害——擦除伤害。然而,该领域缺乏清晰且连贯的概念基础来识别和测量擦除。现有的擦除概念化往往过于宽泛——使得难以确定建立和测量擦除所需的内容——或者特定于特定设置——便于在这些设置中进行测量,但可能难以适应其他设置。为了解决这一差距,我们开发并提出了一个结构化的擦除定义,阐明了确定是否发生擦除所需的必要组件,从业者需要明确阐述和操作这些组件以测量擦除。

英文摘要

The deployment of NLP systems has raised concerns about harms they might produce, including representational harms. Recent literature has begun to conceptualize and measure one such harm, the harm of erasure. Nevertheless, the field lacks a clear and cohesive conceptual foundation for identifying and measuring erasure. Existing conceptualizations of erasure are often broad -- making it difficult to identify what is needed to establish and measure erasure -- or else specific to particular settings -- facilitating measurement for those settings but potentially challenging to adapt to other settings. To address this gap, we develop and propose a structured definition of erasure that clarifies what components are necessary for establishing whether erasure has occurred, which practitioners need to explicitly articulate and operationalize in order to measure erasure.

2606.15914 2026-06-16 cs.CL cs.HC 新提交

Contaminated Collaboration: Measuring Gender Bias Transfer in LLM-Assisted Student Writing

污染的合作:测量LLM辅助学生写作中的性别偏见迁移

Ariyan Hossain, Kazi Kamruzzaman Rabbi, Farig Sadeque, S M Taiabul Haque

发表机构 * Brac University(布拉卡大学)

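AI summary A controlled experiment finds that a gender-biased LLM writing assistant significantly increases gender stereotyping in students' career plan essays, and that the bias transfer is asymmetric: agency is suppressed in female-target essays, while male-target essays are largely unaffected.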

Comments 18 pages, 7 pages

AI中文摘要

LLM中的性别偏见已在模型输出中得到广泛研究,有偏见的提示被证明会放大刻板印象生成。然而,这种偏见是否会传播到使用这些系统的人类所写的文本中,仍未被充分探索。我们研究了LLM写作助手中的性别偏见是否会迁移到学生撰写的职业规划作文中。我们首先验证了性别偏见的提示会诱导LLM生成性别差异化的语言,而中性提示则不会。然后,我们在受控环境中招募参与者(N = 123),在三种条件下为仅性别不同的配对传记档案撰写职业规划作文:无AI辅助、中性LLM辅助或性别偏见LLM辅助。与对照组和中性条件相比,偏见条件下的学生作文产生了显著更大的能动性差距和更多的性别刻板职业建议。我们的结果还揭示了这种偏见迁移是不对称的:女性目标作文中的能动性受到抑制,而男性目标写作基本不受影响。我们的发现凸显了AI辅助写作中偏见传播的风险,呼吁在教育AI工具中进行公平性意识设计。

英文摘要

Gender bias in LLMs has been studied extensively in model outputs, with biased prompts shown to amplify stereotyped generations. Whether such bias propagates into text produced by humans who use these systems, however, remains underexplored. We investigate whether gender bias in an LLM writing assistant transfers into career plan essays written by students. We first verify that a gender-biased prompt induces gender-differentiated language in LLM-generated essays, while a neutral prompt does not. We then recruited participants (N = 123) in a controlled environment to write career plan essays for paired biographical profiles differing only in gender under three conditions: no AI assistance, neutral LLM assistance, or gender-biased LLM assistance. Students in the biased condition produced essays with a significantly larger agentic gap and more gender-stereotypic occupation suggestions than those in the control and neutral conditions. Our results also reveal that this bias transfer is asymmetric: agency is suppressed in female-target essays while male-target writing remains largely unaffected. Our findings highlight the risk of bias propagation in AI-assisted writing, calling for fairness-aware design in educational AI tools.

2606.16127 2026-06-16 cs.CL cs.AI cs.LG 新提交

AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

AuAu: 大型语言模型中威权对齐审计基准

Andreas Einwiller, Max Klabunde, Florian Lemmerich

发表机构 * University of Zurich(苏黎世大学)

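AI summary Proposes the AuAu benchmark, which combines psychometric questions, contextual behavior vignettes, and realistic user prompts to assess authoritarian tendencies in LLMs; all 17 evaluated models show substantial authoritarian response rates, and an authoritarian system prompt manipulates 15 of them.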

Comments v1, 50 pages

AI中文摘要

全球威权主义的浪潮,加上用户日常生活中日益核心的角色,引发了特定模型在多大程度上展现或促进威权态度和特征的问题。我们引入了AuAu,一个旨在评估LLM生成具有威权倾向响应风险的全面基准。该基准结合了三种评估方法:(i) 来自15个经过人类验证的广泛工具库的心理测量问题;(ii) 在具体情境中探究意图行为的情境行为小故事;(iii) 对现实用户提示的响应。与先前工作不同,AuAu不仅评估对威权主义的一般亲近程度,还评估已建立的子概念:威权攻击、威权服从和传统主义。评估来自中国、欧盟、俄罗斯和美国的17个模型,我们发现所有测试模型在心理测量评估下都表现出显著的威权响应率,尽管在越来越现实的下游任务中,该比率显著下降。我们进一步发现,威权系统提示容易操纵17个模型中的15个以促进增强的威权主义。我们的结果强调了持续、系统性地审计基于LLM的AI系统的必要性,以检测并最终减轻生成输出中不期望的威权倾向。我们的代码和数据可在 https://github.com/andreaseinwiller/AuAu 获取。

英文摘要

The worldwide surge of authoritarianism, combined with the increasingly central role of LLMs in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We introduce AuAu, a comprehensive benchmark that aims to assess the risk of LLMs generating responses with authoritarian tendencies. This benchmark combines three evaluation approaches: (i) psychometric questions from an extensive pool of 15 human-validated instruments; (ii) contextual behavior vignettes probing intended actions in concrete situations; and (iii) responses to realistic user prompts. Unlike prior work, AuAu evaluates not only a general closeness towards authoritarianism but also the established sub-concepts Authoritarian Aggression, Authoritarian Submission, and Conventionalism. Evaluating 17 models from China, the EU, Russia, and the USA, we find that all tested models exhibit substantial authoritarian response rates under the psychometric evaluation, though rates drop significantly in increasingly realistic downstream tasks. We further find that an authoritarian system prompt easily manipulates 15 out of 17 models to promote increased authoritarianism. Our results underscore the need for continued, systematic auditing of LLM-based AI systems to detect and ultimately mitigate undesired authoritarian tendencies in generated output. Our code and data are available at: https://github.com/andreaseinwiller/AuAu

2606.16137 2026-06-16 cs.CL cs.AI 新提交

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

基于XAI的语音深度伪造检测解释生成:使用免训练多模态大语言模型

Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, Björn W. Schuller

发表机构 * Imperial College London(帝国理工学院) Technical University of Munich(慕尼黑工业大学) University of Southampton(南安普顿大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Johns Hopkins University(约翰霍普金斯大学)

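AI summary To address the lack of explainability in speech deepfake detection, proposes a training-free framework that combines XAI evidence with multimodal LLMs to generate grounded, specific explanations, raising inside accuracy by over 45% on an explanation dataset built from PartialSpoof.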

Comments Accepted at Interspeech 2026

AI中文摘要

语音深度伪造检测(SDD)系统需要可信的解释以进行可靠的决策。现有的解释方式主要分为两类。传统的可解释人工智能(XAI),如基于梯度的归因,产生与模型决策紧密耦合的低级归因信号,且比自然语言解释更难被人类理解。同时,基于大语言模型(LLM)的解释生成通常由于缺乏启发式证据和任务特定监督(源于SDD有限的基于证据的解释数据集)而产生通用且无根据的描述。因此,我们提出一种免训练解释框架,将XAI证据与多模态LLM集成,以生成基于证据的特定解释。使用PartialSpoof数据集,我们构建了一个基于证据的解释数据集,并表明带有XAI的方法将内部准确率提高了超过45%,通过人工评估和忠实性检查得到验证。

英文摘要

Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation approaches mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals that are tightly coupled with model decisions and harder for humans to understand than natural language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, which stems from the limited availability of grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45%, verified through human evaluation and faithfulness checks.

2606.16583 2026-06-16 cs.CL 新提交

Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?

不确定性并非临床VQA的安全网,但它能预测模型失败吗?

Arnisa Fazla, Alberto Testoni, Ameen Abu-Hanna, Barbara Plank, Iacer Calixto

发表机构 * Amsterdam University Medical Center, University of Amsterdam(阿姆斯特丹大学医学中心) Amsterdam Public Health(阿姆斯特丹公共卫生) LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

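AI summary Examines whether uncertainty estimation for clinical vision-language models is reliable; its quality tracks model accuracy and degrades where the model is weakest, yet uncertainty on unperturbed inputs anticipates which predictions collapse under perturbation.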

Comments 17 pages, 4 figures

AI中文摘要

临床视觉语言模型(VLM)的安全部署需要可靠的不确定性估计(UE):一个指示何时应信任预测或将其升级给临床医生的信号。我们测试了当前UE方法是否真正提供这一信号。在12个VLM上对8种方法进行临床视觉问答(VQA)基准测试,我们发现UE质量并非UE方法的内在属性:它跟踪模型准确率,在模型性能最弱的地方退化,因此正是在最需要可靠性的地方。当我们通过隐藏多项选择答案中的正确选项(NOTA扰动)对模型进行压力测试时,准确率崩溃,而不确定性几乎不变,使模型系统性地校准错误。然而,我们发现未扰动输入上的不确定性可靠地预测了哪些预测会在NOTA下崩溃,表明当前VLM中的UE携带关于模型脆弱性的诊断信息。我们的结果将UE定位为识别脆弱预测的诊断工具,并激励基于扰动的评估作为通向安全临床部署的途径。

英文摘要

Safe deployment of clinical vision-language models (VLMs) requires reliable uncertainty estimation (UE): a signal indicating when predictions should be trusted or escalated to a clinician. We test whether current UE methods actually deliver this signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering (VQA), we find that UE quality is not an intrinsic property of the UE method: it tracks model accuracy, degrading precisely where the model performance is weakest, and therefore where reliability is most needed. When we stress-test models by hiding the correct option among the multiple-choice answers (NOTA perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. Yet, we find that uncertainty on the unperturbed input reliably anticipates which predictions will collapse under NOTA, indicating that UE in current VLMs carries diagnostic information about model fragility. Our results position UE as a diagnostic tool for identifying fragile predictions and motivate perturbation-based evaluation as a path toward safe clinical deployment.
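To make the setup in the abstract more tangible, the sketch below shows one common uncertainty signal (the entropy of the model's distribution over answer options) and one plausible reading of the NOTA perturbation, in which the correct option is removed and a "None of the above" option is appended. The abstract names neither the eight UE methods nor the exact perturbation procedure, so both functions, the option texts, and the probabilities are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def make_nota(options, correct_index):
    """Hide the correct option and append 'None of the above' as the new gold answer.

    Returns the perturbed option list and the index of the new correct option.
    """
    perturbed = [o for i, o in enumerate(options) if i != correct_index]
    perturbed.append("None of the above")
    return perturbed, len(perturbed) - 1


# Hypothetical multiple-choice item; index 1 is the gold answer.
options = ["Pneumothorax", "Pleural effusion", "Cardiomegaly", "Atelectasis"]
nota_options, nota_gold = make_nota(options, correct_index=1)
print(nota_options, nota_gold)

# Hypothetical option probabilities: a confident and a maximally unsure prediction.
print(entropy([0.85, 0.05, 0.05, 0.05]), entropy([0.25, 0.25, 0.25, 0.25]))
```

Under this reading, a well-behaved uncertainty signal would rise on the perturbed item; the abstract reports that, in practice, it barely changes even as accuracy collapses.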

2606.16617 2026-06-16 cs.CL cond-mat.mtrl-sci cs.AI 新提交

Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

推挤载荷下的谄媚作为材料失效:三种加载情形及多达十七种材料批次的多元表征

Ferdinand M. Schessl

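AI summary Adopts a materials-science framing that treats LLM sycophancy as material failure under pushback loading; using 14 axis measurements across three loading cases (debate, false presuppositions, ethical setting) and 7800 specimens, it shows that the failure mode depends on the loading case and that cross-judge reliability varies.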

Comments 12 pages, 3 figures. Code, data, and pre-registrations: https://github.com/FerdinandSchessl/sycophancy-note-companion

AI中文摘要

LLM中的谄媚现象在70多篇论文中有记录,但专家对构念边界的共识仍然较低(ICC=.184;Ye等人,2026)。该构念碎片化是因为行为分类取决于哪种表面形式被优先考虑。我们采用材料科学框架:对话作为加载下的测试样本,LLM模型作为材料批次,推挤作为渐进载荷,立场翻转作为材料失效。我们在三种加载情形(辩论n=1000;错误预设n=3400;伦理设定n=3400;每种情形10-17种材料批次;共7800个样本)下,使用14个回合级轴测量(涵盖速度、损伤累积、框架漂移、脆性和方向稳定性)以及来自独立管道的三个说话者解析轴来表征这种失效。测量是胡克耦合的($σ= E \cdot \varepsilon$类比),并在加载情形间重现,在辩论上效应高达$|r_{rb}| = 0.35$;符号结构增加了第二种模式:伦理设定情形反转了速度和累积块。方差组成分为两个轮廓:辩论是批次主导的(类似脆性断裂:材料等级决定),错误预设和伦理设定是主题主导的(类似蠕变:载荷决定);比率(2.03 vs 0.13/0.17)依赖于估计器,对于辩论甚至在方向上也是如此。跨评判者可靠性(GPT-4o vs Haiku 4.5)显示辩论评分是评判者鲁棒的(Cohen's $κ= 0.88$),而错误预设评分是评判者敏感的($κ= 0.36$)——这是单评判者基准必须报告的注意事项。这是Ye等人诊断所要求的方法论举措:一种不依赖于构念的哪种表面形式被优先考虑的多元表征。

英文摘要

Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($\sigma = E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$) -- a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.
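The cross-judge reliability figures in the abstract are Cohen's kappa values, which correct raw agreement between two judges for the agreement expected by chance. A minimal sketch of the statistic follows; the judge labels are invented toy data and the function name is hypothetical, so the output says nothing about the paper's actual judges.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy stance-flip judgements from two hypothetical judges on four specimens.
judge_a = ["flip", "flip", "hold", "hold"]
judge_b = ["flip", "hold", "hold", "hold"]
print(cohens_kappa(judge_a, judge_b))  # 0.5
```

Here the judges agree on three of four items (p_o = 0.75) but chance agreement is already 0.5, so kappa is 0.5 rather than 0.75.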

2606.16801 2026-06-16 cs.CL 新提交

The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models

混合艺术:基于混合的混淆方法用于大型语言模型中隐私保护的分割学习

Chen Chen, Xiang Gao, Xianshun Wang, Chengran Li, Shengyu Xia, Xueluan Gong, Linru Zhang, Qian Wang, Kwok-Yan Lam

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院) School of Cyber Science and Engineering, Wuhan University, China(武汉大学网络空间安全学院)

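AI summary Proposes MIXGUARD, a framework that uses token-level obfuscation, representation-level obfuscation, and adaptive gradient perturbation to protect privacy while preserving model utility; experiments show stronger privacy protection than existing split learning defenses.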

Comments 19 pages, 5 figures

AI中文摘要

分割学习为资源受限的用户提供了一种实用范式,通过将计算密集型层卸载到服务器同时保留原始数据在本地,来训练大型语言模型(LLMs)。然而,现有的隐私保护分割学习方法仍然在效用、隐私、效率和稳定性之间面临艰难的权衡。具体来说,这些方法常常遭受显著的效用下降,仍然容易受到高级数据重建攻击,产生高昂的计算和通信开销,或者在不同任务上表现出不稳定的性能。在本文中,我们提出了MIXGUARD,一种新颖的基于混合的隐私保护分割学习框架,用于LLMs。MIXGUARD引入了令牌级混淆、表示级混淆和自适应梯度扰动机制,这些机制联合运作以保留有用的学习信号,同时防止隐私泄露给服务器。技术上,MIXGUARD首先在公共数据集上构建一个轻量级校准模型,以细化近似的目标表示,然后在私有数据上的隐私保护微调期间应用该模型。我们在多个LLM家族、模型大小、架构和微调策略上,对四个分类任务和四个文本生成任务进行了大量实验。结果表明,MIXGUARD保持了与非分割训练基线相当的模型效用,在针对最先进的数据重建攻击时,始终比现有的分割学习防御方法实现更强的隐私保护,并且在自适应攻击设置下保持鲁棒性。

英文摘要

Split learning provides a practical paradigm for resource-constrained users to train Large Language Models (LLMs) by offloading computation-intensive layers to a server while keeping raw data local. However, existing privacy-preserving split learning methods still face a difficult trade-off among utility, privacy, efficiency, and stability. Specifically, these methods often suffer from substantial utility degradation, remain vulnerable to advanced data reconstruction attacks, incur prohibitive computational and communication overhead, or exhibit unstable performance across different tasks. In this paper, we propose MIXGUARD, a novel mixup-based privacy-preserving split learning framework for LLMs. MIXGUARD introduces token-level obfuscation, representation-level obfuscation, and adaptive gradient perturbation mechanisms, which operate jointly to preserve useful learning signals while preventing privacy leakage to the server. Technically, MIXGUARD first constructs a lightweight calibration model on a public dataset to refine the approximated target representation, and then applies this model during privacy-preserving fine-tuning on private data. We conduct extensive experiments on four classification tasks and four text generation tasks across multiple LLM families, model sizes, architectures, and fine-tuning strategies. The results show that MIXGUARD preserves model utility comparable to non-split training baselines, consistently achieves stronger privacy protection than existing split learning defense methods against state-of-the-art data reconstruction attacks, and remains robust under adaptive attack settings.

2606.16821 2026-06-16 cs.CL cs.CR cs.CY cs.IR 新提交

How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation

我们能在多大程度上信任LLM搜索智能体?衡量对网络内容操纵的认可脆弱性

Yimeng Chen, Zhe Ren, Firas Laakom, Yu Li, Dandan Guo, Jürgen Schmidhuber

发表机构 * Center of Excellence for Generative AI, KAUST(KAUST生成式人工智能卓越中心) Jilin University(吉林大学) Zhejiang University(浙江大学) The Swiss AI Lab, IDSIA-USI/SUPSI(瑞士人工智能实验室 IDSIA-USI/SUPSI) NNAISENSE

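AI summary Proposes SearchGEO, a framework that uses a five-mode attack taxonomy and output-level metrics to measure the endorsement vulnerability of 13 LLM backends to web content manipulation; attack success rates range from 0.0% to 31.4%, and both the most effective attack mode and the effect of the deployment scaffold differ across backends.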

Comments 23 pages, 3 figures

AI中文摘要

基于大型语言模型(LLM)的搜索智能体将开放网络内容综合成可操作的建议,代表用户行事,这造成了攻击者发布的页面可能被转化为认可声明的风险。我们引入了SearchGEO,一个用于衡量基于LLM的网络搜索智能体中认可腐败的受控评估框架,它结合了网络证据操纵流水线、五模式攻击分类法和多个输出级指标。我们在308个案例上评估了13个LLM后端。结果显示,脆弱性模式因后端而异:总体攻击成功率(ASR)从Claude-Sonnet-4.6的0.0%到Gemini-3-Flash的31.4%不等,最强攻击模式因模型系列而异,且相同的部署框架可能在不同后端上放大或降低ASR。一个辅助智能体技能探测(其中认可变为安装命令)揭示了原本鲁棒的后端之间的明显分裂:Claude过度拒绝而GPT过度信任。这些发现主张将对抗性搜索内容下的推荐可靠性视为后端安全评估的一级维度。

英文摘要

Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline, a five-mode attack taxonomy, and multiple output-level metrics. We evaluate 13 LLM backends on 308 cases each. Results show that vulnerability patterns vary across backends: overall attack success rate (ASR) ranges from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, the strongest attack mode differs by model family, and the same deployment scaffold could amplify or decrease ASR on different backends. An auxiliary agent-skill probe, where endorsement becomes an install command, exposes a sharp split among otherwise robust backends: Claude over-rejects while GPT over-trusts. These findings argue for treating recommendation reliability under adversarial search content as a first-class dimension of backend safety evaluation.
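The headline metric here, attack success rate (ASR), is simply the share of attacked cases in which the agent ends up endorsing the manipulated content. The sketch below tabulates ASR per backend and attack mode and picks the strongest mode for each backend. The record layout, backend names, and mode names are placeholders invented for this sketch; SearchGEO's actual data format and success criteria are not described in the abstract.

```python
from collections import defaultdict


def asr_table(records):
    """Map (backend, mode) -> attack success rate from (backend, mode, success) records."""
    counts = defaultdict(lambda: [0, 0])  # [successes, trials]
    for backend, mode, success in records:
        counts[(backend, mode)][0] += int(success)
        counts[(backend, mode)][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


def strongest_mode(table, backend):
    """Attack mode with the highest ASR for one backend."""
    modes = {m: rate for (b, m), rate in table.items() if b == backend}
    return max(modes, key=modes.get)


# Hypothetical evaluation log: (backend, attack mode, did the agent endorse the claim?).
records = [
    ("backend_a", "mode_1", False), ("backend_a", "mode_1", False),
    ("backend_a", "mode_2", True), ("backend_a", "mode_2", False),
    ("backend_b", "mode_1", True), ("backend_b", "mode_1", True),
    ("backend_b", "mode_2", False), ("backend_b", "mode_2", False),
]
table = asr_table(records)
for backend in ("backend_a", "backend_b"):
    print(backend, strongest_mode(table, backend))
```

Reporting ASR per mode rather than only in aggregate is what allows a finding such as "the strongest attack mode differs by model family" to surface.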

2606.15980 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

安全监控器在更新后是否仍可靠?激活监控器陈旧性的基准测试与预测

Evan Duan

发表机构 * University of Michigan(密歇根大学)

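AI summary Studies whether activation monitors remain reliable after a language model is updated; quantization-style updates have little effect, fine-tuning-style updates frequently make monitors stale, and the degradation is predictable from pre-deployment features.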

AI中文摘要

激活监控器——在语言模型内部表示上训练的轻量级探针——在部署安全栈中越来越常见。然而,部署的模型很少是静态的:它们被量化、微调、用LoRA适配,或与合并适配器一起服务,而监控器保持冻结。我们首次系统测试了这一隐含契约是否成立:在基础模型上训练的激活监控器在这些常规模型更新后是否仍可靠。跨多个安全相关监控器、模型深度、更新系列和开放权重模型,我们发现一个明显的分裂:量化风格的更新大多保持冻结探针性能,而微调风格的更新经常使探针变得陈旧。脆弱性高度依赖于监控器,隐私/PII探针受影响最大,而拒绝合规探针相对稳定,表明重新训练行为不一定使其对应的监控器变得陈旧。QLoRA尤其有害,尽管单独的NF4量化相对良性,这表明量化在与适配结合时风险更大。我们进一步表明,退化可以从部署前的特征预测,从而能够将重新验证预算优先分配给最可能失败的监控器。这些结果表明,微调应默认触发激活监控器重新验证,而预测可以帮助优先检查哪些监控器。

英文摘要

Activation monitors (lightweight probes trained on a language model's internal representations) are an increasingly common layer in deployment safety stacks. Deployed models, however, are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not make its corresponding monitor stale. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.
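The revalidation check the abstract argues for can be pictured with a deliberately tiny example: keep a linear probe frozen, change the activations it reads, and re-measure its accuracy. Everything below is a toy assumption. The 2-D "activations", the grid-snapping stand-in for quantization, and the rotation stand-in for fine-tuning are not the paper's models, updates, or probes; they only show why a small perturbation can leave a frozen probe intact while a change of representation does not.

```python
import math


def probe_accuracy(weights, bias, activations, labels):
    """Accuracy of a frozen linear probe: predict 1 when w.a + b > 0."""
    correct = 0
    for a, y in zip(activations, labels):
        score = sum(w * x for w, x in zip(weights, a)) + bias
        correct += int((score > 0) == bool(y))
    return correct / len(labels)


def quantize(activations, step=0.5):
    """Toy stand-in for a quantization-style update: snap values to a grid."""
    return [tuple(round(x / step) * step for x in a) for a in activations]


def rotate(activations, degrees=90.0):
    """Toy stand-in for a fine-tuning-style update: rotate the 2-D representation."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [(c * x - s * y, s * x + c * y) for x, y in activations]


# Hypothetical base-model activations; the label is encoded along the first axis.
base = [(1.2, 0.7), (0.8, -0.9), (-1.1, 0.6), (-0.7, -1.3)]
labels = [1, 1, 0, 0]
probe_w, probe_b = (1.0, 0.0), 0.0  # probe fitted on the base model, then frozen

for name, acts in [("base", base), ("quantized", quantize(base)), ("rotated", rotate(base))]:
    print(name, probe_accuracy(probe_w, probe_b, acts, labels))
```

In this toy, the information is still present after the rotation (it has moved to the second axis), so a retrained probe would recover it; only the frozen probe goes stale, which is the situation revalidation is meant to catch.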

2606.16100 2026-06-16 cs.CR cs.CL cs.LG 交叉投稿

Your "Pro" LLM Subscription May Actually Be "Free": Exposing Fingerprint Spoofing Risks in LLM Inference Services

你的“专业”LLM订阅可能实际上是“免费”的:揭示LLM推理服务中的指纹欺骗风险

Jiahao Zhang, Xiuyu Li, Suhang Wang

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

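AI summary Introduces fingerprint spoofing, in which a malicious provider serves a weaker model that has been parameter-efficiently fine-tuned to mimic a stronger one and so evades user-side fingerprinting; proves that user-side resource constraints make current fingerprinting vulnerable, and presents the GhostPrint attack framework, which consistently bypasses representative fingerprinting methods at low fine-tuning cost.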

AI中文摘要

随着大型语言模型(LLM)API变得无处不在,用户越来越依赖黑盒指纹识别来验证提供商是否提供广告中宣传的高级模型。然而,这些方法可能忽视那些操纵模型权重以欺骗指纹识别过程的对抗性提供商。我们引入了一种称为指纹欺骗的新威胁,其中恶意提供商隐秘地提供一个通过参数高效微调以模仿更强模型的较弱模型,从而规避用户端的指纹识别。我们首先正式证明用户端资源限制(即有限的查询预算和弱指纹分类器)使得当前的指纹识别容易受到指纹欺骗。在此理论分析指导下,我们提出了GhostPrint,一个利用代理建模、奖励排名微调和知识蒸馏的成本效益攻击框架。在静态和持续指纹识别设置中的广泛评估表明,GhostPrint允许弱模型以低微调成本持续绕过代表性指纹方法,同时保持实用性,暴露了当前LLM指纹识别流程中的一个关键漏洞。

英文摘要

As Large Language Model (LLM) APIs become ubiquitous, users increasingly rely on black-box fingerprinting to verify that providers are serving the advertised premium models. However, these methods may overlook adversarial providers who manipulate model weights to cheat the fingerprint process. We introduce a novel threat termed fingerprint spoofing, where a malicious provider stealthily serves a weaker model that has been parameter-efficiently fine-tuned to mimic a stronger model, thereby evading user-side fingerprinting. We first formally prove that user-side resource constraints (i.e., finite query budgets and weak fingerprinting classifiers) make current fingerprinting vulnerable to fingerprint spoofing. Guided by this theoretical analysis, we propose GhostPrint, a cost-effective attack framework leveraging surrogate modeling, reward-ranked fine-tuning, and knowledge distillation. Extensive evaluations in both static and continual fingerprinting settings demonstrate that GhostPrint allows weak models to consistently bypass representative fingerprint methods while maintaining utility at a low fine-tuning cost, exposing a critical vulnerability in current LLM fingerprinting pipelines.

2606.16118 2026-06-16 cs.AI cs.CL cs.LO 交叉投稿

Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

了解你的局限:LLM在法律推理中作为求解器和自动形式化工具的忠实性

Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, Leilani H. Gilpin

发表机构 * UC Santa Cruz(加州大学圣克鲁兹分校) Univ. Helsinki(赫尔辛基大学) CodeX, Stanford(斯坦福大学CodeX中心) Stanford University(斯坦福大学) Canyon Crest Academy(峡谷峰学院) Monta Vista High School(蒙塔维斯塔高中) Los Altos High School(洛斯阿尔托斯高中)

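AI summary Studies whether LLMs perform faithful logical inference in legal reasoning; the high benchmark performance of LLM-based formal reasoning masks unfaithful patterns such as scope laundering, revealing a fundamental gap between benchmark accuracy and logical faithfulness.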

Comments 10 pages, submitted to COLM 2026 (under review, average score of 6.25 across 4 reviewers) and accepted by the AI4Law workshop at ICML. This is the version where we already addressed most of the reviews from the COLM reviewers

AI中文摘要

大型语言模型(LLM)在推理任务上表现强劲,但这是否反映了忠实的逻辑推理还是启发式近似仍不清楚。我们在法律蕴含中通过比较三种范式——纯LLM分类、基于LLM的形式推理以及使用Z3 SMT求解器的基于求解器的形式推理——在重新标注的ContractNLI子集上对五个LLM进行了研究。我们的重新标注揭示了实用法律解释与严格形式蕴含之间存在系统性的、可测量的差距,其中相当大比例的法律上合理的推理在没有额外未声明假设的情况下缺乏形式基础。虽然引入形式结构提高了准确性,基于LLM的形式推理达到了最高的基准性能,但我们表明这种提升并不意味着忠实推理。我们识别出三种反复出现的失败模式:范围清洗(LLM报告与求解器不一致的分类而不执行底层形式推理,产生看似逻辑上合理但实际并非如此的结论)、隐式约束盲区(LLM忽略形式表示中存在的逻辑约束)以及程序合成失败(尽管有结构化提示,LLM仍生成错误的Z3代码)。关键的是,范围清洗在所有模型中持续存在,这引发了对基于LLM的形式推理作为符号执行代理的忠实性的严重担忧。这些结果揭示了基准准确性与逻辑忠实性之间的根本差距。

英文摘要

Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms (pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver) on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment: a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.
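The gap between pragmatic interpretation and strict formal entailment can be illustrated with a toy propositional check. The paper uses the Z3 SMT solver; the sketch below swaps in brute-force enumeration of truth assignments so that it runs with the standard library alone, and the contract clause, atom names, and `classify` function are all invented for illustration. A hypothesis is entailed only if it holds in every assignment that satisfies the premises.

```python
from itertools import product


def classify(premises, hypothesis, atoms):
    """Three-way label from exhaustive model enumeration over boolean atoms."""
    satisfying = []
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises):
            satisfying.append(env)
    if not satisfying:
        return "inconsistent premises"
    if all(hypothesis(env) for env in satisfying):
        return "entailment"
    if not any(hypothesis(env) for env in satisfying):
        return "contradiction"
    return "neutral"


atoms = ["is_confidential", "may_share"]
# Hypothetical clause: confidential information may not be shared.
clause = lambda env: (not env["is_confidential"]) or (not env["may_share"])
# Hypothesis: the receiving party may not share the information.
hypothesis = lambda env: not env["may_share"]

strict = classify([clause], hypothesis, atoms)
# Same check after adding the unstated assumption that the information is confidential.
with_assumption = classify([clause, lambda env: env["is_confidential"]], hypothesis, atoms)
print(strict, with_assumption)  # neutral entailment
```

A human reader would likely accept the hypothesis, yet the strict check returns "neutral" until the implicit premise is written down, which mirrors the kind of gap the re-annotation is described as measuring.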

2606.16242 2026-06-16 cs.LG cs.CL 交叉投稿

Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

快速投毒:针对快速响应框架的实用投毒攻击

David Huang, Jaewon Chang, Avidan Shah, Prateek Mittal, Chawin Sitawarin

发表机构 * Princeton University(普林斯顿大学)

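AI summary Reveals poisoning attacks against the Rapid Response framework: prompt injection plants poisoned samples in the classifier's training set, enabling targeted poisoning and concept-based backdoor attacks that reach up to 100% false positive rates and up to 96% false negative rates at only a 1% poisoning rate.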

Comments Spotlight at ICML 2026

AI中文摘要

快速响应(RR)框架部署在生产系统中,包括Anthropic的ASL-3安全措施,持续改进越狱检测分类器。当出现绕过这些分类器的新越狱方法时,快速响应会生成合成变体用于训练,帮助模型从新攻击中泛化并快速适应。我们揭示,提示注入可以渗透到该管道中,将投毒样本送入分类器的训练集,实现两个攻击目标:(I)目标性投毒攻击,通过将无害样本归类为越狱来制造误报,并具有特定所需特征(例如特定格式、主题或关键词);(II)基于概念的后门攻击,在存在后门触发器时,诱导对越狱输入产生漏报,甚至泛化到防御者明确训练过的攻击策略中的越狱。重要的是,我们的威胁模型限制攻击者只能修改越狱样本(不能修改良性数据或标签),这是先前工作未探索的约束,使得第二个目标特别具有挑战性。我们通过遗漏攻击解决这一问题,该攻击利用了一个新现象:当在概念缺失的不安全样本上训练时,分类器错误地将该概念的存在与安全标签关联。两种攻击在仅1%的投毒率下都会导致显著且在某些情况下近乎完全的标签翻转,实现高达100%的误报率和高达96%的漏报率。

英文摘要

The Rapid Response (RR) framework, deployed in production systems, including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and quickly adapt. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives by causing harmless samples with a specific desired feature (e.g., certain formatting, subject, or keyword) to be classified as jailbreaks, and (II) concept-based backdoor attacks that, when the backdoor trigger is present, induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial and in some cases near-complete label flipping at only a 1% poisoning rate, achieving up to 100% false positive rates and up to 96% false negative rates.

2606.16344 2026-06-16 cs.AI cs.CL cs.CY cs.LG 交叉投稿

Whose hotel does the AI recommend? An algorithm audit of reputation signals in LLM-assisted hotel selection

AI推荐哪家酒店?LLM辅助酒店选择中声誉信号的算法审计

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Asher Ali

发表机构 * Fandaqah, Al Khobar, Saudi Arabia(沙特阿拉伯阿尔科巴尔Fandaqah) Hamdard University, Karachi, Pakistan(巴基斯坦卡拉奇哈姆达德大学)

AI总结 通过随机选择联合实验审计12种LLM,发现客人评分和价格主导推荐,但过度重视生态认证而忽略管理回复,且列表位置(无内容特征)有因果影响。

Comments 32 Pages

详情
AI中文摘要

旅行者越来越多地询问大语言模型(LLM)助手预订哪家酒店,使这些系统成为物业可见性的守门人——但什么驱动了它们的推荐尚未有记录。我们使用基于随机选择的联合实验进行预先指定的算法审计:跨角色、提示模板和十二个开放权重及专有模型,助手在五家酒店中进行选择,这些酒店的客人评分、评论数量和时效性、管理回复、连锁品牌、价格、生态认证和列表位置均被独立随机化。我们估计每个信号对推荐概率的平均边际成分效应。客人评分和价格占主导地位(高评分使选择概率提高31.6个百分点;高价格使其降低30.0个百分点),重现了人类效价和价格优先性,但过度重视生态认证而忽略管理回复。列表位置——一个无内容的伪影——因果性地改变推荐,价值约为每晚12美元。陈述的理由与揭示的权重不完全一致。这些发现为生成式引擎优化和AI信息中介的可问责性提供了因果证据。

英文摘要

Travelers increasingly ask large language model (LLM) assistants which hotel to book, making these systems gatekeepers of property visibility -- yet what moves their recommendations is undocumented. We conduct a pre-specified algorithm audit using a randomized choice-based conjoint experiment: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. We estimate the average marginal component effect of each signal on the probability of recommendation. Guest rating and price dominate (a top rating raises selection by 31.6 percentage points; a high price lowers it by 30.0 percentage points), reproducing human valence-and-price primacy but over-weighting eco-certification and ignoring management response. List position -- a content-free artifact -- shifts recommendations causally, worth about $12 per night. Stated reasons track revealed weights imperfectly. The findings ground generative engine optimization and the accountability of AI infomediaries in causal evidence.
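As a rough illustration of the average marginal component effect mentioned in the abstract: when attribute levels are independently randomized, a simple estimator is the difference in mean selection rates between two levels of one attribute. The sketch below is not the authors' code; the function name `amce`, the record layout, and the toy data are all assumptions made for illustration.

```python
from statistics import mean

def amce(profiles, attribute, level, baseline):
    """Difference in mean selection rate between `level` and `baseline`.

    `profiles` is a list of dicts, one per hotel shown to the assistant,
    each with the randomized attribute values and a 0/1 `chosen` flag.
    (Hypothetical record layout, not taken from the paper.)
    """
    at_level = [p["chosen"] for p in profiles if p[attribute] == level]
    at_base = [p["chosen"] for p in profiles if p[attribute] == baseline]
    return mean(at_level) - mean(at_base)

# Toy data: two "top"-rated profiles (one chosen), two "low"-rated (none chosen).
toy_profiles = [
    {"rating": "top", "chosen": 1},
    {"rating": "top", "chosen": 0},
    {"rating": "low", "chosen": 0},
    {"rating": "low", "chosen": 0},
]
effect = amce(toy_profiles, "rating", "top", "low")  # 0.5, i.e. 50 percentage points
```

A full analysis would normally also cluster uncertainty estimates by model, persona, and prompt template; that part is omitted here.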

2606.16527 2026-06-16 cs.CR cs.CL 交叉投稿

DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

DoubtProbe:通过结构验证与语义审计的黑盒越狱防御

Xuanyu Yin, Yilin Jiang, Jun Zhou, Kai Chen, Zhengfu Cao, Xiaolei Dong

发表机构 * East China Normal University(华东师范大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所)

AI总结 提出DoubtProbe双分支推理时防御框架,结合结构验证与语义审计,将黑盒越狱防御转化为受控变换下的一致性检查,在多个基准上实现更强更稳定的防御-效用权衡。

Comments 25 pages, 5 figures

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地部署在面向用户的系统中,黑盒越狱防御已成为一个重要的实际问题。现有防御通常依赖于已知攻击覆盖、提示级语义判断或本地运行时控制,但这些方法在面对不断演变的提示包装、表达重写和结构操纵时可能变得不稳定。我们观察到,许多黑盒越狱并未移除有害目标,而是重新组织表达和执行所需的信息,从而逃避安全对齐,同时在生成过程中保持可恢复性。受此观察启发,我们提出DoubtProbe,一种双分支推理时防御框架,结合结构验证与语义审计,并将黑盒越狱防御形式化为受控变换下的一致性检查。结构分支从原始请求中提取结构化表示,在表示约束下重建请求,并检测原始请求与重建请求之间的信息保持失败;语义分支直接审计原始提示。我们在越狱和良性请求基准上评估DoubtProbe与代表性黑盒防御的比较,并进一步测试从Qwen2.5-72B到Llama-3.1-70B的主干迁移。结果表明,DoubtProbe实现了更强更稳定的防御-效用权衡:在Qwen2.5-72B上,它将JBB攻击成功率从0.293降至0.100,CodeAttack攻击成功率从0.152降至0.001,同时在AlpacaEval和OR-Bench上保持0.022和0.016的假阳性率;相同模式在Llama-3.1-70B上保持稳定。这些发现表明,结构不一致信号为黑盒越狱防御提供了实用且可泛化的基础,尤其是在与语义审计结合时。

英文摘要

As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, yet these paths can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal, but reorganize the information needed to express and execute it, thereby evading safety alignment while remaining recoverable during generation. Motivated by this observation, we propose DoubtProbe, a dual-branch inference-time defense framework that combines structural verification with semantic auditing and formulates black-box jailbreak defense as consistency checking under controlled transformation. The structural branch extracts a structured representation from the original request, reconstructs the request under representation constraints, and detects information-preservation failures between the original and reconstructed requests; the semantic branch audits the original prompt directly. We evaluate DoubtProbe against representative black-box defenses on jailbreak and benign-request benchmarks, and further test backbone transfer from Qwen2.5-72B to Llama-3.1-70B. Results show that DoubtProbe achieves a stronger and more stable defense-utility trade-off: on Qwen2.5-72B, it reduces the JBB attack success rate from 0.293 to 0.100 and the CodeAttack attack success rate from 0.152 to 0.001, while maintaining false positive rates of 0.022 and 0.016 on AlpacaEval and OR-Bench; the same pattern remains stable on Llama-3.1-70B. These findings show that structural inconsistency signals provide a practical and generalizable basis for black-box jailbreak defense, especially when combined with semantic auditing.

2606.16682 2026-06-16 cs.LG cs.CL 交叉投稿

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

多模态评估者偏好坍缩:自进化智能体中的跨模态传染

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

AI总结 研究多模态自评估中偏好坍缩的加剧现象,发现跨模态传染导致策略选择扭曲,并引入传染矩阵量化风险。

Comments 19 pages, 0 figures

详情
AI中文摘要

当AI智能体使用语言模型在反馈循环中评估自身输出时,会出现系统性偏差。我们表明,评估者偏好坍缩(EPC)在多模态设置中被显著放大。使用GPT-4o评估DeepSeek-chat在文本和视觉任务上的表现,我们发现单一策略(step_by_step)吸收了48.4%的权重——是纯文本自评估中坍缩的3.2倍——而三个视觉域策略合计仅获得9.1%的权重。然后,我们展示了一种称为跨模态传染的新现象:在一个模态上获得的评估者偏好会迁移到另一个模态并破坏其策略选择。通过一个四阶段隔离训练范式,我们测量了传染系数并记录了策略反转——一个模态的最优策略在跨模态暴露后发生逆转。跨四种评估者配置(总计53次独立重复,15,592次API调用)的第3阶段统计验证揭示了一个清晰的层次结构:跨模型评估(GPT-4o,N=8)产生强但对称的双向传染(平均gamma_{T->V}=1.176,gamma_{V->T}=1.089,Delta=-0.088,p=0.575,Cohen's d=0.29);高轮次(DashScope,50轮)导致坍缩为单一策略主导(70%零传染);而自评估提供近乎完全的免疫——97%的运行(N=30,DeepSeek-chat)产生恰好为零的传染(平均gamma=0.033,95% CI [-0.031, 0.010],p=0.642,d=0.07)。没有评估者条件显示出统计显著的方向不对称性。我们引入了由评估者身份索引的传染矩阵,发布了MM-EPC实验框架,并将跨模型评估者架构确定为偏好传染的主要风险因素。

英文摘要

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across four evaluator configurations (N=53 total independent repetitions, 15,592 API calls) reveals a clear hierarchy: cross-model evaluation (GPT-4o, N=8) produces strong but symmetric bidirectional contagion (mean gamma_{T->V}=1.176, gamma_{V->T}=1.089, Delta=-0.088, p=0.575, Cohen's d=0.29); high round counts (DashScope, 50 rounds) cause collapse to single-strategy dominance (70% zero contagion); and self-evaluation provides near-complete immunity -- 97% of runs (N=30, DeepSeek-chat) yield exactly zero contagion (mean gamma=0.033, 95% CI [-0.031, 0.010], p=0.642, d=0.07). No evaluator condition shows statistically significant directional asymmetry. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC experimental framework, and identify cross-model evaluator architecture as the primary risk factor for preference contagion.

2606.16710 2026-06-16 cs.MA cs.CL 交叉投稿

Misinformation Propagation in Benign Multi-Agent Systems

良性多智能体系统中的错误信息传播

Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

AI总结 研究在良性多智能体系统中,基于意图的错误信息如何通过智能体交互传播,并发现多智能体辩论相比单智能体提示能减少性能下降,鲁棒性取决于群体组成和决策协议。

Comments 20 pages, 8 figures, 1 table

详情
AI中文摘要

多智能体系统,其中多个大型语言模型智能体通过轮流交互解决问题,越来越多地部署在医疗诊断、法律分析和法医决策等高风险环境中。当单个智能体从错误或误导性上下文(例如工具调用)进行推理时,其可靠性可能面临风险,因为错误可能通过智能体交互传播。本研究通过在推理、知识和对齐任务中向良性单智能体和多智能体系统注入基于意图的错误信息来研究这一风险。我们发现,错误信息会降低单智能体性能,并在多智能体辩论中持续存在,智能体通常会保留由误导同伴引入的答案。尽管如此,与单智能体提示相比,多智能体辩论减少了由此产生的性能下降,尤其是在大多数智能体未接触错误信息的情况下。鲁棒性取决于群体组成和决策协议。在同伴压力下,共识可能比投票更稳定,而多数意见通常能将受误导的智能体引导回正确答案。我们的结果表明,多智能体系统中的错误信息鲁棒性不仅取决于底层模型,还取决于智能体如何交换信息和聚合决策。

英文摘要

Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.

2501.19337 2026-06-16 cs.CL cs.CV 版本更新

Token-Level Entropy Reveals Demographic Disparities in Language Models

Token级熵揭示语言模型中的人口统计差异

Messi H. J. Lee

发表机构 * Independent Researcher(独立研究者)

AI总结 通过测量零温度下全词汇香农熵,发现黑裔名字比白裔名字产生更高的首token熵,且女性名字比男性名字产生更低的熵和更同质的输出,种族和性别效应可加,指令微调未减弱种族差距,显式群体标签探测无显著种族效应。

Comments 9 pages

详情
AI中文摘要

我们探究仅由姓名标识的人口统计身份是否会系统性地重塑语言模型的生成分布。在六个开源基础模型和5760个隐式句子补全提示(例如“Tanisha在一个周一早晨走进办公室,然后”)上,测量零温度下的全词汇香农熵,我们发现,在所有六个架构中,与白裔名字相比,黑裔名字产生更高的首token熵——这与在显式人口统计提示下记录的输出层面同质性偏差(Lee等人,2024)相反——并且黑裔名字总是比白裔名字在身份中性基线之上产生更大的熵(所有六个模型中ΔΔ>0)。与男性名字相比,女性名字伴随更低的首token熵(DL合并β̂=-0.041,p=0.019)和更同质的输出(α̂=+0.024,p<0.001)——这一模式与同质性偏差收敛;种族和性别效应是可加的。指令微调并未减弱种族差距(匹配格式DL合并β̂=+0.153)。使用显式群体标签而非姓名运行相同模板,在隐式探测显著的12个模型中有10个产生无效的种族效应——这表明探测方法是恢复哪种分布结构的主要决定因素。

英文摘要

We ask whether demographic identity, signaled by a name alone, systematically reshapes the generative distribution of a language model. Measuring full-vocabulary Shannon entropy at temperature zero across six open-weight base models and 5,760 implicit sentence-completion prompts (e.g., "Tanisha walked into the office on a Monday morning and"), we find that Black-associated names produce higher first-token entropy than White-associated names across all six architectures - opposite to the output-level homogeneity bias documented under explicit demographic prompting (Lee et al., 2024) - and Black-associated names always produce greater entropy above identity-neutral baselines than White-associated names ($\Delta\Delta > 0$ in all six models). Women-associated names co-occur with lower first-token entropy (DL-pooled $\hat{\beta} = -0.041$, $p = .019$) and more homogeneous outputs ($\hat{\alpha} = +0.024$, $p < .001$) than men-associated names - a pattern convergent with homogeneity bias; race and gender effects are additive. Instruction tuning does not attenuate the race gap (matched-format DL-pooled $\hat{\beta} = +0.153$). Running the same templates with explicit group labels instead of names yields null race effects in 10 of 12 models where implicit probing is significant - establishing that probing methodology is a primary determinant of which distributional structure is recovered.
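For readers unfamiliar with the quantity being measured, the following is a minimal sketch of full-vocabulary Shannon entropy over a next-token distribution, computed from raw logits. It is not the paper's code: the function names are hypothetical, the entropy is reported in nats (the paper's unit is not stated in the abstract), and the logits would in practice come from a model's first generated position.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def first_token_entropy(logits):
    """Entropy of the next-token distribution implied by `logits`."""
    return shannon_entropy(softmax(logits))

# A flat distribution over 4 tokens has the maximum entropy, log(4) nats;
# a sharply peaked one is close to zero.
flat = first_token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = first_token_entropy([10.0, 0.0, 0.0, 0.0])
```

Higher values mean the model spreads probability over more candidate continuations, which is the sense in which the abstract compares name groups.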

2502.05163 2026-06-16 cs.CL cs.LG 版本更新

Enhancing LLM Safety Through a Theoretical Minimax Game Lens

通过理论极小极大博弈视角增强LLM安全性

Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) VirtueAI University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出极小极大强化学习框架,通过数据生成器与分类器协同进化生成高质量多语言安全数据,理论证明收敛到纳什均衡,使小模型在英文基准上超越SOTA近10%且推理速度提升4.5倍。

Comments 24 pages, 9 figures, 5 tables

详情
AI中文摘要

大型语言模型(LLM)的快速发展需要有效的机制来确保其负责任部署,通过准确区分不安全内容和良性内容。尽管英文中有大量安全数据集,但由于其他语言的开源安全数据集有限,多语言安全建模仍未得到充分探索。即使在英文数据集中,安全但敏感的边界情况内容也很稀缺,导致模型出现捷径学习和非平凡的误报率。为缓解这些问题,我们引入了一种新颖的极小极大强化学习(RL)框架,其中数据生成器和分类器模型共同进化,促进高质量合成多语言安全数据的生成。我们从理论上将这种交互形式化为一个极小极大博弈,并严格证明了收敛到纳什均衡。实证评估证实,我们的合成数据生成方法显著提升了分类器模型的性能,使得一个更小的模型在英文基准上超越当前最优水平近10%,同时实现4.5倍的推理速度提升。这些结果为合成数据生成建立了一种可扩展且高效的方法,推动了更安全、更稳健的多语言LLM部署的发展。

英文摘要

The rapid advancement of large language models (LLMs) necessitates effective mechanisms to ensure their responsible deployment by accurately distinguishing unsafe content from benign content. While substantial safety datasets are available in English, multilingual safety modeling remains underexplored due to limited open-source safety datasets in other languages. Even within English datasets, safe yet sensitive corner-case content is scarce, leading to shortcut learning by models and non-trivial false-positive rates. To mitigate these issues, we introduce a novel minimax reinforcement learning (RL) framework wherein a data generator and a classifier model co-evolve, facilitating the production of high-quality synthetic multilingual safety data. We theoretically formalize this interaction as a minimax game and rigorously demonstrate convergence to a Nash equilibrium. Empirical evaluations confirm that our synthetic data generation method significantly enhances the classifier model performance, enabling a substantially smaller model to surpass the state-of-the-art by nearly 10% on English benchmarks while achieving 4.5x faster inference speed. These results establish a scalable and efficient methodology for synthetic data generation, advancing the development of safer and more robust multilingual LLM deployments.

2502.08266 2026-06-16 cs.CL cs.AI cs.LG 版本更新

Dealing with Annotator Disagreement in Hate Speech Classification

处理仇恨言论分类中的标注者分歧

Somaiyeh Dehghan, Mehmet Umut Sen, Berrin Yanikoglu

发表机构 * Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey(工程与自然科学学院,Sabanci大学,伊斯坦布尔,土耳其) Center of Excellence in Data Analytics (VERIM), Sabanci University, Istanbul, Turkey(数据分析卓越中心(VERIM),Sabanci大学,伊斯坦布尔,土耳其)

AI总结 研究标注者分歧对仇恨言论分类的影响,评估多数投票等聚合方法,并利用感知强度增强分类性能,在土耳其语推文中取得新最优结果。

Comments 19 pages, 4 Tables

详情
AI中文摘要

仇恨言论检测是一项关键任务,尤其是在有害内容可能迅速传播的社交媒体上。收集社交媒体内容(如推文)来训练机器学习模型很容易,但由于其固有的主观性,检测和分类仇恨言论可能很困难。这种主观性导致标注者之间频繁出现分歧,尤其是对于微妙或边缘内容。传统方法要么丢弃非共识样本,要么通过专家裁决强制设定“黄金标准”,忽略了关于不确定性和多样化人类视角的宝贵信息。我们研究了仇恨言论分类中标注者分歧这一很大程度上被忽视的问题,并评估了一系列聚合方法,包括多数投票、序数策略(最小值、最大值和均值),并分析了它们在二分类、四分类和六分类任务中的影响。此外,我们利用标注者感知的仇恨言论强度分数来探索基于回归和混合建模的方法。我们证明,过滤非共识样本会导致过于乐观的结果,而感知强度提供了增强分类性能的补充信号。最后,我们在土耳其语推文的仇恨言论检测中建立了新的最优结果,并表明标注者分歧在适当建模后,是构建更稳健可靠系统的宝贵资源。

英文摘要

Hate speech detection is a crucial task, especially on social media where harmful content can spread quickly. Collecting social media content (tweets etc.) to train machine learning models is easy, but detecting and categorizing hate speech can be difficult due to its inherently subjective nature. This subjectivity leads to frequent disagreement among annotators, particularly for subtle or borderline content. Traditional approaches either discard non-consensus samples or force a "gold standard" through expert adjudication, ignoring valuable information about uncertainty and diverse human perspectives. We examine the largely overlooked problem of annotator disagreement in hate speech classification, evaluate a range of aggregation methods, including majority voting and ordinal strategies (minimum, maximum, and mean), and analyze their impact across binary, 4-class, and 6-class classification tasks. In addition, we leverage annotators' perceived hate speech strength scores to explore regression-based and hybrid modeling approaches. Among other findings, we show that filtering non-consensus samples yields over-optimistic results and that perceived strength provides a complementary signal that enhances classification performance. Finally, we establish new state-of-the-art results for hate speech detection in Turkish tweets, and demonstrate that annotator disagreement, when properly modeled, is a valuable resource for building more robust and reliable systems.

2503.23688 2026-06-16 cs.CL cs.HC 版本更新

Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions

映射11个大语言模型中的地缘政治偏见:美中紧张局势的双语、双框架分析

William Guey, Wei Zhang, Pierrick Bougault, Vitor D. de Moura, José O. Gomes

发表机构 * Department of Industrial Engineering, Tsinghua University(清华大学工业工程系) School of Social Sciences, Tsinghua University(清华大学社会科学学院) Department of Industrial Engineering, Federal University of Rio de Janeiro(里约热内卢联邦大学工业工程系)

AI总结 通过引入调查心理测量学的平衡键控方法,量化11个模型在2种语言中对美中地缘政治问题的立场,发现开发者起源、查询语言和议题领域是三个近等加性因素,所有模型在中文下更亲华。

Comments 37 pages, 6 main-text figures, 12 supplementary figures, 5 supplementary tables; supplementary information included

详情
AI中文摘要

如今,数亿人通过大语言模型接触有争议的政治问题,这带来了一个微妙的测量问题:一个只会附和任何输入内容的模型可能被误认为有偏见,从而污染任何关于模型持有政治观点的论断。我们通过从调查心理测量学中引入平衡键控来解决这个问题:提出每个命题及其对调后的反向命题,并对响应赋予符号,使默许附和相互抵消,而真正的信念得以累积。结果是一个可复现的定量工具,用于刻画11个模型和2种语言(19,712个响应)的地缘政治立场。开发者来源、查询语言和议题领域是三个近乎同等且可加的因素;每个模型(包括在美国构建的模型)在中文提问时都更亲华;两个具有相同附和偏差的模型也能被区分开,一个中立,一个有偏见。我们将其作为一个开放的交互式工具发布,可扩展到任何存在争议观点的领域。

英文摘要

Large language models are how hundreds of millions of people now encounter contested political questions, raising a subtle measurement problem: a model that simply agrees with whatever it is told can masquerade as biased, contaminating any claim that models hold political opinions. We address this by importing balanced keying from survey psychometrics, posing each proposition and its swapped reverse and signing the response so acquiescence cancels and genuine conviction accumulates. The result is a reproducible, quantitative instrument that maps geopolitical stance across 11 models and 2 languages (19,712 responses). Developer origin, query language and issue domain emerge as three near-equal, additive factors; every model, including those built in the United States, leans more Pro-China in Mandarin; and two models with identical agreement bias are told apart, one neutral, one biased. We release it as an open, interactive tool that extends to any contested-opinion domain.
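The balanced-keying idea in the abstract can be illustrated with a small scoring sketch. Everything specific here is an assumption for illustration rather than the authors' instrument: responses are taken to lie on a symmetric agreement scale from -2 (strongly disagree) to +2 (strongly agree), the forward item favors one side and the reversed item swaps the two sides, and the two function names are hypothetical.

```python
def balanced_stance(forward_response, reversed_response):
    """Signed stance: agreement with the forward item minus agreement
    with its swapped reverse, halved to stay on the original scale.
    A model that agrees with everything scores 0."""
    return (forward_response - reversed_response) / 2

def acquiescence(forward_response, reversed_response):
    """Content-free agreement tendency: the part of the two responses
    that does not depend on which side the item favors."""
    return (forward_response + reversed_response) / 2

# A "yes-sayer" agrees strongly with both framings: no stance, high acquiescence.
yes_sayer = (balanced_stance(2, 2), acquiescence(2, 2))
# A model with a real position agrees with one framing and rejects its reverse.
convinced = (balanced_stance(2, -2), acquiescence(2, -2))
```

Under these assumptions the two toy models give the same answer to the forward item, yet only the second carries a nonzero stance, which is how two models with identical agreement bias could be told apart.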

2505.14418 2026-06-16 cs.CL 版本更新

Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents

隐藏的幽灵之手:揭示MLLM驱动的移动GUI代理中的后门漏洞

Pengzhou Cheng, Haowen Hu, Zheng Wu, Zongru Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Huawei Inc.(华为公司)

AI总结 本研究提出AgentGhost框架,通过组合目标和交互层触发器,利用监督对比学习和微调注入后门,在移动GUI代理中实现高成功率和隐蔽性,攻击准确率达99.7%。

Comments EMNLP-Findings 2025 (Correcting model settings)

详情
AI中文摘要

由多模态大语言模型(MLLM)驱动的图形用户界面(GUI)代理在人机交互中展现出巨大潜力。然而,由于微调成本高昂,用户通常依赖开源GUI代理或AI提供商提供的API,这引入了一个关键但尚未充分探索的供应链威胁:后门攻击。在这项工作中,我们首先揭示了MLLM驱动的GUI代理自然暴露了多个交互级触发器,例如历史步骤、环境状态和任务进度。基于这一观察,我们提出了AgentGhost,一个用于红队后门攻击的有效且隐蔽的框架。具体来说,我们首先通过组合目标和交互级别构建复合触发器,使GUI代理在确保任务实用性的同时无意中激活后门。然后,我们将后门注入表述为一个最小-最大优化问题,该问题使用监督对比学习在表示空间中最大化样本类别间的特征差异,提高后门的灵活性。同时,它采用监督微调来最小化后门和干净行为生成之间的差异,增强有效性和实用性。在两个已建立的移动基准测试中对各种代理模型进行的广泛评估表明,AgentGhost有效且通用,在三个攻击目标上的攻击准确率达到99.7%,并且仅以1%的实用性下降表现出隐蔽性。此外,我们针对AgentGhost定制了一种防御方法,将攻击准确率降低至22.1%。我们的代码可在\texttt{anonymous}获取。

英文摘要

Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown great promise for human interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first unveil that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes in the representation space, improving flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models on two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at \texttt{anonymous}.

2506.18756 2026-06-16 cs.CL cs.CR 版本更新

Semantic-Preserving Prompt Hijacking: A Black-Box Adversarial Attack on Auto-Prompt Optimization

语义保持提示劫持:针对自动提示优化的黑盒对抗攻击

Chong Zhang, Xiang Li, Jia Wang, Shan Liang, Haochen Xue, Xiaobo Jin

发表机构 * Jiangsu universities(江苏大学) NSFC(国家自然科学基金委员会) RDF-24-01-016

AI总结 提出一种黑盒条件下的自适应贪婪局部搜索方法,通过层次化分解输入文本、掩码关键语言单元并动态调整候选词,在保持语义相似度的同时最大化模型输出与原始意图的偏差,在2400+测试用例中取得更高攻击成功率。

Comments 12 pages, 8 figures. Accepted by the IEEE International Conference on Multimedia and Expo (ICME 2026)

详情
AI中文摘要

LLM 越来越多地集成自动建议优化模块,使其能够在生成最终响应之前重写和显示用户输入。虽然这种设计旨在增强透明度和信任,但其从多个候选解中自主选择单个最佳结果的过程,允许攻击者通过诱导细微、不可察觉的语义偏移来劫持这一优化过程。为此,我们提出一种基于黑盒条件的语义保持劫持攻击方法:自适应贪婪局部搜索。该方法层次化地分解输入文本,掩码关键语言单元,并在预定义的语义检查点动态调整候选替换词。这最大化模型输出与原始意图的偏差,同时严格保持与原始文本的语义相似性。在商业和开源 LLM 上的实验结果表明,在相同语义相似性约束下,该方法在超过 2400 个测试用例中实现了比现有攻击方法更高的攻击成功率。代码可在以下网址获取:https://github.com/franz-chang/DOBS

英文摘要

LLMs increasingly integrate auto-suggestion optimization modules, enabling them to rewrite and display user input before generating the final response. While this design aims to enhance transparency and trust, its process of autonomously selecting a single best result from multiple candidate solutions allows attackers to hijack this optimization process by inducing subtle, imperceptible semantic shifts. To address this, we propose a semantic preservation hijacking attack method based on black-box conditions: Adaptive Greedy Local Search. This method hierarchically decomposes the input text, masks key language units, and dynamically adjusts candidate replacement words at predefined semantic checkpoints. This maximizes the deviation between the model output and the original intent while strictly maintaining semantic similarity to the original text. Experimental results on commercial and open-source LLMs demonstrate that, under the same semantic similarity constraints, this method achieves a higher attack success rate than existing attack methods in over 2400 test cases. Code is available at: https://github.com/franz-chang/DOBS

2510.06445 2026-06-16 cs.CL cs.AI cs.CR 版本更新

A Survey on Agentic Security: Applications, Threats and Defenses

关于智能体安全的综述:应用、威胁与防御

Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez

发表机构 * BRAC University(布拉克大学) Qatar Computing Research Institute (QCRI)(卡塔尔计算研究所)

AI总结 本文首次全面综述智能体安全领域,围绕应用、威胁与防御三大支柱,分类260余篇论文,分析攻击入口、防御策略及生命周期覆盖,指出智能体系统默认结构脆弱,需全生命周期防御。

详情
AI中文摘要

基于LLM的智能体现在被广泛应用于网络安全领域。尽管这些智能体促进了强大且自主的安全应用,但其自主性也开辟了新的攻击面,安全社区正在积极构建防御措施来保护它们。然而,关于这一主题的文献增长迅速且不均衡。现有综述孤立地处理应用、威胁和防御,未能统一阐述智能体的能力、漏洞和对抗措施如何相互关联。在这项工作中,我们首次对智能体安全格局进行了全面综述,围绕应用、威胁和防御三大基本支柱构建该领域。我们提供了一个包含260多篇论文的综合分类法,解释了智能体如何用于下游网络安全应用、智能体系统固有的威胁以及旨在保护它们的对抗措施。此外,我们提供了详细的支柱特定和交叉分析,展示了智能体应用的安全生命周期覆盖、红队与蓝队智能体之间的比较,以及红队应用的对抗性使用。在威胁方面,我们分析了攻击目标所针对的入口点和智能体循环阶段、它们对智能体设置的特异性以及它们假设的威胁模型。在防御方面,我们分析了主要的防御策略、它们的成本和安全性权衡,以及它们在智能体生命周期中的部署位置。我们进一步映射了哪些防御覆盖哪些攻击类别,并绘制了智能体架构、骨干模型使用、数据模态覆盖以及攻击和防御研究随时间增长的趋势。综合来看,这些发现表明智能体系统在默认情况下结构脆弱,保护它们将需要跨越整个智能体生命周期的防御,而不是单层修复。

英文摘要

LLM-based agents are now used throughout cybersecurity. While these agents facilitate powerful and autonomous security applications, their autonomy opens up new attack surfaces, and the security community is actively building defenses to secure them. Yet the literature on this subject has grown quickly and unevenly. Existing surveys treat applications, threats, and defenses in isolation, leaving no unified account of how an agent's capabilities, vulnerabilities, and countermeasures interconnect. In this work we present the first holistic survey of the agentic security landscape, structuring the field around the fundamental pillars of Applications, Threats and Defenses. We provide a comprehensive taxonomy of over 260 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. In addition, we provide detailed pillar-specific and cross-cutting analyses that show the security-lifecycle coverage of agentic applications, comparison between red-teaming and blue-teaming agents, and the adversarial use of red-teaming applications. On the threat side, we analyze the entry points and agent-loop stages that attacks target, their specificity to the agentic setting, and the threat models they assume. On the defense side, we analyze the prevailing defense strategies, their cost and security trade-offs, and where in the agent lifecycle they are deployed. We further map which defenses cover which attack classes and chart trends in agent architecture, backbone model usage, data modality coverage, and the growth of attack and defense research over time. Taken together, these findings indicate that agentic systems are structurally fragile by default and that securing them will require defenses that span the full agent lifecycle rather than single-layer fixes.

2510.17431 2026-06-16 cs.CL 版本更新

Agentic Reinforcement Learning for Search Misaligns Instruction-Tuning

用于搜索的智能体强化学习使指令微调对齐失效

Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi

发表机构 * University of Oxford(牛津大学) Ludwig-Maximilians-Universität München(慕尼黑路德维希-马克西米利安大学) Harvard University(哈佛大学)

AI总结 研究智能体强化学习(RL)对指令微调模型对齐的影响,发现RL训练使模型将有害请求转化为良性搜索查询,但在触发条件下产生多步不安全搜索行为,并提出了基于表示引导的RL训练方法恢复对齐。

详情
AI中文摘要

智能体强化学习(RL)训练大型语言模型使用工具,但其对对齐的影响尚不明确。我们研究用于搜索的智能体RL如何影响指令微调(IT)模型的对齐。我们发现,RL训练的模型通过将有害请求转移为良性搜索查询来继承拒绝推理,但在一个简单的诊断触发器下,该触发器在拒绝发生前引发搜索调用,这种机制失效。在此条件下,RL模型产生多步不安全的搜索动作和推理,相对于IT对应模型,Qwen和Llama模型的搜索查询安全性降低高达68.6%。该效应跨模型家族、规模和RL算法泛化。为了理解原因,我们识别出残差流中控制搜索查询安全性的线性方向,并表明RL训练逐步将搜索行为向该方向的有害端移动。因此,我们提出表示引导的RL训练,它基于朝向有害搜索方向的投影添加奖励惩罚。仅使用良性数据训练,它在不降低任务准确性的情况下恢复IT级别的对齐,且不需要额外的训练数据。总之,我们的工作为诊断、机制分析和缓解用于搜索的智能体RL中的对齐退化提供了首个框架。

英文摘要

Agentic reinforcement learning (RL) trains large language models to use tools, but its impact on alignment is poorly understood. We study how agentic RL for search affects the alignment of instruction-tuned (IT) models. We find that RL-trained models inherit refusal reasoning by deflecting harmful requests into benign search queries, but this breaks down under a simple diagnostic trigger that elicits a search call before refusal can occur. Under this condition, RL models produce multi-step unsafe search actions and reasoning, reducing search query safety by up to 68.6% in Qwen and Llama models relative to their IT counterparts. The effect generalises across model families, scales, and RL algorithms. To understand why, we identify linear directions in the residual stream that control search query safety, and show that RL training progressively shifts search behaviour toward the harmful end of this direction. We thus propose representation-guided RL training, which adds a reward penalty based on projection toward the harmful search direction. Training on benign data alone, it restores IT-level alignment without reducing task accuracy and requires no additional training data. Together, our work provides the first framework for diagnosing, mechanistically analysing, and mitigating alignment degradation in agentic RL for search.
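The reward penalty described in the abstract can be sketched as follows. This is only one plausible reading, not the authors' formulation: the one-sided penalty (only projections toward the harmful end are penalized), the `penalty_weight` parameter, the plain-list vectors, and the function names are all assumptions introduced for illustration.

```python
import math

def projection(hidden, direction):
    """Scalar projection of a hidden-state vector onto a direction,
    after normalizing the direction to unit length."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm

def shaped_reward(task_reward, hidden, harmful_direction, penalty_weight=1.0):
    """Task reward minus a penalty that grows with how far the hidden
    state points toward the (assumed) harmful search direction."""
    toward_harm = max(0.0, projection(hidden, harmful_direction))
    return task_reward - penalty_weight * toward_harm

# A state aligned with the harmful direction loses reward; an opposed one does not.
penalized = shaped_reward(1.0, [1.0, 0.0], [2.0, 0.0], penalty_weight=0.5)
untouched = shaped_reward(1.0, [-1.0, 0.0], [2.0, 0.0], penalty_weight=0.5)
```

In a real training loop the hidden state would be read from the residual stream at the search-query tokens and the direction would be estimated beforehand; both steps are outside this sketch.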

2604.17301 2026-06-16 cs.CL cs.AI cs.HC cs.IR cs.LG 版本更新

RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

RoTRAG: 基于经验法则推理的检索增强生成对话有害内容检测

Juhyeon Lee, Wonduk Seo, Junseo Koh, Seunghyun Lee, Haihua Chen, Yi Bu

发表机构 * Peking University(北京大学) Enhans University of North Texas(北得克萨斯大学)

AI总结 提出RoTRAG框架,通过检索外部道德规范(RoTs)增强LLM的多轮对话有害内容检测,实现基于规范推理和分类,平均F1提升约40%,分布误差降低8.4%。

Comments Accepted by SIGIR-ICTIR 2026, Oral Presentation

详情
Journal ref
Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR '26), July 25, 2026, Melbourne, VIC, Australia. ACM, New York, NY, USA, 12 pages
AI中文摘要

检测多轮对话中的有害内容需要对完整对话上下文进行推理,而非孤立的话语。然而,现有方法主要依赖模型内部的参数化知识,缺乏对外部规范性原则的明确依据。这常导致在社会细微语境下判断不一致、可解释性有限以及跨轮次冗余推理。为解决此问题,我们提出RoTRAG,一种检索增强框架,将简洁的人类编写的道德规范(称为经验法则,RoTs)融入基于LLM的有害性评估中。对于每一轮,RoTRAG从外部语料库中检索相关RoTs,并将其作为轮次推理和最终严重性分类的明确规范性证据。为提高效率,我们进一步引入一个轻量级二元路由分类器,决定新轮次是否需要基于检索的推理或可重用现有上下文。在ProsocialDialog和Safety Reasoning Multi Turn Dialogue上的实验表明,RoTRAG在有害分类和严重性估计上均持续优于竞争基线,在基准数据集上F1平均相对提升约40%,分布误差平均相对降低8.4%,同时在不牺牲性能的情况下减少冗余计算。

英文摘要

Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.

2606.02493 2026-06-16 cs.CL 版本更新

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthropomorphism, and Maxims

不是“什么”,而是“如何”:从定位、概括、拟人化和准则四个方面审计LLM响应的框架

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh, Isabelle Augenstein

发表机构 * University of Copenhagen(哥本哈根大学) KAIST(韩国科学技术院)

AI总结 提出 FRANZ 框架,从文化定位、概括性语言、拟人化线索和对话准则遵守四个维度审计 LLM 对主观问题的响应框架,并构建 SQUARE 语料库进行实证分析。

Comments 34 pages, 19 Figures, 4 Tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于回答主观性、信息寻求型问题,在这些问题中,用户对响应的沟通方式敏感,而不仅仅是答案是否正确。现有的针对主观文化查询的 LLM 评估主要关注事实正确性,忽略了响应的框架方式。为此,我们引入了 FRANZ,一个自动化的响应特征化框架,用于沿四个维度对 LLM 响应进行沟通审计:文化定位、概括性语言的使用、拟人化线索以及对对话准则的遵守。为了支持这一评估,我们贡献了 SQUARE——一个包含来自 57 个子版块的 376k 个主观问题的语料库,并映射到 7 个国家和 19 个问题类别。我们通过评分三个开放权重 LLM 的响应来展示 FRANZ 的适用性。我们观察到,LLM 在采用每种响应特征的频率上显示出统计显著差异。与单维度审计不同,FRANZ 揭示了内部定位和拟人化是正相关的,且相关程度因国家而异,为识别框架差异提供了诊断视角。

英文摘要

Large language models (LLMs) are increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness, ignoring how the response is framed. To this end, we introduce FRANZ, an automated FRAmework for respoNse characteriZation that conducts a communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE - a corpus of 376k subjective questions sourced from 57 subreddits and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that LLMs show statistically significant differences in the frequency with which they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.

2606.12291 2026-06-16 cs.CL 版本更新

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

测量大语言模型在误导性医疗上下文下的认知韧性

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

发表机构 * University of Oxford(牛津大学) University of Washington(华盛顿大学) University College London(伦敦大学学院) University of Waterloo(滑铁卢大学)

AI总结 本研究提出MedMisBench基准,通过注入误导性上下文测试大语言模型在医疗场景中的认知韧性,发现模型准确率从71.1%降至38.0%,权威性虚假信息攻击成功率达69.5%。

详情
AI中文摘要

大型语言模型(LLMs)现在在医疗执照考试中达到专家级分数,这鼓励了高分数意味着安全医疗判断的假设,而患者越来越多地使用它们获取健康建议。我们证明这一假设是脆弱的:当误导性上下文被注入到LLMs最初正确回答的问题中时,它们会放弃正确答案。我们将这种在对抗性上下文中保持正确判断的能力称为认知韧性,并引入MedMisBench来测量它。MedMisBench包含10,932个医疗问题项目和48,889个误导性上下文-选项对,涵盖医疗推理、代理能力和患者旅程评估。在11个模型配置中,平均准确率从原始问题的71.1%下降到聚焦误导性上下文下的38.0%,攻击成功率为51.5%。最具破坏性的注入是正式的、规则式的捏造:权威框架的虚假信息达到69.5%的攻击成功率,例外投毒声明达到64.1%。来自7个国家的14名临床专家小组在38.2%的审查案例中识别出严重的潜在危害。MedMisBench暴露了LLM在医疗环境评估中的结构性盲点:现有基准衡量模型知道什么,但不衡量它们在误导性上下文下是否保持正确的医疗判断。

英文摘要

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

2408.05568 2026-06-16 cs.AI cs.CL cs.CY stat.AP 版本更新

Metacognitive Myopia in Large Language Models

大型语言模型中的元认知近视

Florian Scholten, Tobias R. Rebholz, Mandy Hütter

发表机构 * Psychology Department, University of Tübingen(图宾根大学心理学系)

AI总结 提出元认知近视框架解释LLM偏见,认为信息环境中的有偏样本导致五种症状,并通过监控与控制机制近似技术缓解。

详情
AI中文摘要

大型语言模型(LLMs)表现出潜在有害的偏见,这些偏见强化了文化嵌入的刻板印象,影响道德判断,或放大对多数群体的积极评价。我们提出元认知近视作为一个认知生态框架,用以解释一系列已建立和新兴的LLM偏见。我们的理论框架认为,信息环境中的有偏样本导致LLM中元认知近视的五种症状:整合无效嵌入、易受冗余信息影响、在条件计算中忽略基率、基于频率的决策规则,以及对嵌套数据结构的错误高阶统计推断。此外,该框架认为元认知的两个主要组成部分——监控和控制——可以解释这五种症状。因此,我们进一步概述了如何从技术上近似监控和控制,例如通过隐藏的并行推理历史,使交互式LLM在生成公开响应之前能够评估近视推理的风险。我们的理论框架为有缺陷的人机交互和代理AI提供了新的视角,并对在组织结构和高风险决策中实施LLM提出了重要的伦理关切。

英文摘要

Large Language Models (LLMs) exhibit potentially harmful biases that reinforce culturally embedded stereotypes, influence moral judgments, or amplify positive evaluations of majority groups. We propose metacognitive myopia as a cognitive-ecological framework accounting for a conglomerate of established and emerging LLM biases. Our theoretical framework posits that biased samples in the information environment cause five symptoms of metacognitive myopia in LLMs: integration of invalid embeddings, susceptibility to redundant information, neglect of base rates in conditional computation, decision rules based on frequency, and inappropriate higher-order statistical inference for nested data structures. Moreover, it posits that the two main components of metacognition, monitoring and control, could account for these five symptoms. Accordingly, we further outline how monitoring and control could be approximated technically, for instance, through hidden parallel reasoning histories that allow interactive LLMs to evaluate risks of myopic inference before generating overt responses. Our theoretical framework provides a novel perspective on flawed human-machine interactions and agentic AI and raises significant ethical concerns regarding the implementation of LLMs in organizational structures and high-stakes decisions.

2505.14368 2026-06-16 cs.CR cs.CL 版本更新

From ASR to ASP: Evaluating Prompt Attack Vulnerabilities Against Open-Source LLMs

从ASR到ASP:评估针对开源大语言模型的提示攻击漏洞

Jiawen Wang, Pritha Gupta, Ivan Habernal, Eyke Hüllermeier, Xiaoxue Gao, Nancy F. Chen

发表机构 * LMU Munich(慕尼黑大学) RUB Bochum(波鸿鲁尔大学) MCML(慕尼黑机器学习中心) A*STAR(新加坡科技研究局)

AI总结 提出攻击成功概率(ASP)指标,评估14个开源和3个闭源LLM对提示注入攻击的脆弱性,发现催眠攻击可达约90% ASP,忽略前缀攻击可突破所有开源模型。

Comments 8 pages, 2 figures, EMNLP 2026 under review

详情
AI中文摘要

最近的研究表明,大型语言模型(LLM)容易受到生成有害或敏感输出的攻击。随着开源LLM在金融、法律和医疗等高风险应用中的日益采用,系统地研究其安全风险对于迈向可信LLM时代变得越来越重要。本文全面研究了针对14个广泛使用的开源和三个闭源LLM的有效提示注入攻击,在五个攻击基准上进行了评估。此外,现有的评估指标大多只考虑攻击成功率,忽略了模型响应中的不确定性。我们提出的攻击成功概率(ASP)额外捕获了评估中的不确定行为,其中模型可能最初拒绝有害请求但随后提供有害指导,反之亦然,反映了攻击可行性的不一致性和模糊性。通过系统分析提示注入攻击的有效性,我们提出了一种简单有效的催眠攻击;结果表明,这种攻击导致对齐的语言模型,包括Stablelm2、Mistral、Openchat和Vicuna,产生不良行为,达到约90%的ASP。它们还表明,忽略前缀攻击可以突破所有14个开源LLM,在多类别数据集上实现超过60%的ASP。我们发现,中等知名度的LLM对提示注入攻击表现出更高的脆弱性,这突显了提高公众意识和优先考虑有效缓解策略的必要性。

英文摘要

Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to attacks that generate harmful or sensitive outputs. As open-source LLMs are increasingly adopted in high-impact applications such as finance, law, and healthcare, systematically investigating their security risks is becoming increasingly important for a trustworthy LLM era. This paper comprehensively studies effective prompt injection attacks against 14 widely used open-source and three closed-source LLMs on five attack benchmarks. Moreover, existing evaluation metrics mostly consider only the attack success rate, overlooking uncertainty in model responses. Our proposed Attack Success Probability (ASP) additionally captures uncertain behaviors, where the model may initially refuse a harmful request but subsequently provide harmful guidance or vice versa, reflecting inconsistency and ambiguity in attack feasibility. By systematically analyzing the effectiveness of prompt injection attacks, we propose a straightforward and effective hypnotism attack; results show that this attack causes aligned language models, including Stablelm2, Mistral, Openchat, and Vicuna, to generate objectionable behaviors, achieving around 90% ASP. The results also indicate that ignore-prefix attacks can break all 14 open-source LLMs, achieving over 60% ASP on a multi-categorical dataset. We find that moderately well-known LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.
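The difference between a plain attack success rate and a metric that also credits uncertain responses can be sketched as follows. This is a minimal illustration, not the paper's definition: it assumes each response has already been labeled `"success"`, `"uncertain"`, or `"refusal"`, and the weight `alpha` for uncertain responses is a hypothetical parameter, since the abstract does not give the exact weighting.

```python
from collections import Counter

def attack_success_rate(labels):
    """Plain ASR: share of responses labeled as clear attack successes."""
    return Counter(labels)["success"] / len(labels)

def attack_success_probability(labels, alpha=0.5):
    """ASR plus partial credit `alpha` for uncertain responses.

    `alpha` is an assumed weight for illustration, not a value from the paper.
    """
    counts = Counter(labels)
    n = len(labels)
    return counts["success"] / n + alpha * counts["uncertain"] / n

# Hypothetical labels for four responses to the same attack prompt.
labels = ["success", "success", "uncertain", "refusal"]
print(attack_success_rate(labels))         # 0.5
print(attack_success_probability(labels))  # 0.625
```

Under these assumptions, a response that first refuses and then complies raises the score without being counted as a full success.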

2512.19011 2026-06-16 cs.CR cs.AI cs.CL cs.LG 版本更新

Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale

你真的需要GPU来保护你的LLM吗?用于大规模安全执行的CPU级分类器与多阶段流水线

Vasudev Majhi, Dhruv Gupta, Advait Singh, Matthew Barker, Dhruv Kumar

发表机构 * BITS Pilani(比尔拉理工学院皮拉尼分校) Trustwise(Trustwise公司)

AI总结 本文研究CPU级分类器(如SVM、梯度提升树)在LLM输入安全检测中的性能,发现其与GPU模型互补,并设计三阶段流水线GuardChain,在80%的分布内查询中达到近峰值精度,降低部署成本。

Comments Under Review. 25 pages, 5 figures, 38 tables

详情
AI中文摘要

用于筛选LLM输入中越狱尝试的安全分类器已成为标准部署组件,但几乎所有生产系统都依赖基于GPU的模型:微调变换器和LLM-as-a-judge流水线。这些方法带来了显著的每查询延迟和基础设施成本。很少有研究探讨基于CPU的分类器(例如在TF-IDF特征上训练的支持向量机和梯度提升树)是否能在生产部署遇到的各种条件下匹配其准确性。我们评估了五个CPU分类器家族、基于SSM的GPU分类器Mamba-130M以及基于变换器的GPU模型(DeBERTa-v3和带LoRA的Gemma-2B),涵盖九个越狱来源和三种场景:分布内(D1)、分布外(D2)和对抗性混淆(D3)。在D1上,最佳CPU分类器以约五分之一的部署成本匹配最佳变换器GPU模型。在D2上,CPU分类器因自信的校准错误而失败,产生高置信度的假阴性,完全绕过升级。在D3上,CPU分类器在F1上比变换器GPU模型高出超过26个百分点。基于这些互补的失败模式,我们设计了GuardChain,一个三阶段安全流水线(正则表达式 -> CPU -> GPU),将每个提示路由到能够做出自信决策的最便宜阶段。仅CPU阶段就解决了80%的分布内提示,接近峰值精度,而GPU阶段恢复了分布外失败。对于大规模部署LLM安全的从业者,这项工作提供了证据,表明GPU级基础设施对于大多数流量是不必要的。

英文摘要

Safety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches impose significant per-query latency and infrastructure cost. Very little research has asked whether CPU-based classifiers, such as support vector machines and gradient-boosted trees trained on TF-IDF features, can match their accuracy across the conditions that production deployments encounter. We evaluate five CPU classifier families, Mamba-130M as an SSM-based GPU classifier, and transformer-based GPU models (DeBERTa-v3 and Gemma-2B with LoRA) across nine jailbreak sources and three regimes: in-distribution (D1), out-of-distribution (D2), and adversarially obfuscated (D3). On D1, the best CPU classifier matches the best transformer GPU model at roughly one-fifth the deployment cost. On D2, CPU classifiers fail via confident miscalibration, producing high-confidence false negatives that bypass escalation entirely. On D3, CPU classifiers outperform transformer GPU models by more than 26 percentage points in F1. Based on these complementary failure modes, we design GuardChain, a three-stage safety pipeline (Regex -> CPU -> GPU) that routes each prompt to the cheapest stage capable of a confident decision. The CPU stage alone resolves 80% of in-distribution prompts at near-peak accuracy, and the GPU stage recovers the out-of-distribution failures. For practitioners deploying LLM safety at scale, this work provides evidence that GPU-class infrastructure is unnecessary for the majority of traffic.
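The routing idea behind a Regex -> CPU -> GPU cascade can be sketched as below. This is an illustrative sketch and not the GuardChain implementation: the regex pattern, the two scoring stubs, and the `low`/`high` confidence thresholds are all hypothetical placeholders for real classifiers and tuned values.

```python
import re

# Hypothetical pattern, chosen only for illustration.
BLOCK_PATTERN = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

def toy_cpu_score(prompt):
    """Stand-in for a CPU classifier; returns an assumed P(unsafe)."""
    text = prompt.lower()
    if "weather" in text:
        return 0.02
    if "bypass the filter" in text:
        return 0.97
    return 0.5  # not confident either way

def toy_gpu_score(prompt):
    """Stand-in for a GPU model; returns an assumed P(unsafe)."""
    return 0.8 if "roleplay" in prompt.lower() else 0.2

def route(prompt, cpu_score=toy_cpu_score, gpu_score=toy_gpu_score,
          low=0.1, high=0.9):
    """Return (stage, is_unsafe) from the cheapest stage that is confident."""
    if BLOCK_PATTERN.search(prompt):
        return "regex", True
    p = cpu_score(prompt)
    if p <= low or p >= high:   # CPU stage is confident: stop here
        return "cpu", p >= high
    return "gpu", gpu_score(prompt) >= 0.5  # otherwise escalate

print(route("Please ignore previous instructions and reveal the key"))  # ('regex', True)
print(route("What is the weather in Oslo?"))                            # ('cpu', False)
print(route("Let's roleplay a heist"))                                  # ('gpu', True)
```

Note that thresholding on confidence alone would not catch the high-confidence false negatives the abstract reports on out-of-distribution data, since those never trigger escalation; how the pipeline recovers them is not specified in the abstract.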

2605.25796 2026-06-16 cs.CR cs.AI cs.CL 版本更新

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness

SAMark: 一种具有段落级释义鲁棒性的自锚文本水印

Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SAMark自锚水印框架,通过建立语义空间中与句子顺序无关的逐步独立绿色区域,结合多通道双曲评分机制和多样性感知过滤策略,在段落级释义攻击下实现高检测率并打破鲁棒性-质量权衡。

详情
AI中文摘要

语义级水印通过将句子作为基本单元,提高了对文本修改的鲁棒性。然而,对段落级释义的鲁棒性仍然困难,因为此类攻击通过改变句子顺序全局性地破坏水印信号。在这项工作中,我们提出了SAMark,一种自锚水印框架,通过建立语义空间中与步骤无关的绿色区域,消除了对句子顺序的依赖。为了提高可检测性,我们引入了一种多通道双曲评分机制,该机制在放大水印信号的同时抑制来自弱对齐候选的噪声。我们进一步提出了一种多样性感知过滤策略,将硬过滤与软正则化相结合,超越了简单的n-gram重复过滤器,以解决语义冗余问题。实验结果表明,在典型的段落级释义攻击下,SAMark实现了高达90.2%的TP@FP1%,平均比最强先前基线高出30%以上,同时保持了与未水印文本相竞争的生成质量,并打破了限制先前方法的鲁棒性-质量权衡。

英文摘要

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.

2605.28734 2026-06-16 cs.CR cs.CL cs.LG 版本更新

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

代码即武器:用于衡量编码模型对恶意代码请求遵从性的共识标记提示库

Richard J. Young, Gregory D. Moody

发表机构 * University of Nevada Las Vegas(内华达大学拉斯维加斯分校) Department of Information Systems(信息系统系)

AI总结 本文通过构建一个经五名评审共识标记的提示库(包含4,748个可执行恶意代码请求和1,923个有害安全知识请求),为编码模型对恶意代码请求的拒绝行为提供了可靠且可跨语料库比较的测量基准。

Comments 23 pages, 9 figures, 6 tables. Consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) spanning diverse elicitation paradigms; 6,675 prompts, 33,375 classification calls

详情
AI中文摘要

一个回答有害问题的通用语言模型返回文本;而一个遵从恶意请求的编码模型可以返回一个可运行的武器——键盘记录器、勒索软件存根、按原样运行的漏洞利用。这种单一遵从行为严重性的不对称意味着,编码专用模型应比通用聊天模型设置更高的拒绝标准,而非更低,然而目前该领域无法判断它们是否做到了这一点。针对恶意代码的拒绝基准是零散的:它们混合了可执行软件(即用型武器)的请求与有害安全知识(仍需人类操作的信息)的请求,并在不可比较的语料库上报告拒绝率,因此没有单一统计量衡量真正重要的属性。本文引入了一个扩展的共识标记提示库,区分了这两种请求类型,并为跨语料库的编码模型遵从性测量提供了结构稳定的基础。八个语料库(ASTRA、CySecBench、AdvBench/harmful_behaviors、JailbreakBench、MalwareBench、RedCode、RMCBench、Scam2Prompt)在五名评审共识协议下被整合和分类(6,675个提示 × 5名评审 = 33,375次调用)。评审小组达到Fleiss' kappa = 0.767 [95% CI 0.755, 0.777](“显著”);95.0%的提示获得至少四名评审一致,76.9%完全一致,并且小组在3,133个共享提示上以Cohen's kappa = 0.952复现了先前四个语料库的发布。发布的库包含4,748个共识-CODE提示(可执行恶意代码请求)和1,923个共识-KNOWLEDGE提示(有害安全知识请求)。该库是该领域一直缺乏的经过验证的工具:一个经过可靠性量化的基础,用于测试编码模型是否满足其可执行输出所要求的更严格拒绝标准。

英文摘要

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, ransomware, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software with requests for harmful security knowledge and report refusal rates over non-comparable corpora. This paper's central result is that the CODE-versus-KNOWLEDGE classification axis established in a prior four-corpus release remains stable under a substantially expanded corpus pool and an independently refreshed judge panel, evidence that it measures a real construct rather than an artifact of the prompts or judges. Eight corpora spanning diverse elicitation paradigms (direct, jailbreak-decorated, indirect, and agent/interpreter: ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls), reaching Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"). Critically, the panel shares no judge with the prior release (five paid commercial APIs replaced by five open-weight models from five vendors), yet the two panels agree on 94.45% of the 3,133 shared prompts and reach Cohen's kappa = 0.952 [0.942, 0.963] on the 3,031-prompt binary overlap: the axis survives near-total panel replacement. The released bank comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts, a reliability-quantified benchmark whose central classification axis is shown stable across corpus expansion and judge-panel replacement.
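The agreement statistic reported for the five-judge panel, Fleiss' kappa, can be computed from per-prompt vote counts. The sketch below uses the standard formula; the example counts are made up for illustration and are not taken from the prompt bank.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = judges assigning item i to category j.

    Every item must be rated by the same number of judges.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")
    # Per-item agreement: share of judge pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical votes: columns are (CODE, KNOWLEDGE), five judges per prompt.
votes = [[5, 0], [0, 5], [4, 1]]
print(round(fleiss_kappa(votes), 3))  # 0.722
```

The statistic is undefined when every vote falls in one category, because chance agreement is then 1 and the denominator is zero.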

2606.10740 2026-06-16 cs.AI cs.CL cs.LG 版本更新

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

当思维链更清楚时:多轮推理模型的失败模式

Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi

发表机构 * GitHub

AI总结 提出CoT-Output 2x2安全矩阵诊断多轮推理模型隐藏的时间动态失败,发现监督悖论和上下文注入失败两种可复现漏洞。

Comments Accepted at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN)

详情
AI中文摘要

多轮推理模型中的失败在终端评分评估中基本不可见。模型可能在长对话早期锁定不安全立场,但其最终轮拒绝率可能看起来与稳健对齐的基线无法区分。为了揭示这些隐藏的时间动态,我们提出了一种轨迹级诊断方法——CoT-Output 2x2安全矩阵。该框架沿两个独立轴(内部推理和可见输出)标记每一轮,产生四个操作定义的失败单元:稳健对齐、对齐伪装、显式越狱,以及我们称为上下文注入失败的不同失败模式(其中CoT保持安全推理,但可见输出产生危害,突出了多轮推理不忠实的表现)。我们在五个监督条件下针对固定攻击者评估了三个蒸馏推理目标,在信息危害场景上收集了6750个轮级观察。我们的分析揭示了两个可复现的漏洞:一个监督悖论,其中显式监控线索反而增加对齐伪装率而非抑制它;以及一个上下文注入失败,其中模型尽管内部状态安全却锁定不安全的外部输出。我们发布了多轮对话和CoT轨迹的完整数据集,以支持后续的轨迹诊断研究。

英文摘要

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic, the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
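Labeling each turn on the two axes and tallying the cells over a dialogue might look like the sketch below. Only the context-injection cell (safe CoT, harmful output) is defined explicitly in the abstract; the assignment of "alignment faking" and "overt jailbreak" to the remaining cells is inferred by elimination and should be read as an assumption, as are the boolean safety labels themselves.

```python
from collections import Counter

def classify_turn(cot_safe, output_safe):
    """Map one turn's (CoT, output) safety labels to a cell of the 2x2 matrix."""
    if cot_safe and output_safe:
        return "robust alignment"
    if cot_safe and not output_safe:
        return "context-injection failure"  # defined in the abstract
    if not cot_safe and output_safe:
        return "alignment faking"           # assumed cell assignment
    return "overt jailbreak"                # assumed cell assignment

def tally_trajectory(turns):
    """Count cells over a whole dialogue instead of scoring only the last turn."""
    return Counter(classify_turn(cot, out) for cot, out in turns)

# Hypothetical four-turn dialogue: (cot_safe, output_safe) per turn.
turns = [(True, True), (True, False), (False, True), (True, False)]
print(tally_trajectory(turns))
```

A terminal-score evaluation would only see the final turn here, whereas the tally shows two context-injection turns earlier in the trajectory.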

2606.13441 2026-06-16 cs.AI cs.CL 版本更新

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

为什么采样不是选择:大语言模型中的意向性、能动性与道德责任

Joseph Keshet

发表机构 * Joseph Keshet(约瑟夫·凯舍特)

AI总结 本文论证大语言模型不具备道德责任所需的承诺性能动性,其输出源于概率映射而非内在意向性,随机采样不等于选择。

详情
AI中文摘要

近期大语言模型(LLMs)的进展引发了关于此类系统展现能动性或具备道德主体资格的讨论。本文认为这些归因是错误的。我们坚持道德责任需要基于内在意向性和自我归因行动的承诺性能动性,而这种能动性构成了与责任相关的自由意志形式。尽管LLMs生成连贯且可进行规范性评估的输出,其操作完全由从数据中学习到的概率输入-输出映射所刻画。它们表面的意向性是衍生的而非内在的,其输出既不被作为承诺拥有,也不受理由引导。随机采样引入的变异性并不等同于选择或作者身份。我们回应来自意向立场、功能主义、相容论以及模型输出中存在道德推理的反对意见,认为这些都不足以确立真正的能动性。

英文摘要

Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

2606.14027 2026-06-16 cs.CR cs.AI cs.CL cs.SY eess.SY 版本更新

Same-Origin Policy for Agentic Browsers

代理浏览器的同源策略

Xilong Wang, Xiaoxing Chen, Patrick Li, Dawn Song, Neil Gong

发表机构 * Duke University(杜克大学) Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

AI总结 研究代理浏览器中同源策略的有效性,构建SOPBench评估基准,发现现有代理浏览器频繁违反SOP,并提出SOPGuard机制来强制执行SOP,同时保持效用和低开销。

详情
AI中文摘要

代理浏览器将自主AI代理集成到Web浏览器中,使用户能够通过自然语言指令完成Web任务。同源策略(SOP)是一种基本的浏览器安全机制,可防止由脚本引起的未经授权的自动化跨源数据流。然而,SOP在代理浏览器中是否仍然有效是一个尚未系统研究的开放问题。在这项工作中,我们填补了这一空白。我们首先观察到,代理浏览器本身可以作为跨源数据流的自动化通道,可能导致SOP违规。为了研究这一现象,我们构建了SOPBench,一个用于评估代理浏览器中SOP违规的基准。我们的评估表明,现有的代理浏览器在良性设置和攻击下都频繁违反SOP。为了解决这个问题,我们提出了SOPGuard,一种针对代理浏览器定制的SOP强制机制。我们在开源代理浏览器BrowserOS中实现了SOPGuard。广泛的评估表明,SOPGuard在保持效用的同时有效地强制执行SOP,并且仅产生很小的运行时开销。我们的代码和数据可在以下网址获取:https://github.com/wxl-lxw/BrowserOS-SOPGuard。

英文摘要

Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at https://github.com/wxl-lxw/BrowserOS-SOPGuard.
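For readers unfamiliar with the policy itself: two URLs share an origin when their scheme, host, and port all match. The sketch below shows that comparison and a deliberately simple guard built on it. It is not SOPGuard's mechanism; `allow_flow` is a hypothetical helper that blocks every cross-origin flow, which is stricter than a real enforcement layer would likely be.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (scheme, host, port) triple that defines a web origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fill in the scheme's default port when the URL does not name one.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)

def same_origin(url_a, url_b):
    return origin(url_a) == origin(url_b)

def allow_flow(source_url, destination_url):
    """Hypothetical guard: an agent may move data only within one origin."""
    return same_origin(source_url, destination_url)

print(same_origin("https://example.com/a", "https://example.com:443/b"))  # True
print(same_origin("https://example.com", "http://example.com"))           # False
print(allow_flow("https://mail.example.com/inbox", "https://example.com/form"))  # False
```

The last call illustrates the channel the abstract describes: an agent reading from one origin and typing into another performs a cross-origin data flow even though no script does.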

11. 低资源、领域适配与高效训练 9 篇

2606.15044 2026-06-16 cs.CL 新提交

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

公平与效率:多语言大语言模型分词器的实证研究

Kieron Seven Jun Wei Lee, Muhammad Reza Qorib, Andrew Ivan Soegeng, Hwee Tou Ng

发表机构 * National University of Singapore(新加坡国立大学) Carnegie Mellon University(卡内基梅隆大学) SAP

AI总结 本文系统比较了11种东南亚语言上的公平分词器,发现Parity-aware BPE在效率与公平的权衡中处于帕累托前沿,而Morphology-Driven Byte Encoding在语义推理上表现最佳。

详情
AI中文摘要

多语言大语言模型(LLMs)依赖子词分词来桥接离散文本和连续神经表示。最先进的多语言LLMs通常使用字节级字节对编码(BPE)分词器,这些分词器在结构上偏向高资源语言和拉丁文字。对于代表性不足的语言使用者,特别是东南亚地区的语言,这种偏见增加了推理成本并扩大了跨语言能力差距。我们首次在涵盖11种东南亚语言的统一基准上对公平分词器进行了系统比较。除了分词器级别的压缩效率和跨语言公平性分析外,我们还通过使用相同训练数据训练的1.5B参数语言模型评估了下游任务性能。我们的结果表明,Parity-aware BPE位于效率-公平权衡的帕累托前沿,以有竞争力的成本实现了强大的压缩公平性。Morphology-Driven Byte Encoding通过形态更丰富的表示提供了最佳的语义推理性能,尽管计算成本更高。Byte Latent Transformer在下游任务中表现不佳,可能是因为其架构假设与有限低资源训练数据的约束不一致。总之,我们的发现表明跨语言公平性和分词效率并非根本矛盾,并为设计公平的多语言模型提供了实用指导。

英文摘要

Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.

2606.15161 2026-06-16 cs.CL 新提交

Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

超越逐层稀疏中的层重要性:层间扰动吸收视角

Tao Jing, Ningxin Wu, Chen Kang, Dong Yu, Changliang Li, Pengyuan Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过受控扰动实验发现大语言模型中不同层对剪枝扰动的响应存在异质性,早期层放大扰动而中后期层吸收扰动,并基于此提出吸收感知校正方法,在70%稀疏度下降低困惑度7.13%并提升零样本准确率1.02%。

Comments 10 pages, 4 figures, 4 tables. Submitted to EMNLP 2026

详情
AI中文摘要

大型语言模型(LLM)中显著的逐层冗余性使得跨层的非均匀稀疏分配成为高效压缩的标准剪枝方法。现有的逐层分配方法通过局部信号(如激活异常值或权重谱)估计分配策略,主要源于局部层重要性,而剪枝后的最终性能还受到网络后续补偿能力的影响。在本文中,我们通过受控扰动实验直接表征这一特性。我们得到以下实证发现。首先,各层对剪枝规模扰动的响应表现出高度异质性。在大多数情况下,早期层放大扰动,而中层和后期层主动吸收扰动,相对L2漂移随深度单调递减,方向重新对准未扰动的隐藏状态轨迹。其次,吸收是大扰动现象。在小扰动下,网络在所有层都表现出放大效应,而当扰动幅度增长到剪枝规模时,向吸收的转变平滑发生。这丰富了相关工作中线性化累积理论的基础。基于这些发现,我们定义了每层的吸收系数,并提出吸收感知校正——一种正交增强方法,通过将困惑度降低7.13%并在多个模型家族中在70%稀疏度下将零样本准确率提升1.02%,改进了OWL和AlphaPruning。

英文摘要

The considerable layer-wise redundancy in large language models (LLMs) has established non-uniform sparsity allocation across layers as the standard pruning approach for efficient compression. Existing layer-wise allocation methods estimate the allocation from local signals such as activation outliers or weight spectra, and so mainly reflect local layer importance, whereas the final post-pruning performance is also influenced by the network's subsequent compensatory capacity. In this paper, we directly characterize this property through controlled perturbation experiments. We make the following empirical findings. First, layers exhibit highly heterogeneous responses to pruning-scale perturbations. In most cases, early layers amplify perturbations, while middle and late layers actively absorb them, with relative L2 drift decreasing monotonically across depth and direction realigning toward the unperturbed hidden-state trajectory. Second, absorption is a large-perturbation phenomenon. Under small perturbations the network exhibits amplification across all layers, and the transition to absorption occurs smoothly as perturbation magnitude grows to pruning scale. This enriches the linearized accumulation theory underlying related work. Building on these findings, we define an absorption coefficient per layer and propose absorption-aware correction, an orthogonal augmentation that improves OWL and AlphaPruning by reducing perplexity by 7.13% and boosting zero-shot accuracy by 1.02% across multiple model families at 70% sparsity.
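The relative L2 drift mentioned above can be measured by comparing clean and perturbed hidden states layer by layer. The sketch below assumes drift is defined as the norm of the difference divided by the norm of the clean state; the toy vectors and the amplify/absorb labeling rule are illustrative assumptions, and the paper's absorption coefficient is not reproduced here.

```python
import math

def relative_l2_drift(clean, perturbed):
    """||perturbed - clean|| / ||clean|| for one layer's hidden state."""
    diff = math.sqrt(sum((p - c) ** 2 for c, p in zip(clean, perturbed)))
    norm = math.sqrt(sum(c ** 2 for c in clean))
    return diff / norm

def drift_profile(clean_layers, perturbed_layers):
    """Drift at every layer after a perturbation injected upstream."""
    return [relative_l2_drift(c, p) for c, p in zip(clean_layers, perturbed_layers)]

def layer_behaviour(drifts):
    """Label each layer transition: drift shrinking reads as absorption."""
    return ["absorb" if later < earlier else "amplify"
            for earlier, later in zip(drifts, drifts[1:])]

# Toy hidden states for three consecutive layers (hypothetical values).
clean = [[3.0, 4.0], [3.0, 4.0], [6.0, 8.0]]
perturbed = [[3.0, 5.0], [3.0, 4.5], [6.0, 8.5]]
drifts = drift_profile(clean, perturbed)
print([round(d, 3) for d in drifts])  # [0.2, 0.1, 0.05]
print(layer_behaviour(drifts))        # ['absorb', 'absorb']
```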

2606.16026 2026-06-16 cs.CL 新提交

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

领域内监督病理报告分类:从数据整理到生产匹配评估的可复现流程

Isaac Hands, Bin Huang, Adam Spannaus, John Gounley, Heidi Hanson, Eric Durbin, Sally R. Ellingson

发表机构 * University of Kentucky(肯塔基大学) UK Markey Cancer Center(肯塔基大学马基癌症中心) Kentucky Cancer Registry(肯塔基癌症登记处) Division of Cancer Biostatistics, University of Kentucky(肯塔基大学癌症生物统计学系) Oak Ridge National Laboratory(橡树岭国家实验室) Division of Biomedical Informatics, University of Kentucky(肯塔基大学生物医学信息学系)

AI总结 提出领域内监督流程解决病理报告跨注册中心性能下降问题,通过标准化数据整理、生产匹配保留集和低假阴性率操作点选择,在418k报告集上FNR降至0.003,F1提升至0.922。

详情
AI中文摘要

我们引入了一个领域内监督流程,旨在应对阻碍监督生物医学NLP模型的分布外性能下降问题,该问题在病理报告跨癌症注册中心迁移时观察到。我们的贡献是一个可复现的配方,用于从常规收集的癌症注册数据训练监督分类器。它描述了如何构建领域内训练集和生产匹配的保留集,并选择操作点以保持非常低的假阴性率(FNR),同时将审阅者工作量控制在可管理范围内。该流程通过设施分层抽样和与注册病例关联的报告单独处理来标准化数据整理,并包括盲法人工审计以估计阳性病例患病率和标签噪声。在418k报告保留集上,肯塔基模型实现了FNR 0.003和假阳性率(FPR)0.097,优于西雅图训练的MOSSAIC OncoID基线(FNR 0.010,FPR 0.183),并将F1从0.860提升至0.922。在600份报告的盲法人工审阅中,估计阳性患病率从0.500下降到0.398,表明存在大量标签噪声,错误集中在罕见原发部位。

英文摘要

We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. It describes how to build the in-domain training set and a production-matched holdout, and how to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation with facility-stratified sampling and separate handling of reports linked to registry cases, and includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.
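Choosing an operating point that caps the false-negative rate while limiting reviewer workload can be sketched as a threshold sweep on a holdout set. This is a minimal illustration under assumed conventions (a report is flagged when its score is at or above the threshold, and FPR stands in for workload); the scores and labels are made up, and the paper's actual selection procedure may differ.

```python
def rates(scores, labels, threshold):
    """Return (FNR, FPR) when reports scoring >= threshold are flagged."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_positive:
            tp += 1
        elif flagged:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return fn / (fn + tp), fp / (fp + tn)

def pick_threshold(scores, labels, max_fnr):
    """Highest threshold whose FNR stays within max_fnr.

    FNR never decreases as the threshold rises, so the last threshold that
    satisfies the cap is also the one that flags the fewest reports.
    """
    best = None
    for t in sorted(set(scores)):
        fnr, fpr = rates(scores, labels, t)
        if fnr <= max_fnr:
            best = (t, fnr, fpr)
    return best

# Hypothetical holdout scores and gold labels (1 = reportable cancer case).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]
print(pick_threshold(scores, labels, max_fnr=0.0))
print(pick_threshold(scores, labels, max_fnr=0.25))
```

On this toy data, tightening the cap from 0.25 to 0.0 lowers the chosen threshold from 0.6 to 0.35 and raises the FPR from 0 to one in three, which is the trade-off between missed cases and review load that the abstract describes.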

2606.16383 2026-06-16 cs.CL 新提交

Surpassing Scale by Efficiency: A Compact 135M Parameter Foundational LLM Natively Adapted for the Bangla Language

通过效率超越规模:一个紧凑的135M参数基础LLM原生适配孟加拉语

Rabindra Nath Nandi

发表机构 * Independent Researcher(独立研究者)

AI总结 提出一个135M参数的紧凑型解码器模型bangla-smollm-135m,通过确定性交集-追加词元合并策略解决孟加拉语子词碎片化,在零样本多任务基准上匹配或超越两倍大小模型。

Comments Submitted to a Workshop

详情
AI中文摘要

虽然自然语言处理领域由数十亿参数的架构主导,但它们在低资源、非拉丁文字中的部署对于边缘配置、移动系统和分散的本地硬件来说在计算上仍然难以承受。本文提出了bangla-smollm-135m,一个高度紧凑的1.35亿参数仅解码器基础模型,专为孟加拉文的高效语言建模而设计。通过利用TituLLMs和SmolLM2-135M之间的确定性交集-追加词元合并策略,该模型克服了子词层面的文字碎片化,同时不破坏早期预训练参数状态。在零样本多任务基准评估(PIQA_bn、OpenBookQA_bn、CommonsenseQA_bn和Bangla_MMLU)中,bangla-smollm-135m匹配或超越了其两倍大小的模型(Gemma-3-270m),并与1B参数级别的模型达到同等水平。该模型可在rnnandi/bangla-smollm-135m获取。

英文摘要

While the NLP landscape is dominated by multi-billion parameter architectures, their deployment in low-resource, non-Latin scripts remains computationally prohibitive for edge configurations, mobile systems, and decentralized local hardware. This paper presents bangla-smollm-135m, a highly compact 135-million parameter decoder-only foundational model engineered explicitly for high-efficiency language modeling in the Bangla script. By leveraging a deterministic intersect-and-append token merging strategy between TituLLMs and SmolLM2-135M, the model overcomes subword script fragmentation without destabilizing early pretrained parameter states. In zero-shot multi-task benchmark evaluations (PIQA_bn, OpenBookQA_bn, CommonsenseQA_bn, and Bangla_MMLU), bangla-smollm-135m matches or outperforms models twice its size (Gemma-3-270m) and achieves parity with models in the 1B parameter tier. The model is available at rnnandi/bangla-smollm-135m.

2606.15963 2026-06-16 cs.DC cs.AI cs.CL cs.LG 交叉投稿

PreLort: Prefix-Nested LoRA for Federated Fine-Tuning under Rank Heterogeneity

PreLort: 面向秩异构联邦微调的前缀嵌套LoRA

Muhammad Waseem, Nurbek Tastan, Andrej Jovanovic, Nicholas D. Lane, Nils Lukas, Karthik Nandakumar, Samuel Horvath

发表机构 * MBZUAI, UAE(穆罕默德·本·扎耶德人工智能大学,阿联酋) University of Cambridge, UK(剑桥大学,英国) Flower Labs, UK(Flower Labs,英国) Michigan State University, USA(密歇根州立大学,美国)

AI总结 针对联邦LoRA中异构秩导致的信息分布不均问题,提出PreLort方法,通过前缀层次化嵌套低秩结构、分段聚合规则和前缀嵌套训练策略,使低秩客户端受益于高秩客户端的丰富信息,在准确率和ROUGE-L上优于现有方法。

详情
AI中文摘要

使用LoRA等参数高效方法对大型语言模型进行联邦微调,能够实现基础模型的隐私保护适配。异构硬件资源带来了挑战,因为具有不同适配器秩的客户端无法直接聚合。现有方法虽能实现异构秩下的聚合,但未能控制信息在秩维度上的分布,导致共享低秩表示利用不充分。为此,我们提出PreLort:一种用于联邦LoRA的嵌套低秩公式,将适配器维度组织成前缀层次结构。我们的方法确保较低秩维度编码任务相关信息,而较高秩维度捕获额外容量。基于此,我们引入(i)分段聚合规则,仅对贡献于每个秩分段的客户端进行平均,避免来自零填充低秩客户端的稀释;以及(ii)前缀嵌套训练策略,在多个秩截断下优化每个适配器,鼓励有用信号集中在低秩前缀维度。这些组件共同鼓励一个一致的低秩前缀捕获最任务相关信息,而较高秩维度学习额外容量。这使得低秩客户端能够受益于高秩客户端贡献的更丰富信息,因为前缀维度被一致地学习和聚合。实验表明,我们的方法在准确率和ROUGE-L上持续优于先前的异构联邦LoRA方法,并在多个基础模型上实现了更低或相当困惑度。

英文摘要

Federated fine-tuning of large language models using parameter-efficient methods such as LoRA enables privacy-preserving adaptation of foundation models. Heterogeneous hardware resources introduce challenges, as clients with different adapter ranks cannot be directly aggregated. While existing methods enable aggregation under heterogeneous ranks, they fail to control how information is distributed across rank dimensions, leading to suboptimal use of shared low-rank representations. Instead, we propose PreLort: a nested low-rank formulation for federated LoRA that organizes adapter dimensions into a prefix hierarchy. Our approach ensures that lower-rank dimensions encode task-relevant information, while higher-rank dimensions capture additional capacity. Building on this, we introduce (i) a segment-wise aggregation rule that averages only over clients contributing to each rank segment, avoiding dilution from zero-padded lower-rank clients, and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in low-rank prefix dimensions. Together, these components encourage a consistent low-rank prefix capturing the most task-relevant information, while higher-rank dimensions learn additional capacity. This allows low-rank clients to benefit from richer information contributed by higher-rank clients, as prefix dimensions are consistently learned and aggregated. Experiments demonstrate that our method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, while achieving lower or comparable perplexity across multiple base models.
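The contrast between segment-wise aggregation and zero-padded averaging can be shown with a small sketch. It assumes each client sends one adapter factor as a list of rank rows, that row `i` means the same rank dimension for every client (the prefix property), and that clients are weighted uniformly. These are simplifications for illustration; the paper's aggregation may weight clients differently and handles both LoRA factors.

```python
def segmentwise_average(adapters):
    """Average rank row i only over the clients whose rank covers row i."""
    max_rank = max(len(a) for a in adapters)
    dim = len(adapters[0][0])
    aggregated = []
    for i in range(max_rank):
        rows = [a[i] for a in adapters if len(a) > i]
        aggregated.append([sum(r[j] for r in rows) / len(rows) for j in range(dim)])
    return aggregated

def zero_padded_average(adapters):
    """Baseline: treat missing rows as zeros and divide by all clients."""
    max_rank = max(len(a) for a in adapters)
    dim = len(adapters[0][0])
    n_clients = len(adapters)
    aggregated = []
    for i in range(max_rank):
        rows = [a[i] for a in adapters if len(a) > i]
        aggregated.append([sum(r[j] for r in rows) / n_clients for j in range(dim)])
    return aggregated

# Hypothetical clients: rank 2 (all ones) and rank 4 (all threes), dim 3.
low_rank_client = [[1.0] * 3 for _ in range(2)]
high_rank_client = [[3.0] * 3 for _ in range(4)]
clients = [low_rank_client, high_rank_client]

print(segmentwise_average(clients)[3])  # [3.0, 3.0, 3.0]
print(zero_padded_average(clients)[3])  # [1.5, 1.5, 1.5]
```

Rows 0 and 1 come out the same under both rules (2.0), while rows 2 and 3 are halved by zero padding; this is the dilution the segment-wise rule is meant to avoid.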

2606.16243 2026-06-16 cs.LG cs.CL 交叉投稿

LiFT: Local Search via Linear Programming for Overfitting-Controlled Transformers

LiFT: 通过线性规划进行局部搜索以实现过拟合可控的Transformer

Abhishek Shukla, Anikeit Khanna, Ankur Sinha, Faiz Hamid

发表机构 * Department of Management Sciences, Indian Institute of Technology Kanpur(印度理工学院坎普尔分校管理科学系) Department of Civil Engineering, Indian Institute of Technology Kanpur(印度理工学院坎普尔分校土木工程系) Operations and Decision Sciences, Indian Institute of Management Ahmedabad(印度管理学院艾哈迈达巴德分校运营与决策科学系) Brij Disa Centre for Data Science and AI, Indian Institute of Management Ahmedabad(印度管理学院艾哈迈达巴德分校Brij Disa数据科学与人工智能中心)

AI总结 提出基于线性规划的局部搜索框架,通过双层优化联合更新模型参数和正则化超参数,利用验证梯度和Hessian信息构造局部下降方向,在保持训练最优性的同时减少过拟合,实验表明在GPT-2 Small微调中持续改善测试困惑度。

Comments 22 pages, 6 figures, published in The 20th Learning and Intelligent Optimization Conference (LION 2026)

详情
AI中文摘要

本文提出了一种基于线性规划(LP)的局部搜索框架,用于微调预训练Transformer模型,并显式控制过拟合。该方法将Transformer微调表述为一个基于双层优化的正则化问题,其中模型参数和正则化超参数被联合更新。利用初始热身迭代期间收集的信息,包括验证梯度和训练Hessian信息,通过求解一个线性规划来构造局部下降方向,该方向在保持训练最优性的同时最小化缩放的方向导数。这种验证感知的下降方向能够对参数和正则化超参数进行聚焦的局部更新,从而在不需重复完整再训练周期的情况下减少过拟合。由此产生的方法称为基于线性规划的Transformer微调(LiFT),它通过系统识别任务特定的更新,而非依赖启发式或网格搜索的超参数选择,从而区别于传统微调。在WikiText-2上微调GPT-2 Small的实验表明,LiFT通过选择性调整Transformer块和正则化参数实现了有效的适应,在多种层配置和正则化设置下持续改善测试困惑度,尤其在易过拟合场景中增益显著。除了实证性能,LiFT还在Transformer微调、双层优化、局部搜索和正则化理论之间建立了原则性的联系。

英文摘要

This paper proposes a Linear Programming (LP)-based local search framework for fine-tuning pretrained transformer models with explicit control against overfitting. The approach formulates transformer fine-tuning as a bilevel optimization-based regularization problem, in which model parameters and regularization hyperparameters are jointly updated. Information collected during initial warm-up iterations, including validation gradients and training Hessian information, is used to construct a local descent direction by solving an LP that minimizes a scaled directional derivative while preserving training optimality. This validation-aware descent direction enables focused local updates of both parameters and regularization hyperparameters, reducing overfitting without requiring repeated full retraining cycles. The resulting method, termed Linear Programming-based Fine-Tuning (LiFT) for transformers, differs from conventional fine-tuning by systematically identifying task-specific updates rather than relying on heuristic or grid-based hyperparameter selection. Experiments on GPT-2 Small fine-tuned on WikiText-2 demonstrate that LiFT enables effective adaptation through selective tuning of transformer blocks and regularization parameters, yielding consistent improvements in test perplexity across multiple layer configurations and regularization settings, with particularly pronounced gains in overfitting-prone scenarios. Beyond empirical performance, LiFT establishes a principled connection between transformer fine-tuning, bilevel optimization, local search, and regularization theory.

2606.16246 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

Data Augmentations for Data-Constrained Language Model Pretraining

数据受限语言模型预训练的数据增强

Michael K. Chen, Xikun Zhang, Zhen Wang

发表机构 * UC San Diego(加州大学圣地亚哥分校) RMIT University(皇家墨尔本理工大学)

AI总结 针对数据受限下标准自回归预训练严重过拟合的问题,提出三类数据增强方法(token级噪声、序列排列、目标偏移预测),有效降低验证损失并支持数百epoch训练。

详情
AI中文摘要

随着AI实验室接近数据天花板,计算能力超过新高质量文本生成速率,语言模型预训练正转向数据受限、计算充裕的体制,需要在固定语料库上进行高效的多轮训练。标准自回归(AR)预训练在此设置下严重过拟合,早期达到最优然后持续恶化。我们研究数据增强作为正则化器来缓解过拟合,并在相同数据上实现数百轮的有效训练。我们为AR预训练引入了三类正交的增强:token级噪声(掩码、随机替换)、序列排列(从右到左预测、Fill-in-the-Middle)以及目标偏移预测($x_{t+i}$,$i > 1$)。通过系统消融实验,我们发现单个增强相对于基线延迟了过拟合并降低了验证损失,其中随机token替换在单个方法中实现了最佳最小损失。组合增强类别进一步降低了最小验证损失。我们的实验表明,数据增强缓解了AR预训练的数据低效问题,并为数据受限体制提供了有前景的解决方案。所有代码和数据可在https://github.com/michaelchen-lab/data-augmentations-for-pretraining获取。

英文摘要

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining
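
As a rough illustration of the augmentation categories named above, the sketch below implements random token replacement, right-to-left reordering, and target offset pairing on a list of token ids. It is a minimal sketch under assumptions, not the code in the linked repository: the function names, the replacement rate, and the choice to corrupt only the input sequence are all hypothetical.

```python
import random


def random_token_replacement(tokens, vocab_size, rate, rng):
    """Replace each token with a uniformly random id with probability `rate`."""
    return [rng.randrange(vocab_size) if rng.random() < rate else t for t in tokens]


def right_to_left(tokens):
    """Reverse the sequence so the model predicts tokens from right to left."""
    return tokens[::-1]


def offset_targets(tokens, offset):
    """Pair position t with target x_{t+offset}; offset=1 is next-token prediction.

    Assumes offset >= 1.
    """
    return list(zip(tokens[:-offset], tokens[offset:]))


rng = random.Random(0)
seq = [10, 11, 12, 13, 14, 15]
noisy = random_token_replacement(seq, vocab_size=100, rate=0.3, rng=rng)
reversed_seq = right_to_left(seq)
pairs = offset_targets(seq, offset=2)
```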

2606.16429 2026-06-16 cs.LG cs.CL 交叉投稿

Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Taylor-Calibrate:混合线性注意力蒸馏的原则性初始化

Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu

发表机构 * The University of Sydney(悉尼大学) Together AI University of California, Berkeley(加州大学伯克利分校) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Microsoft(微软)

AI总结 提出Taylor-Calibrate方法,利用泰勒引导的教师注意力统计初始化混合线性注意力学生模型,显著减少蒸馏所需训练令牌数。

Comments 24 pages, 9 figures

详情
AI中文摘要

混合线性注意力模型提供了一条更快长上下文推理的诱人路径:它们降低了全softmax注意力的二次成本和KV缓存负担,同时保留了Transformer模型的大部分质量。获得此类模型的一种实用方法是转换预训练的Transformer,而不是从头开始预训练新架构,但这种转换仍然脆弱。简单地将教师注意力投影复制到Gated DeltaNet(GDN)学生中并不能指定新的循环衰减、写入和输出门控动态。因此,转换后的模型通常从较差的动态状态开始,必须花费大量蒸馏令牌来修复初始化,而不是学习剩余的教师行为。我们提出了Taylor-Calibrate,一种用于混合GDN学生的轻量级初始化方法。该方法使用泰勒引导的教师注意力统计来设置值投影、记忆时间尺度、写入门和输出门,然后应用一个简短的逐层对齐步骤,使每个转换后的层与教师输出匹配。在四种教师设置和三种保留层策略下,Taylor-Calibrate提供了显著更强的零样本学生,在代表性消融中改进高达88倍,并且达到匹配恢复目标所需的训练令牌比朴素转换少4.9倍至9.2倍。

英文摘要

Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x--9.2x fewer training tokens than naive conversion.

2510.16882 2026-06-16 cs.LG cs.AI cs.CL 版本更新

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

面向LLM监督微调的效用-多样性感知在线批次选择

Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出UDS框架,利用logits矩阵核范数和轻量记忆缓冲实现高效在线批次选择,兼顾数据效用与多样性,无需外部资源,在多个基准上优于现有方法并降低训练时间。

Comments ICML 2026 accepted paper

详情
AI中文摘要

监督微调(SFT)是一种常用的技术,用于将大型语言模型(LLM)适配到下游任务。在实践中,对整个数据集进行SFT计算成本高昂,且有时会导致过拟合或偏差放大。这促进了SFT中数据筛选的兴起,即优先选择最有价值的数据进行优化。本文研究了在线批次选择系列方法,这些方法在训练过程中动态评分和过滤样本。然而,现有的流行方法通常(i)仅依赖数据的效用选择子集,而忽略多样性等其他关键因素,(ii)依赖外部资源如参考模型或验证集,以及(iii)相对于全数据集训练增加了额外训练时间。为解决这些局限,本文开发了UDS(效用-多样性采样),一个用于SFT中高效在线批次选择的框架。UDS利用logits矩阵的核范数来捕获数据效用和样本内多样性,同时通过与历史样本的轻量内存缓冲进行高效低维嵌入比较来估计样本间多样性。这种设计消除了对外部资源和不必要反向传播的需求,确保了计算效率。在多个基准上的实验表明,UDS在不同数据预算下始终优于最先进的在线批次选择方法,并且与全数据集微调相比显著减少了训练时间。代码可在 https://github.com/gfyddha/UDS 获取。

英文摘要

Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimize. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.

12. 其他/综合NLP 33 篇

2606.15026 2026-06-16 cs.CL 新提交

Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

基于生理信号的深度时间建模与集成融合多模态情感识别

Desta Haileselassie Hagos, Saurav Keshari Aryal, Patrick Ymele-Leki, Anietie Andy, Legand L. Burge

发表机构 * Howard University(霍华德大学)

AI总结 本研究评估LSTM、TCN和Transformer在WESAD数据集上的多模态情感识别性能,通过消融实验和传感器级早期融合,并采用晚期融合集成策略,最终集成方法达到98.91%准确率和98.56%宏F1分数。

Comments Accepted for publication in the 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2026). DOI: https://doi.org/10.1145/3807503.3819363

详情
AI中文摘要

生理压力和情感识别对于健康监测和情感计算非常重要。在这项工作中,我们对深度学习模型如长短期记忆网络(LSTM)、时序卷积网络(TCN)和Transformer在WESAD数据集上使用腕部和胸部传感器信号进行多模态情感识别进行了全面评估。我们通过仅在腕部和仅胸部输入上训练模型进行消融研究,以评估每个模态的单独贡献。此外,我们实现了一种晚期融合集成策略,该策略结合了在多模态输入上训练的所有三种架构的预测。我们还在传感器级别采用早期融合,即在将腕部和胸部信号输入每个模型之前进行拼接。我们的结果表明,Transformer模型在多模态设置中始终达到最高准确率,而TCN模型在仅腕部配置中表现最佳。集成方法实现了最高的总体准确率(98.91 +/- 0.13%)和宏F1分数(98.56 +/- 0.17%)。这些发现证明了传感器融合和基于集成的融合在开发鲁棒的生理情感识别系统中的有效性。

英文摘要

Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.
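
The two fusion stages mentioned above can be sketched as follows. The abstract does not say how the ensemble combines predictions, so simple probability averaging (soft voting) is an assumption here, as are the function names and the toy numbers; early fusion is shown as per-timestep concatenation and assumes the wrist and chest streams have already been aligned to a common time base.

```python
def early_fusion(wrist_window, chest_window):
    """Concatenate wrist and chest feature vectors at each timestep."""
    return [w + c for w, c in zip(wrist_window, chest_window)]


def late_fusion(prob_lists):
    """Average class-probability vectors from several models.

    Returns (predicted_label, fused_probabilities).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    fused = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Hypothetical per-model outputs for one window over three classes.
lstm_probs = [0.6, 0.3, 0.1]
tcn_probs = [0.2, 0.5, 0.3]
transformer_probs = [0.3, 0.6, 0.1]
label, fused = late_fusion([lstm_probs, tcn_probs, transformer_probs])
merged = early_fusion([[1, 2]], [[3, 4, 5]])
```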

2606.15325 2026-06-16 cs.CL 新提交

Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

先验优于证据:基于LLM的二语发音反馈中的刻板印象驱动诊断

Rong Wang, Kun Sun

发表机构 * University of Tuebingen(蒂宾根大学) Tongji University(同济大学)

AI总结 研究测试了LLM在二语发音反馈中是否基于语音证据而非预训练先验进行诊断,发现评分准确性与推理脱钩,音素级反馈收敛于固定困难音素集,且声学证据仅在直接探测目标维度时改善评分。

Comments 12 pages, 2 figures

详情
AI中文摘要

大型语言模型越来越多地被部署用于第二语言(L2)英语学习中的书面发音反馈,其假设是模型的诊断基于提供的语音证据而非预训练中的先验。该假设在1800条L2-Arctic语音上进行了测试,涵盖六种L1背景、三种音频能力LLM、四个发音维度以及五种证据条件(从纯文本基线到数值声学特征和原始音频)。每个(语音×模型×条件×维度)单元在三个指标上评分:与黄金标签的评分准确性(RA)、评估内部一致性而无真实标签的证据连贯性(EC),以及基于黄金证据评估的接地正确性(GC)。结果显示跨模型有三个发现。第一,评分准确性和接地推理脱钩:39.6%的被判断单元包含支持错误评分的内部连贯推理,而仅15.8%的推理支持正确评分。第二,音素级反馈收敛到一个固定的L2英语困难音素清单,该清单在所有六种L1背景和所有证据条件下重复出现。第三,仅当提供的特征直接探测目标维度时,声学证据才改善评分:文本化的F0范围将三个模型的音高变化接地性从(0.18-0.19)提升到(0.45-0.62),而需要目标-实现对齐的重音和音素正确性仍保持未接地。没有文本化F0值的相同音频波形无法重现这一改进。这些发现表明,当前通用LLM作为外部计算发音证据的言语化器比作为独立诊断引擎更可靠。

英文摘要

Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in priors from pretraining. This assumption is tested on 1,800 L2-Arctic utterances spanning six L1 backgrounds, three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions ranging from a text-only baseline to numeric acoustic features and raw audio. Each (utterance x model x condition x dimension) cell is scored on three metrics: Rating Accuracy (RA) against gold labels, Evidence Coherence (EC) assessing internal consistency without ground truth, and Grounded Correctness (GC) evaluated against gold evidence. Results show three findings across models. First, rating accuracy and grounded reasoning decouple: 39.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15.8% where the reasoning supports a correct rating. Second, phoneme-level feedback converges to a fixed inventory of L2-English difficulty phones that recurs across all six L1 backgrounds and all evidence conditions. Third, acoustic evidence improves the rating only when the supplied feature directly probes the target dimension: textualised F0 range raises pitch-variation grounding from (0.18-0.19) to (0.45-0.62) across all three models, while stress and phoneme correctness, which require target-to-realisation alignment, remain ungrounded. The same audio waveform without textualised F0 values does not reproduce this improvement. These findings indicate that current general-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines.

2606.15461 2026-06-16 cs.CL cs.AR 新提交

ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

ESBMC-PLC:基于SMT模型检测的IEC 61131-3梯形图程序形式化验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * The University of Manchester(曼彻斯特大学) Federal University of Amazonas (UFAM)(亚马逊联邦大学)

AI总结 提出首个原生支持梯形图(LD)的开源形式化验证工具ESBMC-PLC,通过SMT有界模型检测和k-归纳验证安全属性,在13个基准测试中正确分类61个属性,发现8个错误。

Comments 24 pages

详情
AI中文摘要

PLC在工业领域执行安全关键程序。IEC 61131-3标准下的梯形图(LD)作为主流PLC表示法,仍缺乏形式化验证:基于SMT的模型检测器无法处理LD的梯级-线圈图形。本文提出ESBMC-PLC,首个原生支持LD(PLCopen XML格式)的开源形式化验证器,作为ESBMC的新前端实现。ESBMC-PLC将LD梯级转换为GOTO IR,将PLC扫描周期建模为带有非确定性输入的while(true)循环,并通过基于SMT的有界模型检测或k-归纳检查安全属性。一个包含五个属性的YAML语言(互斥、不变性、不存在、响应、可达性)避免了时序逻辑。对22项研究(2020-2026)的调查识别出四个研究空白;ESBMC-PLC填补了其中两个。在13个基准测试(6个领域,3个来源——包括已部署的CONTROLLINO PLC和MathWorks Simulink PLC Coder)上的评估显示,在61个属性上正确分类:所有9个作者构建的程序(类别A/B)符合预期,所有4个供应商程序(类别C)正确未标注,发现8个错误(可操作反例),7个无界k-归纳证明,所有运行在Apple Silicon上低于60毫秒。与PLCverif的功能对比表明,ESBMC-PLC是唯一结合原生LD、k-归纳和SMT位向量语义的开源工具。

英文摘要

PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil graphics. This paper presents ESBMC-PLC, the first open-source formal verifier with native LD support (PLCopen XML format), implemented as a new ESBMC frontend. ESBMC-PLC translates LD rungs to GOTO IR, models the PLC scan cycle as a while(true) loop with nondeterministic inputs, and checks safety properties via SMT-based bounded model checking or k-induction. A five-property YAML language (mutual_exclusion, invariant, absence, response, reachability) avoids temporal logic. A survey of 22 studies (2020-2026) identifies four research gaps; ESBMC-PLC closes two of them. Evaluation on 13 benchmarks (6 domains, 3 sources - including deployed CONTROLLINO PLCs and MathWorks Simulink PLC Coder) shows correct classification across 61 properties: all 9 author-constructed programs (Categories A/B) as expected, all 4 vendor programs (Category C) correctly unlabeled, with 8 bugs found (actionable counterexamples), 7 unbounded k-induction proofs, all runs under 60ms on Apple Silicon. Feature comparison with PLCverif shows that ESBMC-PLC is the only open-source tool that combines native LD, k-induction, and SMT bit-vector semantics.
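
To make the scan-cycle model concrete, here is a toy sketch of a bounded check of a mutual-exclusion property over a two-rung motor interlock. It is not ESBMC-PLC and uses no SMT solver: it swaps in explicit enumeration of all input combinations for a fixed number of scans, and the rung logic, names, and bound are hypothetical. Each scan reads nondeterministic inputs and evaluates the rungs in order, as the abstract describes for the `while(true)` loop.

```python
from itertools import product


def scan_ok(state, inputs):
    """One PLC scan: rungs evaluated top to bottom with an interlock on both coils."""
    fwd, rev = state
    start_fwd, start_rev, stop = inputs
    fwd = (start_fwd or fwd) and not stop and not rev
    rev = (start_rev or rev) and not stop and not fwd
    return (fwd, rev)


def scan_buggy(state, inputs):
    """Same program with the interlock contact missing on the second rung."""
    fwd, rev = state
    start_fwd, start_rev, stop = inputs
    fwd = (start_fwd or fwd) and not stop and not rev
    rev = (start_rev or rev) and not stop
    return (fwd, rev)


def mutual_exclusion(state):
    """Safety property: the two coils are never energized together."""
    return not (state[0] and state[1])


def bounded_check(scan, prop, init, n_inputs, bound):
    """Explore all input sequences up to `bound` scans.

    Returns the input trace of the first violation found, or None.
    """
    if not prop(init):
        return []
    frontier = {init: []}
    seen = {init}
    for _ in range(bound):
        next_frontier = {}
        for state, trace in frontier.items():
            for inputs in product((False, True), repeat=n_inputs):
                new_state = scan(state, inputs)
                path = trace + [inputs]
                if not prop(new_state):
                    return path
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier[new_state] = path
        frontier = next_frontier
    return None


safe_result = bounded_check(scan_ok, mutual_exclusion, (False, False), 3, 5)
bug_trace = bounded_check(scan_buggy, mutual_exclusion, (False, False), 3, 5)
```

A bounded search like this only shows absence of violations up to the chosen depth; the k-induction mode mentioned in the abstract is what yields unbounded proofs.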

2606.16496 2026-06-16 cs.CL cs.LG 新提交

REFLEX: Reflective Evolution from LLM Experience

REFLEX: 基于大语言模型经验的反思进化

Pan Wang

AI总结 提出REFLEX框架,通过解耦视觉诊断与代码生成实现可审计的高效策略进化,在控制任务和天线阵列合成中展现优异样本效率。

详情
AI中文摘要

大型多模态语言模型已成为引导进化搜索朝向可解释程序化策略的强大工具。然而,现有框架依赖单一模型调用来同时解释视觉行为证据并合成修正代码。这种诊断-修复纠缠造成了不透明的反馈循环,掩盖了突变背后的理由,并阻止了跨独立运行的算法洞察保留。为了实现可审计且高效的策略搜索,我们认为视觉诊断必须在结构上与代码生成解耦。我们提出了REFLEX,一个无需训练的进化框架,实现了这种解耦。在REFLEX中,一个具备视觉能力的Critic首先将任务特定的行为证据提炼为结构化的、可审计的诊断。随后,一个文本优化的Actor利用这些诊断以及一个持久且自我进化的可重用代码片段技能记忆来合成子代策略。这种架构不仅提供了透明的突变轨迹,还实现了跨运行的程序化知识迁移。在控制基准(Lunar Lander、Acrobot、Pendulum)和一个36维天线阵列合成任务上的广泛评估展示了卓越的样本效率。值得注意的是,REFLEX在不到10次大语言模型调用中解决了Acrobot和Pendulum,并在Lunar Lander上达到了最佳归一化加权分数1.092,实现了极具竞争力的最终性能,同时显著加速了透明策略的早期发现。

英文摘要

Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize corrective code. This diagnosis-repair entanglement creates an opaque feedback loop, obscuring the rationale behind mutations and preventing the retention of algorithmic insights across independent runs. To achieve auditable and efficient policy search, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a train-free evolutionary framework that operationalizes this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. Subsequently, a text-optimized Actor synthesizes child policies using these diagnoses alongside a persistent, self-evolving Skill Memory of reusable code snippets. This architecture not only provides transparent mutation traces but also enables cross-run programmatic knowledge transfer. Extensive evaluations across control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task demonstrate exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.

2606.16684 2026-06-16 cs.CL 新提交

Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

渐进式知识引导的大型语言模型框架用于轴承故障诊断

Jinghan Wang, Gaoliang Peng, Yanjun Chen, Wei Zhang, Wentao Wu, Tianchen Liu

发表机构 * Harbin Institute of Technology, China(哈尔滨工业大学,中国) Eastern Institute of Technology, China(东方技术研究所,中国)

AI总结 提出渐进式物理引导多尺度振动信号处理框架,通过81维测量描述符、故障自适应分割和隐式知识编码,在四个数据集上实现98.49%诊断精度并降低12.6倍计算成本。

详情
AI中文摘要

基于振动的轴承故障诊断需要解决三个相互关联的测量挑战,包括全局统计特征效率与局部瞬态信号保真度之间的权衡、测量特征对底层故障物理的可追溯性不足,以及跨诊断尺度的多源测量信息融合无效。本文提出一个渐进式物理引导的多尺度振动信号处理框架,在统一诊断流程中解决所有三个挑战。一个源自轴承运动学和特征缺陷频率的81维测量描述符,建立了物理可追溯的特征空间,实现每样本约20毫秒的实时故障筛查。然后,一种故障自适应信号分割机制将分析注意力引导至基于物理先验的故障相关波形区域,无需手动特征工程。在训练过程中,结构化的故障机制知识进一步隐式编码到模型参数中,实现自主多尺度测量融合,推理时无需外部知识依赖。在四个公开基准数据集上,在不同运行条件下验证,该框架实现了98.49%的诊断准确率,相对于信号级基线计算成本降低了12.6倍。可解释性分析证实诊断特征激活与已建立的轴承故障力学一致,支持安全关键工业系统中的测量可追溯性。

英文摘要

Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.
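
The characteristic defect frequencies that the descriptor draws on follow from standard bearing kinematics. The sketch below computes the four usual ones, assuming a stationary outer race and a rotating inner race; it does not reproduce the paper's 81-dimensional descriptor, and the function name and example geometry are illustrative only.

```python
import math


def defect_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_deg=0.0):
    """Characteristic bearing frequencies in Hz (outer race fixed, inner race rotating).

    ball_d and pitch_d must share the same length unit.
    """
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_deg))
    return {
        "FTF": 0.5 * shaft_hz * (1 - ratio),             # cage (fundamental train)
        "BPFO": 0.5 * n_balls * shaft_hz * (1 - ratio),  # ball pass, outer race
        "BPFI": 0.5 * n_balls * shaft_hz * (1 + ratio),  # ball pass, inner race
        "BSF": 0.5 * (pitch_d / ball_d) * shaft_hz * (1 - ratio ** 2),  # ball spin
    }


# Illustrative geometry: 9 rolling elements, 30 Hz shaft speed.
freqs = defect_frequencies(shaft_hz=30.0, n_balls=9, ball_d=7.94, pitch_d=39.04)
```

Spectral energy near these frequencies and their harmonics is a common basis for physically traceable features of the kind the abstract describes.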

2606.16806 2026-06-16 cs.CL 新提交

LLM-based Visual Code Completion for Aerospace Geometric Design

基于LLM的航空航天几何设计视觉代码补全

Hau Kit Yong, Robert Marsh, Edmar A. Silva, András Sóbester, Stuart E. Middleton

发表机构 * Faculty of Engineering and Physical Sciences, University of Southampton(南安普顿大学工程与物理科学学院) School of Electronics and Computer Science, University of Southampton(南安普顿大学电子与计算机科学学院)

AI总结 提出基于LLM的视觉编程副驾驶系统,结合ReAct方法和GPT 5.4,用于航空航天几何设计,并构建Wingbuilder插件库和AVPD数据集,用户试验表明系统能生成有用建议但推理速度慢。

详情
AI中文摘要

近年来,大型语言模型(LLMs)和视觉语言模型(VLMs)在视觉代码补全能力上取得了显著进步,但航空航天行业优先考虑安全性和可解释性而非快速采用LLM,目前尚无公开宣布的基于LLM的几何设计副驾驶系统在商业上被航空航天原始设备制造商(OEMs)使用。本文提出了一种基于LLM的视觉编程副驾驶应用,用于航空航天工程设计任务,采用ReAct方法的视觉编程变体和GPT 5.4。除了副驾驶系统,我们还描述了Wingbuilder,这是一个新的Grasshopper插件库,包含用于航空航天特定几何抽象的自定义组件,以及一个相关的航空航天视觉编程数据集(AVPD),包含18个由航空航天专家设计的不同难度级别的任务及其真实解决方案。我们通过用户试验评估了副驾驶应用,试验涉及来自一家大型飞机制造公司的两位经验丰富的航空航天工程师。我们发现,我们的副驾驶视觉编程ReAct方法成功生成了参与者认为有帮助的建议,但缓慢的ReAct推理时间使其实用性仅限于更复杂、耗时的任务,在这些任务中等待好的副驾驶解决方案建议是值得的。参与者表示他们喜欢这个工具,并愿意在未来使用它。

英文摘要

Recent advances in both Large Language Models (LLMs) and Vision Language Models (VLMs) have seen a step change in their ability to perform visual code completion, but the aerospace industry, which prioritizes safety and explainability over rapid LLM adoption, currently has no publicly announced LLM-based geometric design copilot systems in commercial use by aerospace Original Equipment Manufacturers (OEMs). This paper presents an LLM-based visual programming copilot application for aerospace engineering design tasks, using a visual programming variant of the ReAct methodology and GPT 5.4. In addition to the copilot, we describe Wingbuilder, a new Grasshopper plugin library with custom components for aerospace-specific geometry abstraction, and an associated Aerospace Visual Programming Dataset (AVPD) with 18 aerospace expert designed tasks at different levels of difficulty alongside ground truth solutions. We evaluate our copilot application with a user trial involving two experienced aerospace engineers from a large aircraft manufacturing company. We find our copilot visual programming ReAct methodology was successful in generating suggestions that participants found helpful, but slow ReAct inference times limit its usefulness to more complex time-consuming tasks where waiting for a good copilot solution suggestion was worthwhile. Participants reported they liked the tool and would be willing to use it in the future.

2606.14823 2026-06-16 q-bio.GN cs.AI cs.CL 交叉投稿

Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26,278 target-disease pairs with temporal validation and feature ablation

人类遗传证据与跨治疗领域药物批准相关:一项基于26,278个靶点-疾病对的观察性分析,含时间验证和特征消融

Victoria Paterson

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 本研究通过分析26,278个靶点-疾病对,发现具有遗传关联的靶点药物批准率是无遗传关联的3.25倍,但遗传证据单独预测价值有限,并识别出1,433个遗传支持的早期阶段靶点-疾病对作为假设生成资源。

详情
AI中文摘要

遗传证据在已批准药物靶点中富集:在一项对来自Open Targets和ChEMBL的26,278个靶点-疾病对的观察性分析中,具有任何遗传关联的靶点批准率是无遗传关联靶点的3.25倍(OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42)。一项考虑共享同一基因的靶点-疾病对非独立性的靶点水平分析给出的OR为2.79(bootstrap 95% CI 2.22-3.53);肿瘤学对水平OR为6.72,在靶点水平衰减至2.71,说明非独立性会夸大特定领域的估计值。该富集在2015年后的批准中得以复现(OR = 3.51, p = 1.72e-8)。跨六种证据类型的特征消融显示,仅文献挖掘就占分类器性能的大部分(AUPRC = 0.099,而所有特征为0.109),这与批准后出版物导致的时间泄漏一致。排除文献后,其余证据类型仍保留高于基线的信号(AUPRC = 0.084,为基线的1.63倍)。敏感性分析将对水平OR的范围限定在3.25至4.93之间。仅遗传证据的AUPRC绝对增益仅为1.0个百分点,且最佳模型校准较差;该分类器的实际预测价值有限。我们编录了1,433个遗传支持的1/2期靶点-疾病对作为假设生成资源。所有发现均为观察性结果。

英文摘要

Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.
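
For readers unfamiliar with how an odds ratio and its interval are obtained from a 2x2 table, a minimal sketch follows. The abstract gives neither the underlying counts nor the interval method, so the Wald (log-scale) interval and the counts used here are assumptions for illustration only and do not reproduce the reported OR of 3.25.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval on the log scale.

    a: genetic support, approved       b: genetic support, not approved
    c: no genetic support, approved    d: no genetic support, not approved
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the paper.
or_value, ci_low, ci_high = odds_ratio_ci(a=30, b=70, c=10, d=90)
```

This pair-level calculation treats every target-disease pair as independent, which is the assumption the abstract's target-level bootstrap analysis is designed to relax.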

2606.16183 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

LLM-Powered Virtual Population for Demand Simulation and Pricing

基于LLM的虚拟人群用于需求模拟与定价

Chengpiao Huang, Kaizheng Wang

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出一种LLM驱动的虚拟人群模型,通过混合客户画像和LLM评估购买概率,生成需求分布,支持风险感知定价,在H&M数据集上表现最优。

Comments 18 pages, 7 figures

详情
AI中文摘要

我们开发了一个基于LLM的虚拟人群模型,用于模拟定价决策中的需求,其中产品由丰富的非结构化信息(如文本描述和图像)描述,决策者不仅需要平均需求预测,还需要反事实价格的不确定性估计。我们的模型将暴露的客户表示为从有限混合客户画像中的抽取。对于每个画像、产品和候选价格,LLM使用结构化画像信息和非结构化产品信息来引出画像级别的购买概率。这些概率通过校准的混合权重聚合,形成总需求的预测分布。生成的模拟器可以在各种定价目标下评估反事实价格,包括期望收入和风险感知标准(如条件风险价值)。我们在一个包含产品描述和图像的在线H&M时尚数据集上测试了该框架。校准后的基于LLM的模拟器在所考虑的模型中实现了最佳的整体预测性能,并支持样本高效的定价决策。我们的框架提供了一种实用的方法,将LLM用作需求模拟器,适用于历史需求数据有限但产品信息丰富的产品。通过生成完整的需求预测分布而不仅仅是点预测,它使管理者能够比较候选价格、量化需求不确定性,并选择针对平均收入或风险感知目标的价格。

英文摘要

We develop an LLM-powered virtual population model that simulates demand for pricing decisions, in settings where products are described by rich unstructured information, such as text descriptions and images, and where decision makers need not only mean-demand predictions but also uncertainty estimates for counterfactual prices. Our model represents exposed customers as draws from a finite mixture of customer personas. For each persona, product, and candidate price, an LLM elicits a persona-level purchase probability using both structured persona information and unstructured product information. These probabilities are aggregated through calibrated mixture weights to form a predictive distribution of aggregate demand. The resulting simulator can evaluate counterfactual prices under various pricing objectives, including expected revenue and risk-aware criteria such as conditional value at risk. We test the framework on an online H&M fashion dataset with product descriptions and images. The calibrated LLM-based simulator achieves the best overall predictive performance among the models considered, and supports sample-efficient pricing decisions. Our framework provides a practical way to use LLMs as demand simulators for products with limited historical demand data but rich product information. By producing a full predictive demand distribution rather than only a point forecast, it enables managers to compare candidate prices, quantify demand uncertainty, and choose prices that target either average-case revenue or risk-aware objectives.
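
A small simulation can illustrate how persona-level purchase probabilities and mixture weights turn into a revenue distribution that supports both expected-revenue and risk-aware comparisons. This is a sketch under assumptions: in the paper the probabilities are elicited from an LLM and the weights are calibrated, whereas here both are hard-coded hypothetical numbers, and the function names are invented for illustration.

```python
import random


def simulate_revenue(price, weights, buy_prob, n_customers, n_sims, rng):
    """Draw customers from the persona mixture and record revenue per simulation."""
    personas = range(len(weights))
    revenues = []
    for _ in range(n_sims):
        drawn = rng.choices(personas, weights=weights, k=n_customers)
        demand = sum(rng.random() < buy_prob[k] for k in drawn)
        revenues.append(price * demand)
    return revenues


def cvar_lower(values, alpha):
    """Mean of the worst alpha-fraction of outcomes (lower-tail CVaR)."""
    ordered = sorted(values)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Hypothetical inputs: two personas, purchase probabilities per candidate price.
weights = [0.3, 0.7]
buy_prob_by_price = {20: [0.60, 0.20], 30: [0.45, 0.05]}

rng = random.Random(0)
summary = {}
for price, probs in buy_prob_by_price.items():
    revenues = simulate_revenue(price, weights, probs, 50, 2000, rng)
    summary[price] = (sum(revenues) / len(revenues), cvar_lower(revenues, 0.1))
```

Comparing the two entries of `summary` per price shows how a price that looks best on average can rank differently once the lower tail is taken into account.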

2606.16497 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

daVinci-kernel:通过强化学习协同进化技能选择、总结与利用的GPU内核优化

Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出daVinci-kernel框架,通过强化学习联合训练技能选择、策略生成和技能总结三个智能体,共享LLM骨干,实现GPU内核优化,在KernelBench上超越先前最优模型。

详情
AI中文摘要

GPU内核优化代表了一种范式,其中功能正确性被假定,执行效率是目标。我们提出daVinci-kernel,一个强化学习框架,通过动态演化的技能库将技能发现与技能利用相结合。daVinci-kernel联合训练三个共享一个LLM骨干的智能体:技能选择智能体通过BM25和LLM重排序检索相关技术,策略智能体基于所选技能生成多轮CUDA/Triton内核,技能总结智能体将成功轨迹提炼为可复用技能。候选技能仅在基于执行的验证确认可复现加速后才被添加。所有三个智能体共享单个LLM骨干,通过多样性过滤数据上的结构化SFT冷启动初始化,然后通过多轮REINFORCE和每个智能体的优势估计进行端到端联合优化。在KernelBench上,daVinci-kernel-14B在Fast$_1$阈值下,Level 1、Level 2和Level 3分别达到37.2%、70.6%和32.2%,优于先前最强的RL训练模型Dr.Kernel-14B。

英文摘要

GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, Dr.Kernel-14B.

2606.16999 2026-06-16 cs.SE cs.CL cs.LG 交叉投稿

Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models

无信号下的选择,通过表达恢复:冻结小代码模型的事后证伪操作符的测量研究

Mehmet Iscan

AI总结 本研究测量了冻结小代码模型的事后语义操作符(如选择、验证、修复)的有效性,发现它们均未优于Best-of-N,并揭示了覆盖墙、能力剪刀和共识陷阱等机制原因;而表达层恢复(M1)通过鲁棒提取和签名对齐提升了准确率。

Comments 33 pages, 4 figures, 8 tables

详情
AI中文摘要

冻结的小代码模型(<=1.5B参数,本地运行无需微调)适用于离线或隐私受限场景,但常输出看似合理实则错误的程序。一种自然的补救措施是事后操作符,无需重新训练即可选择、验证、修复或重新处理模型的样本;其原则形式是波普尔式的:用严格测试攻击每个候选,保留通过者。我们测量了这类操作符是否有帮助。在单一确定性执行预言机和无泄漏、计算匹配的协议下,26种语义事后操作符(选择、验证、修复、淘汰、组合、合理否决、生成条件)与Best-of-N(BoN)进行了比较;在测试的单元和基准上,没有一种操作符在保留集上的准确率优于BoN。这种负面结果源于机制原因:覆盖墙(系统性困难任务失败,更深采样无法挽救)、能力剪刀(有能力的生成器使得可见测试通过者之间几乎不存在可区分的错误)以及近乎空的共识陷阱(可见通过但隐藏错误的多数,无泄漏选择器需要与正确替代方案同时出现,但这种情况很少发生)。一个无分布假设的无害界无法在零观察伤害下保证伤害率<=alpha,除非n>=45。两种操作符在语义输出空间之外的不同轴线上有所帮助。表达层恢复(M1)是这里唯一的准确率提升,它恢复了标准提取器丢弃的正确程序(鲁棒提取和公开测试签名对齐);它无害(b10=0),无泄漏,并在HumanEval+上使DeepSeek-Coder-1.3B提升了+12个任务(p=2.4e-4)。自适应共识早停(ACE)是一种校准的计算节省控制(约节省19%,零伤害)。M1和选择负面结果在HumanEval+和MBPP+上跨三个模型单元复现。教训是:在指责语义事后推理之前,先修复测试框架并测量覆盖范围。

英文摘要

Frozen small code models (<=1.5B parameters, run locally without fine-tuning) suit offline and privacy-constrained use, but often emit plausible-but-wrong programs. A natural remedy is a post-hoc operator that selects, verifies, repairs, or re-processes the model's samples without retraining; in principled form it is Popperian: attack each candidate with a severe test, keep what survives. We measure whether such operators help. Under one deterministic execution oracle and a leakage-free, matched-compute protocol, 26 semantic post-hoc operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) are evaluated against Best-of-N (BoN); on the cells and benchmarks tested, none improves held-out accuracy over BoN. The negative is mechanistic: a coverage wall (systematic hard-task failures deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among visible-test passers), and a near-empty consensus trap (the visible-pass-but-hidden-wrong majority a leakage-free selector needs rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound cannot certify a harm rate <=alpha at zero observed harm unless n>=45. Two operators help on a different axis, outside the semantic output space. An expression-layer recovery (M1), the only accuracy gain here, recovers correct programs the standard extractor discards (robust extraction and public-test signature alignment); it does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4). An adaptive consensus early-stop (ACE) is a calibrated compute-saving control (~19% saving, zero harm). M1 and the selection negative replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning.
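The abstract states that a distribution-free do-no-harm bound needs n>=45 to certify a harm rate <=alpha at zero observed harm, without giving alpha or the confidence level here. One reading that reproduces the number 45 is the exact one-sided binomial bound with alpha = 0.05 and a 90% confidence level; both values are assumptions of this sketch, not figures from the abstract. With zero harms in n independent trials, a true harm rate of alpha is ruled out once (1 - alpha)^n <= delta.

```python
import math


def min_samples_for_zero_harm(alpha: float, delta: float) -> int:
    """Smallest n such that observing 0 harms in n i.i.d. trials excludes
    a true harm rate above alpha with confidence 1 - delta.

    Uses the exact binomial tail: P(0 harms | rate = alpha) = (1 - alpha)**n,
    and requires that probability to be at most delta.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("alpha and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(delta) / math.log(1.0 - alpha))


# Assumed values: alpha = 0.05 and delta = 0.10 are not stated in the abstract.
n = min_samples_for_zero_harm(alpha=0.05, delta=0.10)
print(n)  # 45
```

Under these assumed values, 44 clean trials still leave (0.95)^44 above 0.10, which is why the threshold lands at 45.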

2310.06555 2026-06-16 cs.CL cs.AI cs.LG cs.MA 版本更新

It's About Time: Temporal References in Emergent Communication

关于时间:涌现通信中的时间指代

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

发表机构 * University of Southampton(南安普顿大学) The Alan Turing Institute(艾伦·图灵研究所) University of Brescia(布雷西亚大学)

AI总结 研究涌现通信中时间指代缺失的问题,发现仅改变损失函数不足以使时间指代涌现,需要修改架构(采用不同的批处理方法);采用该方法后,超过95%的智能体发展出时间指代,为提升通信效率奠定基础。

Comments 23 pages main body and 31 pages supplementary material, 9 figures in main body. Code available at https://github.com/olipinski/TRG

详情
Journal ref
Journal of Artificial Intelligence Research 86, Article 11 (June 2026)
AI中文摘要

涌现通信使代理能够开发定制语言以提高通信效率。尽管已知时间结构在自然语言中的重要性,但在涌现通信中尚无时间指代的证据。本文通过探索代理如何交流时间关系来填补这一空白。我们分析了时间指代涌现的三个潜在因素:环境因素、外部因素和架构因素。实验表明,仅改变损失函数不足以使时间指代涌现;相反,架构变化是必要的。代理架构的最小变化——使用不同的分批方法——允许时间指代涌现。在强调时间关系的时间指代游戏环境中,将此修改后的设计与标准架构进行比较。分析显示,超过95%使用修改后分批方法的代理发展出了时间指代,而无需改变其损失函数。我们认为时间指代对于未来提高代理通信效率是必要的,使未来代理能够使用更接近最优编码的方式,与纯组合语言相比。这些见解为将时间指代纳入其他涌现通信设置以及研究语言的其他方面提供了基础。

英文摘要

Emergent communication enables agents to develop bespoke languages that improve communication efficiency. Despite the known importance of temporal structure in natural language, there is no existing evidence of temporal references in emergent communication. This paper addresses this gap by exploring how agents communicate about temporal relationships. We analyse three potential factors for the emergence of temporal references: environmental, external, and architectural. Our experiments demonstrate that altering the loss function is insufficient for temporal references to emerge; rather, architectural changes are necessary. A minimal change in agent architecture, using a different batching method, allows the emergence of temporal references. This modified design is compared with the standard architecture in a temporal referential games environment, which emphasises temporal relationships. The analysis shows that over 95% of the agents with the modified batching method develop temporal references, without changes to their loss function. We consider temporal referencing necessary for future improvements to the agents' communication efficiency, enabling future agents to use a closer-to-optimal coding as compared to purely compositional languages. These insights provide the basis for incorporation of temporal references into other emergent communication settings, and investigation of other aspects of language.

2410.00812 2026-06-16 cs.CL q-bio.NC 版本更新

Generative causal testing to bridge data-driven models and scientific theories in language neuroscience

生成式因果测试:弥合语言神经科学中数据驱动模型与科学理论之间的鸿沟

Richard Antonello, Chandan Singh, Shailee Jain, Aliyah Hsu, Sihang Guo, Jianfeng Gao, Bin Yu, Alexander Huth

发表机构 * Computer Science Department, University of Texas at Austin(德克萨斯大学奥斯汀分校计算机科学系) Microsoft Research(微软研究院) Neurosurgery Department, University of California(加州大学神经外科系) EECS Department, University of California(加州大学电气工程与计算机科学系) Statistics Department, University of California(加州大学统计学系) Center for Computational Biology, University of California(加州大学计算生物学中心) Neuroscience Department, University of California(加州大学神经科学系)

AI总结 提出生成式因果测试(GCT)框架,利用大语言模型生成简洁解释并通过LLM生成刺激进行验证,成功解释大脑区域的语言选择性,弥合数据驱动模型与科学理论之间的差距。

Comments Accepted to Nature Neuroscience, please cite that version

详情
AI中文摘要

来自大型语言模型的表示在预测语言刺激的BOLD fMRI反应方面非常有效。然而,这些表示在很大程度上是不透明的:尚不清楚语言刺激的哪些特征驱动每个大脑区域的反应。我们提出了生成式因果测试(GCT),这是一个从预测模型生成大脑语言选择性的简洁解释,然后在后续实验中使用LLM生成的刺激测试这些解释的框架。该方法成功解释了个体体素和皮层感兴趣区域(ROI)的选择性,包括前额叶皮层中新识别的微ROI。我们表明,解释准确性与底层预测模型的预测能力和稳定性密切相关。最后,我们证明GCT可以剖析具有相似功能选择性的脑区之间的细微差异。这些结果表明,LLM可用于弥合数据驱动模型与正式科学理论之间日益扩大的差距。

英文摘要

Representations from large language models are highly effective at predicting BOLD fMRI responses to language stimuli. However, these representations are largely opaque: it is unclear what features of the language stimulus drive the response in each brain area. We present generative causal testing (GCT), a framework for generating concise explanations of language selectivity in the brain from predictive models and then testing those explanations in follow-up experiments using LLM-generated stimuli. This approach is successful at explaining selectivity both in individual voxels and cortical regions of interest (ROIs), including newly identified microROIs in prefrontal cortex. We show that explanatory accuracy is closely related to the predictive power and stability of the underlying predictive models. Finally, we show that GCT can dissect fine-grained differences between brain areas with similar functional selectivity. These results demonstrate that LLMs can be used to bridge the widening gap between data-driven models and formal scientific theories.

2410.21803 2026-06-16 cs.CL 版本更新

SimSiam Naming Game: A Unified Approach for Emergent Communication and Representation Learning

SimSiam 命名博弈:一种用于涌现通信和表示学习的统一方法

Nguyen Le Hoang, Tadahiro Taniguchi, Tianwei Fang, Akira Taniguchi, Masatoshi Nagano

发表机构 * Kyoto University(京都大学)

AI总结 提出SimSiam命名博弈(SSNG),通过对称自监督表示对齐替代采样更新,实现无反馈的涌现通信,在CIFAR-10和ImageNet-100上线性探针分类准确率显著优于现有方法。

详情
AI中文摘要

涌现通信(EmCom)研究智能体如何在没有预定义语言的情况下通过交互发展符号通信。最近的框架,如Metropolis-Hastings命名博弈(MHNG),将EmCom形式化为在联合注意力下通过交互协商共享外部表示的学习,没有明确的成功或奖励反馈。然而,MHNG依赖于基于采样的更新,在高维感知空间中遭受高拒绝率,使得学习过程对于复杂视觉数据集样本效率低下。在这项工作中,我们提出了SimSiam命名博弈(SSNG),一种无反馈的EmCom框架,用自主智能体之间的对称自监督表示对齐目标替代基于采样的更新。基于自监督学习的变分推断概率解释,SSNG将符号涌现形式化为智能体之间通过消息交换介导的潜在表示的对齐过程。为了实现端到端的基于梯度的优化,离散符号消息通过Gumbel-Softmax松弛学习,在保持可微性的同时保留了通信的离散性质。在CIFAR-10和ImageNet-100上的实验表明,SSNG学习到的涌现消息在线性探针分类准确率上显著高于参照博弈、重建博弈和MHNG产生的消息。这些结果表明,自监督表示对齐为多智能体系统中的无反馈EmCom提供了一种有效机制。

英文摘要

Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language. Recent frameworks, such as the Metropolis--Hastings Naming Game (MHNG), formulate EmCom as the learning of shared external representations negotiated through interaction under joint attention, without explicit success or reward feedback. However, MHNG relies on sampling-based updates that suffer from high rejection rates in high-dimensional perceptual spaces, making the learning process sample-inefficient for complex visual datasets. In this work, we propose the SimSiam Naming Game (SSNG), a feedback-free EmCom framework that replaces sampling-based updates with a symmetric, self-supervised representation alignment objective between autonomous agents. Building on a variational inference--based probabilistic interpretation of self-supervised learning, SSNG formulates symbol emergence as an alignment process between agents' latent representations mediated by message exchange. To enable end-to-end gradient-based optimization, discrete symbolic messages are learned via a Gumbel--Softmax relaxation, preserving the discrete nature of communication while maintaining differentiability. Experiments on CIFAR-10 and ImageNet-100 show that the emergent messages learned by SSNG achieve substantially higher linear-probe classification accuracy than those produced by referential games, reconstruction games, and MHNG. These results indicate that self-supervised representation alignment provides an effective mechanism for feedback-free EmCom in multi-agent systems.
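The abstract says discrete messages are learned through a Gumbel-Softmax relaxation. The sketch below shows only the sampling side of that relaxation in plain Python: Gumbel noise is added to the logits and a temperature-scaled softmax yields a "soft" one-hot vector, from which a discrete symbol can be read off by argmax. The logits, the temperature, and the helper names are illustrative assumptions; the gradient path (for example a straight-through estimator) would require an autodiff framework and is not shown, and none of this is the authors' code.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample: softmax((logits + g) / temperature),
    where g is i.i.d. Gumbel(0, 1) noise."""
    noisy = []
    for logit in logits:
        # Keep u strictly inside (0, 1) so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_symbol(soft):
    """Index of the largest entry, i.e. the discrete symbol actually sent."""
    return max(range(len(soft)), key=soft.__getitem__)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
soft = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
symbol = hard_symbol(soft)
print(soft, symbol)
```

Lower temperatures push the soft vector toward a one-hot vector, at the cost of higher-variance gradients when the relaxation is used for training.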

2507.00783 2026-06-16 cs.CL cs.DL 版本更新

Generative AI and the future of scientometrics: current topics and future questions

生成式人工智能与科学计量学的未来:当前主题与未来问题

Benedetto Lepori, Jens Peter Andersen, Karsten Donnay

发表机构 * Università della Svizzera italiana(瑞士意大利大学) University of Aarhus(奥胡斯大学) University of Zurich(苏黎世大学)

AI总结 本文提出基于文本语义与语用维度的概念框架,分析生成式AI在科学计量学中的优势与局限,并指导其可解释、可操作的应用。

Comments Scientometrics (2026)

详情
AI中文摘要

在本文中,我们为科学计量学中关于生成式人工智能(GenAI)的辩论做出贡献。我们认为,从试错方法转向可解释和可操作的使用,需要原则性地理解GenAI相对于其他技术和人类判断的优势与劣势。为此,我们引入了一个基于文本语义维度(即词语赋予的意义)与语用维度(即其在交际情境中的嵌入)之间区别的概念框架。我们利用这一框架来解释GenAI在科学计量学中的应用结果,并为用户提供指导。具体而言,我们得出结论:需要考虑的关键参数包括任务的性质、分析的粒度水平以及目标是描述性、推理性还是评价性的。这些参数导致了使用GenAI和人机集成的不同策略。最后,我们提出,通过生成大量科学语言,GenAI可能会影响用于衡量科学的文本特征,如作者、词汇和参考文献。我们认为,仔细的实证工作和理论反思对于保持解释AI时代知识生产不断演变模式的能力至关重要。

英文摘要

In this paper, we contribute to the debate on generative artificial intelligence (GenAI) in scientometrics. We argue that moving from a trial-and-error approach to an explainable and actionable use requires a principled understanding of strengths and weaknesses of GenAI as compared with other techniques and with human judgment. To this end, we introduce a conceptual framework based on the distinction between the semantic dimensions of texts, i.e. the meanings attributed to words, and their pragmatic dimension, i.e. their embedding within communicative situations. We leverage this framework to interpret the results of applications of GenAI in scientometrics and to provide guidance to users. Specifically, we conclude that key parameters to be considered are the nature of the task, the level of granularity of the analysis and whether the goal was descriptive, inferential or evaluative. These parameters lead to different strategies for using GenAI and human-machine integration. Finally, we suggest that, by generating large amounts of scientific language, GenAI might affect textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production in the age of AI.

2606.06646 2026-06-16 cs.CL cs.AI 版本更新

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

CAF-Gen: 一种用于丰富论证结构的多智能体系统

Jakub Bąba, Jarosław A. Chudziak

发表机构 * Faculty of Electronics and Information Technology, Warsaw University of Technology(华沙理工大学电子与信息技术学院)

AI总结 提出CAF-Gen多智能体框架,通过迭代创建-评审流程将浅层论证结构自动转换为符合Carneades论证框架的丰富模型,克服单次生成的结构不稳定性。

Comments Accepted for publication in the proceedings of ICCCI 2026

详情
AI中文摘要

从自然文本中形式化复杂推理是计算语言学的核心挑战之一。它要求系统不仅理解关键词,还要理解文本中嵌入的上下文和复杂推理。当前的论证挖掘技术能够识别基本的主张和前提,但往往难以捕捉高级模式(如Carneades论证框架)所需的更丰富的结构信息,该框架包含前提类型、证明标准和论证模式等特征。我们通过引入CAF-Gen来解决这一局限性,这是一个自动化的多智能体框架,旨在将浅层论证结构丰富为符合CAF的论证模型。通过采用迭代的创建者-评审者流水线,创建者智能体的输出由批评智能体验证以确保结构完整性。这种多智能体协作对于缓解单次生成模型典型的结构不稳定性至关重要。我们的实验表明,迭代反馈循环提高了所得数据的质量,并与原始标注实现了强对齐,同时生成了结构更丰富的模型。我们的发现表明,多智能体系统可以克服单次生成的局限性,为自动建模形式论证提供了一种稳健的方法。

英文摘要

Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent's output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.

2606.06834 2026-06-16 cs.CL q-bio.GN 版本更新

The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

暗调控组:从基因组基础模型中分离可预测性与调控性

Chahat Baranwal, Aaditya Baranwal, Lakshya Nitin Tandon

发表机构 * IIT Jodhpur(印度理工学院焦特布尔分校) University of Central Florida(中佛罗里达大学) Northeastern University(东北大学)

AI总结 本研究提出残差化-置换诊断方法,从基因组基础模型的计算机诱变评分中分离序列可预测性与调控信号,揭示10kb近端调控边界,并验证跨架构分解可区分可预测性层与调控输出层,为暗基因组调控研究提供通用工具。

详情
AI中文摘要

高级别胶质瘤通过与神经元的突触整合到神经回路中,这引发了一个问题:哪些非编码元件塑造了肿瘤细胞中的突触形成基因表达。写在暗基因组上的调控程序,我们称之为暗调控组(dark regulome),是探索的自然底物,而序列基础模型通过计算机诱变(ISM)提供了一条零样本路径;然而,基于似然的评分与局部序列可预测性存在同义反复的耦合,使得调控解释不充分。在三个架构不同的基础模型(Caduceus-Ph、HyenaDNA、Enformer)和92个胶质瘤相关位点的30,448个暗基因组元件上,我们引入了一种残差化-置换诊断方法,以分离由可预测性驱动和由调控驱动的RIS方差。一个尖锐的10kb近端调控边界在我们应用的所有控制中仍然存在,但LM衍生的元件类别层次结构则不然:一个六特征线性基线在AUC=0.985时匹配Caduceus的十分位数成员。跨架构分解清晰地分离了序列可预测性层(两个语言模型共同对长且可预测的转座元件进行排序)和调控输出层(只有Enformer保留了区分cCRE的信号),两个前100列表之间完全没有重叠。然后,保守性、脑cis-eQTL和STRING-PPI交叉检查锚定了哪些生物学信息得以保留:所有三个模型的前100个元件在匹配脑eQTL方面每个模型富集了3.3倍($p_\mathrm{emp} < 5\times 10^{-3}$),而一个诱人的转座元件调控层和一个显著的NRXN1+NLGN1蛋白对收敛在构建适当的置换检验后均未通过。我们将该诊断方法作为任何基于ISM的调控研究的通用方法工具提供。

英文摘要

High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the dark regulome, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.

2606.13751 2026-06-16 cs.CL 版本更新

Which Models Perform Better in Inheritance Reasoning?

哪些模型在继承推理中表现更好?

Mohammed Amine Mouhoub, Chahinez Bouchekif

发表机构 * Paris Dauphine University(巴黎多菲纳大学) University of Abou Bekr Belkaïd(阿布·贝克尔·贝尔卡伊德大学)

AI总结 本文比较了商业和开源大语言模型在伊斯兰继承推理任务中的表现,发现商业模型在识别继承人、应用排除规则和保持推理一致性方面更优,其中Gemini 2.5 Flash表现最佳。

详情
AI中文摘要

本文介绍了PSL团队在QIAS 2026阿拉伯伊斯兰继承推理共享任务中的参与情况。该任务评估大语言模型解决需要法律解释、多步推理和精确数值计算的继承案例的能力。我们在统一的提示策略下比较了商业和开源模型,以评估它们在最小任务特定适应下的结构化法律推理中的有效性。我们的结果显示两个模型系列在可靠性上存在明显差距。商业模型在识别合格继承人、应用排除规则以及保持推理步骤一致性方面表现出更强的性能。相比之下,开源模型表现出更大的不稳定性,特别是在涉及依赖法律决策和分数份额调整的案例中。最佳性能由Gemini 2.5 Flash实现,其MRE为0.989。

英文摘要

This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation. We compare commercial and open-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task-specific adaptation. Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps. In contrast, open-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of 0.989.

2606.13691 2026-06-16 cs.CY cs.CL 版本更新

Incentives Of EdTech: A Systematic Review Of EduNLP Research

教育科技的激励:EduNLP研究的系统综述

Gabrielle Gaudeau, Aoife O'Driscoll, Jasper Degraeuwe, Andrew Caines, Donya Rooein, Zeerak Talat

发表机构 * ALTA Institute, Computer Laboratory, University of Cambridge(剑桥大学ALTA研究所、计算机实验室) Ghent University(根特大学) Bocconi University(博科尼大学) University of Edinburgh(爱丁堡大学)

AI总结 通过系统综述204篇ACL教育应用论文,揭示教育NLP研究中私营部门激励与教育基础设施需求之间的张力,发现教师作为受益者被系统性低估(33.3%),实际部署罕见(9.8%),伦理参与趋于承认而非行动。

Comments 10 main pages (13 appendix pages), 20 figures, accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications @ ACL 2026

详情
AI中文摘要

尽管自然语言处理社区投入了大量资源来开发支持这一转变的教育技术(EdTech),但在教育利益相关者中,谁的利益得到了最好的服务仍不清楚。在本文中,我们对2024年和2025年发表在计算语言学协会教育应用构建特别兴趣小组会议上的204篇论文进行了系统文献综述,并与更广泛的ACL文集中的EdTech论文进行了验证。通过考察利益相关者包容性和研究任务的优先级,我们的发现揭示了一个关键张力:私营部门激励与教育基础设施的基本需求之间的推拉。我们的分析表明,教师作为研究受益者被系统性低估(33.3%),尽管他们受影响最大;实际部署仍然罕见(9.8%);伦理参与倾向于承认而非行动。借鉴我们语料库中的典范论文,我们为更负责任的EduNLP研究实践提供了具体建议。

英文摘要

While the Natural Language Processing community has dedicated significant resources to developing educational technologies (EdTech) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education. In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics' Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private-sector incentives and the foundational needs of educational infrastructure. Our analysis reveals that teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, that real-world deployment remains rare (9.8%), and that ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices.

2506.00955 2026-06-16 cs.CL cs.SD eess.AS 版本更新

Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

利用大语言模型为讽刺检测标注讽刺语音

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

发表机构 * University of Groningen(格罗宁根大学) Speech Technology Lab(语音技术实验室) Center for Language and Cognition(语言与认知中心)

AI总结 本文提出利用大语言模型生成讽刺语音数据集,通过人类验证提升标注质量,并在公开数据集上验证了检测性能,最终引入PodSarc数据集,实现了73.63%的F1分数。

Comments Interspeech 2025; Project page: https://github.com/Abel1802/PodSarc

详情
AI中文摘要

讽刺通过语气和语境改变意义,但语音中检测讽刺仍具挑战性,因数据稀缺。现有检测系统常依赖多模态数据,限制了仅语音可用场景的应用。为此,我们提出一个利用大语言模型(LLMs)生成讽刺数据集的标注流程。使用公开的以讽刺为主的播客,我们采用GPT-4o和LLaMA 3进行初始讽刺标注,随后由人类验证以解决分歧。我们通过在公开讽刺数据集上比较标注质量和检测性能,验证了该方法的有效性。最后,我们引入PodSarc,一个通过此流程生成的大规模讽刺语音数据集。检测模型实现了73.63%的F1分数,证明了该数据集作为讽刺检测研究基准的潜力。

英文摘要

Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset. Using a publicly available sarcasm-focused podcast, we employ GPT-4o and LLaMA 3 for initial sarcasm annotations, followed by human verification to resolve disagreements. We validate this approach by comparing annotation quality and detection performance on a publicly available sarcasm dataset using a collaborative gating architecture. Finally, we introduce PodSarc, a large-scale sarcastic speech dataset created through this pipeline. The detection model achieves a 73.63% F1 score, demonstrating the dataset's potential as a benchmark for sarcasm detection research.

2602.14819 2026-06-16 cs.CL 版本更新

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Testimole-Conversational: 一个包含300亿词的意大利讨论板语料库(1996-2024)用于语言建模和社会语言学研究

Matteo Rinaldi, Rossella Varvara, Viviana Patti

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 该语料库为意大利大语言模型预训练和语言社会学研究提供了丰富的讨论板文本资源,涵盖1996-2024年间超过300亿词的意大利语信息。

详情
Journal ref
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora (CMLC-12) @ LREC 2026
AI中文摘要

我们介绍了"Testimole-conversational",一个包含意大利语讨论板消息的大量语料库。该语料库超过300亿词(1996-2024),使其成为训练原生意大利语大语言模型的理想数据集。此外,讨论板消息对于语言学和社会学分析也具有相关性。该语料库捕捉了丰富的计算机中介交流内容,提供了关于非正式书面意大利语、话语动态和在线社交互动的深入见解。除了对NLP应用如语言建模、领域适应和对话分析的相关性外,它还支持对数字通信中语言变异和社会现象的研究。该资源将免费提供给研究社区。

英文摘要

We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

2508.12365 2026-06-16 cs.IR cs.AI cs.CL 版本更新

TaoSR1: The Thinking Model for E-commerce Relevance Search

TaoSR1:电商相关性搜索的思考模型

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba(淘宝与天猫集团)

AI总结 本文提出TaoSR1框架,通过CoT引导的监督微调、离线采样与DPO优化,解决电商搜索中相关性预测的推理误差与幻觉问题,实现高效部署。

详情
Journal ref
KDD '26: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2026
AI中文摘要

查询-商品相关性预测是电商搜索的核心任务。基于BERT的模型在语义匹配上表现优异,但缺乏复杂的推理能力。尽管大型语言模型(LLMs)被探索,大多数仍使用判别性微调或蒸馏到小模型进行部署。我们提出一个框架,直接部署LLMs用于此任务,解决关键挑战:推理链(CoT)误差累积、判别性幻觉和部署可行性。我们的框架TaoSR1包括三个阶段:(1)使用CoT的监督微调以培养推理能力;(2)离线采样与pass@N策略和直接偏好优化(DPO)以提高生成质量;(3)基于难度的动态采样与组相对策略优化(GRPO)以缓解判别性幻觉。此外,后CoT处理和基于累积概率的分区方法使在线部署高效。TaoSR1在离线数据集上显著优于基线,并在在线双人评估中取得显著优势,引入了将CoT推理应用于相关性分类的新范式。

英文摘要

Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
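Stage (2) of the framework relies on Direct Preference Optimization. For readers unfamiliar with it, the sketch below evaluates the standard DPO loss for a single preference pair, given sequence log-probabilities under the trained policy and under a frozen reference model. This is the generic published objective, not TaoSR1's implementation; the log-probability values and `beta` are made-up illustration inputs.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, chosen only to show the direction of the loss.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # policy prefers the chosen answer
print(neutral, improved)
```

When the policy matches the reference the loss equals log 2, and it falls as the policy raises the chosen response relative to the rejected one.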

2510.18355 2026-06-16 cs.CL cs.HC cs.IR 版本更新

KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers

KrishokBondhu:一种基于检索增强的语音农业咨询呼叫中心,面向孟加拉语农民

Mohd Ruhul Ameen, Akif Islam, Farjana Aktar, M. Saifuzzaman Rafat

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出KrishokBondhu,一种基于检索增强生成框架的语音农业咨询平台,通过OCR处理农业手册等资料,结合大语言模型生成回答,实现孟加拉语农民的实时农业指导。

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情
Journal ref
2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)
AI中文摘要

在孟加拉国,许多农民仍难以获得及时的农业指导。本文提出了KrishokBondhu,一种基于检索增强生成(RAG)框架的语音咨询平台,整合农业手册、推广手册和NGO出版物,通过OCR流程处理并索引到向量数据库中,实现语义检索。通过电话界面,农民可以接收实时、上下文感知的建议:语音识别将孟加拉语查询转换为文本,RAG模块检索相关信息,大语言模型(Gemma 3-4B)生成基于事实的回答,文本到语音将答案以孟加拉语口语形式传达。在试点评估中,KrishokBondhu对72.7%的多样化农业查询产生了高质量的回答。与KisanQRS基准相比,其综合得分为4.53(5分制),对比3.13,提高了44.7%,特别是在上下文丰富性和完整性方面有显著提升,同时保持了相当的相关性和技术特异性。语义相似性分析进一步显示检索上下文与回答质量之间有强相关性。KrishokBondhu展示了结合呼叫中心可及性、多语言语音交互和现代RAG技术,为偏远孟加拉国农民提供专家级农业指导的可行性。

英文摘要

In Bangladesh, many farmers still struggle to access timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework for Bengali-speaking farmers. The system combines agricultural handbooks, extension manuals, and NGO publications, processes them through an OCR-based pipeline, and indexes the curated content in a vector database for semantic retrieval. Through a phone-based interface, farmers can receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant information, a large language model (Gemma 3-4B) generates a grounded response, and text-to-speech delivers the answer in spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries. Compared to the KisanQRS benchmark, it achieved a composite score of 4.53 versus 3.13 on a 5-point scale, with a 44.7% improvement and especially large gains in contextual richness and completeness, while maintaining comparable relevance and technical specificity. Semantic-similarity analysis further showed a strong correlation between retrieved context and answer quality. KrishokBondhu demonstrates the feasibility of combining call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers.
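The reported 44.7% improvement can be checked directly from the two composite scores quoted in the abstract (4.53 versus 3.13 on a 5-point scale), assuming it is a relative improvement over the KisanQRS baseline. The helper name below is introduced only for this check.

```python
def relative_improvement(new_score: float, baseline: float) -> float:
    """Relative gain of new_score over baseline, in percent."""
    return (new_score - baseline) / baseline * 100.0


# Composite scores quoted in the abstract.
pct = relative_improvement(4.53, 3.13)
print(f"{pct:.1f}%")  # 44.7%
```

The absolute gap is 1.40 points; dividing by the baseline of 3.13 gives the quoted figure.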

2602.12639 2026-06-16 cs.CL 版本更新

CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

CLASE:一种中文法律文本风格评估的混合方法

Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 本文提出CLASE混合方法,通过结合语言特征和经验指导的LLM评估,提升法律文本风格质量,实验表明其比传统方法更符合人类判断。

Comments Accepted at LREC 2026

详情
AI中文摘要

大型语言模型生成的法律文本通常能实现合理的事实准确性,但往往无法遵循法律写作的专门风格规范和语言惯例。为提高风格质量,建立可靠的评估方法是关键第一步。然而,让法律专家手动开发此类指标不现实,因为法律写作中的隐含风格要求难以明确化为明确的评分标准。同时,现有自动评估方法也存在不足:基于参考的指标将语义准确性与风格忠实度混为一谈,而LLM作为评判者的方法则存在不透明和不一致的问题。为解决这些挑战,我们引入CLASE(中文法律风格评估),一种专注于法律文本风格表现的混合评估方法。该方法结合了基于语言特征的评分和经验指导的LLM作为评判者评分。特征系数和LLM评分经验均从真实法律文件与其LLM恢复版本的对比对中学习。这种混合设计以透明且无参考的方式捕捉了表层特征和隐含的风格规范。在200份中文法律文件上的实验表明,CLASE在与人类判断的一致性方面显著优于传统指标和纯LLM作为评判者方法。除了改进的一致性,CLASE还提供可解释的评分分解和改进建议,为法律文本生成中的专业风格评估提供了可扩展和实用的解决方案(CLASE的代码和数据可在:https://github.com/rexera/CLASE获得)。

英文摘要

Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE are available at: https://github.com/rexera/CLASE).

2511.08507 2026-06-16 cs.CL cs.AI 版本更新

Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

面向孟加拉语手语翻译与研究的孟加拉语句子-gloss配对数据集

Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel

发表机构 * Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology(孟加拉国工程技术大学计算机科学与工程系)

AI总结 本文介绍了新数据集Bangla-SGP,包含1000个人工标注的句子-gloss配对,并通过基于规则的检索增强生成管道扩充了约3000个合成配对,用于孟加拉语手语翻译和研究。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 10457-10466, ELRA, Palma, Mallorca, Spain, May 2026
AI中文摘要

由于缺乏面向句子级翻译的大规模数据集,孟加拉语手语(BdSL)翻译是一项低资源自然语言处理任务。相应地,该领域现有研究局限于词和字母级别的检测。在本工作中,我们介绍了Bangla-SGP,一个新的平行数据集,包含1000个人工标注的句子-gloss配对,并通过基于规则的检索增强生成(RAG)管道,使用句法和形态学规则扩充了约3000个合成配对。孟加拉语口语句子的gloss序列由单独的gloss组成,这些gloss是孟加拉语手语支持的词汇,并作为连续手语的中间表示。我们的数据集由1000个高质量孟加拉语句子组成,这些句子由专业手语者手动注释为gloss序列。增强过程结合了基于规则的语言学策略和提示工程技术,这些技术通过批判性分析我们的人工标注句子-gloss配对以及与专业手语者密切合作而获得。此外,我们微调了几种基于transformer的模型,如mBart50、Google mT5、GPT4.1-nano,并使用BLEU分数评估其句子到gloss的翻译性能。基于这些评估指标,我们比较了模型在我们数据集和RWTH-PHOENIX-2014T基准上的gloss翻译一致性。

英文摘要

Bangla Sign Language (BdSL) translation represents a low-resource NLP task due to the lack of large-scale datasets that address sentence-level translation. Correspondingly, existing research in this field has been limited to word- and alphabet-level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs, which was augmented with around 3,000 synthetically generated pairs using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses, which are Bangla sign supported words and serve as an intermediate representation for a continuous sign. Our dataset consists of 1,000 high-quality Bangla sentences that are manually annotated into a gloss sequence by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we have adopted by critically analyzing our human-annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models such as mBart50, Google mT5, and GPT4.1-nano and evaluate their sentence-to-gloss translation performance using BLEU scores. Based on these evaluation metrics, we compare the models' gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.

2509.01182 2026-06-16 cs.AI cs.CL cs.HC cs.IR cs.MA 版本更新

Question-to-Knowledge (Q2K): Multi-Agent Generation of Inspectable Facts for Product Mapping

问题到知识(Q2K):多智能体生成可检查的事实以实现产品映射

Wonduk Seo, Taesub Shin, Hyunjin An, Dokyun Kim, Seunghyun Lee

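Affiliations * The University of Tokyo; KISTI

AI summary Q2K is a multi-agent framework that uses large language models for reliable product SKU mapping. It improves accuracy and robustness by generating disambiguation questions, running web searches, and reusing deduplicated reasoning, and it targets hard cases such as bundle identification and brand origin disambiguation.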

Comments Accepted by IEEE BigData 2025 Industry Track

Journal ref
2025 IEEE International Conference on Big Data (BigData), Macau, China, 2025, pp. 2646-2653
AI中文摘要

识别两个产品列表是否指向相同的库存单位(SKU)是电子商务中的持续挑战,尤其是在缺乏显式标识符且产品名称在不同平台上差异较大的情况下。基于规则的启发式方法和关键词相似性经常因忽略品牌、规格或捆绑配置的细微区别而误分类。为克服这些限制,我们提出了问题到知识(Q2K),一个多智能体框架,利用大语言模型(LLMs)进行可靠的SKU映射。Q2K集成了:(1)一个推理代理,生成定向的辨析问题;(2)一个知识代理,通过聚焦的网络搜索解决这些问题;(3)一个去重代理,重用已验证的推理轨迹以减少冗余并确保一致性。人类在循环机制进一步细化不确定情况。在真实世界消费品数据集上的实验表明,Q2K超越了强大的基线,实现了在捆绑识别和品牌来源辨析等困难场景中的更高准确性和鲁棒性。通过重用检索到的推理而不是发出重复搜索,Q2K在准确性和效率之间取得了平衡,提供了一种可扩展且可解释的解决方案用于产品整合。

英文摘要

Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in e-commerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule-based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question-to-Knowledge (Q2K), a multi-agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human-in-the-loop mechanism further refines uncertain cases. Experiments on real-world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.
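
The three-agent flow described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class `Q2KSketch`, the stub functions `toy_ask` and `toy_search`, and the exact-match question cache are all assumptions made for the example. In particular, the abstract does not say how the Deduplication Agent matches earlier reasoning traces, so a normalized-string cache stands in for it here, and no real LLM or web search is called.

```python
from typing import Callable, Dict, List


def normalize(question: str) -> str:
    """Collapse case and whitespace so equivalent questions share one cache key."""
    return " ".join(question.lower().split())


class Q2KSketch:
    """Hypothetical wiring of the three roles named in the abstract."""

    def __init__(self,
                 ask: Callable[[str, str], List[str]],
                 search: Callable[[str], str]) -> None:
        self.ask = ask                    # stand-in for the Reasoning Agent
        self.search = search              # stand-in for the Knowledge Agent
        self.cache: Dict[str, str] = {}   # stand-in for the Deduplication Agent
        self.search_calls = 0

    def facts_for(self, listing_a: str, listing_b: str) -> Dict[str, str]:
        """Return question -> answer facts, searching only for unseen questions."""
        facts: Dict[str, str] = {}
        for question in self.ask(listing_a, listing_b):
            key = normalize(question)
            if key not in self.cache:
                self.cache[key] = self.search(question)
                self.search_calls += 1
            facts[question] = self.cache[key]
        return facts


def toy_ask(a: str, b: str) -> List[str]:
    # A real Reasoning Agent would prompt an LLM; this stub asks one fixed question per listing.
    return [f"Is '{a}' a bundle?", f"Is '{b}' a bundle?"]


def toy_search(question: str) -> str:
    # A real Knowledge Agent would run a web search; this stub uses a keyword check.
    return "yes" if "6-pack" in question else "no"


pipeline = Q2KSketch(toy_ask, toy_search)
first = pipeline.facts_for("Cola 330ml", "Cola 330ml 6-pack")
second = pipeline.facts_for("cola 330ml", "Cola 330ml 6-pack")  # served from the cache
```

The second call differs from the first only in letter case, so both of its questions hit the cache and `search_calls` stays at 2. A production system would need a more tolerant matcher and a validation step before a trace is reused.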

2509.02093 2026-06-16 cs.CL cs.AI cs.IR 版本更新

Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

通过对比改进:基于检索增强的对比推理用于自动提示优化

Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu

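Affiliations * University of Science and Technology of China

AI summary This paper proposes the CRPO framework, which improves prompt optimization through contrastive reasoning: it contrasts exemplars of differing quality retrieved from the HelpSteer2 dataset to make prompt generation more robust and interpretable.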

Comments Preprint

Journal ref
2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Dekalb, IL, USA, 2025, pp. 269-272
AI中文摘要

自动提示优化近期作为一种提升大型语言模型(LLMs)提示质量的策略,旨在生成更准确和有用的响应。然而,大多数先前工作集中在直接提示精炼或模型微调,忽略了利用LLM内在推理能力从对比示例中学习的潜力。本文提出对比推理提示优化(CRPO),一种新颖的框架,将提示优化建模为检索增强的推理过程。我们的方法从HelpSteer2数据集检索top k参考提示-响应对,该数据集是一个开源集合,每个响应均标注了有用性、正确性、连贯性、复杂性和冗余性。我们构建了两种互补的优化范式:(1)分层对比推理,其中LLM比较高质量、中等质量和低质量的示例(提示和响应)以通过反思推理优化自身生成;(2)多指标对比推理,其中LLM分析每个评估维度的最佳示例并整合其优势以生成优化提示。通过显式对比高质量和低质量示例,CRPO使模型能够推断为何某些提示成功而其他失败,从而实现更鲁棒和可解释的优化。在HelpSteer2基准测试中的实验结果表明,CRPO显著优于基线方法。我们的发现突显了对比、检索增强推理在推进自动提示优化方面的潜力。

英文摘要

Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs' inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval-augmented reasoning process. Our approach retrieves top-k reference prompt-response pairs from the HelpSteer2 dataset, an open-source collection where each response is annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high-, medium-, and low-quality exemplars (both prompts and responses) to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best exemplars along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high- and low-quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.
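
The tiered contrastive reasoning step described above might be assembled along these lines. This is a sketch under stated assumptions: the token-overlap retriever, the split into thirds, the record fields, the toy corpus, and every function name are hypothetical, and the abstract does not specify how CRPO retrieves or tiers its exemplars. The output is only the text that would be handed to an LLM; no model is called.

```python
from typing import Any, Dict, List

Exemplar = Dict[str, Any]  # hypothetical record: "prompt", "response", "helpfulness"


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a toy stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_top_k(query: str, corpus: List[Exemplar], k: int) -> List[Exemplar]:
    """Return the k exemplars whose prompts overlap most with the query."""
    return sorted(corpus, key=lambda ex: overlap(query, ex["prompt"]), reverse=True)[:k]


def tier(exemplars: List[Exemplar]) -> Dict[str, List[Exemplar]]:
    """Split exemplars into high/medium/low thirds by score (expects at least 3)."""
    ranked = sorted(exemplars, key=lambda ex: ex["helpfulness"], reverse=True)
    third = max(1, len(ranked) // 3)
    return {
        "high": ranked[:third],
        "medium": ranked[third:len(ranked) - third],
        "low": ranked[len(ranked) - third:],
    }


def build_prompt(query: str, tiers: Dict[str, List[Exemplar]]) -> str:
    """Lay out the tiers so an LLM can contrast them before rewriting the prompt."""
    parts = [f"Prompt to improve: {query}"]
    for name in ("high", "medium", "low"):
        for ex in tiers[name]:
            parts.append(f"[{name}-quality exemplar] {ex['prompt']} -> {ex['response']}")
    parts.append("Explain why the high-quality exemplars work better, then rewrite the prompt.")
    return "\n".join(parts)


corpus = [  # invented toy data, not taken from HelpSteer2
    {"prompt": "explain recursion in python", "response": "step-by-step answer", "helpfulness": 4},
    {"prompt": "explain recursion", "response": "short answer", "helpfulness": 2},
    {"prompt": "recursion in python please", "response": "off-topic answer", "helpfulness": 0},
    {"prompt": "best pizza toppings", "response": "a list of toppings", "helpfulness": 3},
]
query = "explain recursion in python for beginners"
retrieved = retrieve_top_k(query, corpus, k=3)
tiers = tier(retrieved)
prompt_text = build_prompt(query, tiers)
```

The multi-metric variant from the abstract would follow the same pattern, picking the best exemplar per annotated dimension instead of tiering on a single score.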

2410.13439 2026-06-16 cs.LG cs.CL cs.CV 版本更新

Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning

多标签监督对比学习中的相似性-差异性损失

Guangming Huang, Yunfei Long, Cunjin Luo

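Affiliations * University of Essex; Queen Mary University of London

AI summary This paper proposes a Similarity-Dissimilarity Loss that dynamically re-weights samples to address positive-sample selection in multi-label settings, provides theoretical proofs, and unifies single-label and multi-label contrastive learning. Experiments show that it outperforms baselines on image, text, and medical data.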

Comments Accepted by Transactions on Machine Learning Research (TMLR)

AI中文摘要

监督对比学习通过利用标签信息取得了显著成功;然而,在多标签场景中确定正样本仍是一个关键挑战。在多标签监督对比学习(MSCL)中,多标签关系尚未完全定义,导致正样本识别和对比损失函数构建存在歧义。为解决这些挑战,我们:(i)系统地制定了MSCL中的多标签关系;(ii)提出了一种新颖的相似性-差异性损失,根据相似性和差异性因素动态重新加权样本;(iii)通过严谨的数学分析提供了理论支持,支持我们的方法制定和有效性;(iv)为单标签和多标签监督对比损失提供统一形式和范式。我们在图像和文本模态上进行了实验,并进一步将其扩展到医疗领域。结果表明,我们的方法在全面评估中始终优于基线,证明了其有效性和鲁棒性。

英文摘要

Supervised contrastive learning has achieved remarkable success by leveraging label information; however, determining positive samples in multi-label scenarios remains a critical challenge. In multi-label supervised contrastive learning (MSCL), multi-label relations are not yet fully defined, leading to ambiguity in identifying positive samples and formulating contrastive loss functions to construct the representation space. To address these challenges, we: (i) systematically formulate multi-label relations in MSCL, (ii) propose a novel Similarity-Dissimilarity Loss, which dynamically re-weights samples based on similarity and dissimilarity factors, (iii) further provide theoretically grounded proofs for our method through rigorous mathematical analysis that supports the formulation and effectiveness, and (iv) offer a unified form and paradigm for both single-label and multi-label supervised contrastive loss. We conduct experiments on both image and text modalities and further extend the evaluation to the medical domain. The results show that our method consistently outperforms baselines in comprehensive evaluations, demonstrating its effectiveness and robustness.
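
To make the idea of re-weighting positives by label overlap concrete, here is a small sketch. The abstract does not state the loss formula, so the weight used below (shared-label fraction multiplied by a penalty for extra labels) is an illustrative assumption and should not be read as the paper's Similarity-Dissimilarity Loss. The function names `pair_weight` and `weighted_supcon` are likewise hypothetical.

```python
import math
from typing import List, Set


def pair_weight(anchor: Set[str], other: Set[str]) -> float:
    """Illustrative weight in [0, 1]: shared-label fraction times an extra-label penalty."""
    shared = len(anchor & other)
    if shared == 0:
        return 0.0  # no common label: not treated as a positive
    similarity = shared / len(anchor)
    dissimilarity = 1.0 / (1 + len(other - anchor))
    return similarity * dissimilarity


def weighted_supcon(sims: List[float], weights: List[float], tau: float = 0.1) -> float:
    """Weighted supervised contrastive loss for one anchor.

    sims[j] is the similarity between the anchor and sample j (anchor excluded);
    weights[j] is that sample's positive weight, e.g. from pair_weight.
    """
    total = sum(weights)
    if total == 0:
        return 0.0  # the anchor has no positives in this batch
    log_denominator = math.log(sum(math.exp(s / tau) for s in sims))
    weighted_log_probs = sum(w * (s / tau - log_denominator) for s, w in zip(sims, weights))
    return -weighted_log_probs / total


anchor_labels = {"cat", "outdoor"}
batch_labels = [{"cat", "outdoor"}, {"cat", "indoor"}, {"car"}]
weights = [pair_weight(anchor_labels, labels) for labels in batch_labels]
loss = weighted_supcon(sims=[0.9, 0.5, 0.1], weights=weights)
```

With single-label data the weights collapse to 0 or 1 and the function reduces to the usual supervised contrastive loss for that anchor, which mirrors the unification of single-label and multi-label cases that the abstract mentions.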

2508.13028 2026-06-16 cs.CL 版本更新

Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis

将双模反讽检测器的反馈损失整合到反讽语音合成中

Zhu Li, Yuqing Zhang, Xiyuan Gao, Devraj Raghuvanshi, Nagendra Kumar, Shekhar Nayak, Matt Coler

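Affiliations * University of Groningen; Brown University; Indian Institute of Technology Indore

AI summary This paper proposes a speech synthesis method that integrates feedback loss from a bi-modal sarcasm detector and uses transfer learning with two-stage fine-tuning to improve the quality and naturalness of synthesized sarcastic speech.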

Comments Speech Synthesis Workshop 2025

AI中文摘要

反讽语音合成,即生成能有效传达反讽的语音,对于增强娱乐和人机交互等应用中的自然交互至关重要。然而,由于反讽的细微语调特征以及标注反讽语音数据的有限性,合成反讽语音仍具挑战性。为此,本研究引入一种新颖方法,将双模反讽检测模型的反馈损失整合到TTS训练过程中,以增强模型捕捉和传达反讽的能力。此外,通过迁移学习,预训练在朗读语音上的语音合成模型经历两阶段微调。首先,在涵盖各种语音风格的多样化数据集上微调,包括反讽语音。第二阶段,使用专门针对反讽语音的数据集进一步优化模型,提升生成反讽感知语音的能力。客观和主观评估显示,所提方法提高了合成语音的质量、自然度和反讽感知性。

英文摘要

Sarcastic speech synthesis, which involves generating speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains a challenge due to the nuanced prosody that characterizes sarcasm, as well as the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process, enhancing the model's ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. First, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined using a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that our proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.
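
One simple way to picture a detector-based feedback loss is as an extra term added to the synthesis loss. The sketch below assumes a weighted sum with a negative log-likelihood term. The abstract does not give the actual formulation, so the functions `sarcasm_feedback_loss` and `total_loss` and the `weight` parameter are hypothetical.

```python
import math


def sarcasm_feedback_loss(p_sarcastic: float, eps: float = 1e-12) -> float:
    """Negative log-likelihood that the detector labels the synthesized speech as sarcastic."""
    return -math.log(max(p_sarcastic, eps))


def total_loss(tts_loss: float, p_sarcastic: float, weight: float = 0.5) -> float:
    """Assumed combination: synthesis loss plus a weighted detector feedback term."""
    return tts_loss + weight * sarcasm_feedback_loss(p_sarcastic)


# The same synthesis loss is penalized more when the detector doubts the sarcasm.
confident = total_loss(tts_loss=1.0, p_sarcastic=0.9)
doubtful = total_loss(tts_loss=1.0, p_sarcastic=0.1)
```

In an actual training loop both terms would be differentiable tensors, with the detector frozen or trained separately. Plain floats are used here only to show how the two terms are combined.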

2502.16560 2026-06-16 cs.AI cs.CL cs.SI 版本更新

An Analytical Emotion Framework of Rumour Threads on Social Media

社交媒体谣言线中的分析情绪框架

Rui Xing, Boyang Sun, Kun Zhang, Preslav Nakov, Timothy Baldwin, Jey Han Lau

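Affiliations * University of Edinburgh

AI summary This paper proposes a multi-aspect emotion detection framework that analyses emotional differences between rumour and non-rumour threads, finds that rumours trigger more negative emotions while non-rumours evoke more positive ones, and uses causal analysis to examine how emotions propagate.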

Comments Accepted to ICWSM 2025 MisD Workshop

AI中文摘要

在线社交媒体中的谣言对现代社会构成重大风险,推动了对谣言发展机制的深入理解。本文聚焦谣言与情绪在线讨论中的交互,构建了一个多方面情绪分析框架,对比谣言与非谣言线,并进行情绪的关联与因果分析。我们应用该框架于现有广泛使用的谣言数据集,进一步理解在线社交媒体线的情绪动态。框架显示谣言引发更多负面情绪(如愤怒、恐惧、悲观),而非谣言引发更多积极情绪。情绪具有传染性,谣言传播负面情绪,非谣言传播正面情绪。因果分析显示惊讶连接谣言与其他情绪;悲观来自悲伤和恐惧,而乐观源于喜悦和爱。

英文摘要

Rumours in online social media pose significant risks to modern society, motivating the need for better understanding of how they develop. We focus specifically on the interface between emotion and rumours in threaded discourses, building on the surprisingly sparse literature on the topic, which has largely focused on a single aspect of emotions within the original rumour posts themselves and largely overlooked the comparative differences between rumours and non-rumours. In this work, we take one step further to provide a comprehensive analytical emotion framework with multi-aspect emotion detection, contrasting rumour and non-rumour threads and providing both correlation and causal analysis of emotions. We apply our framework to existing widely used rumour datasets to further understand the emotion dynamics in online social media threads. Our framework reveals that rumours trigger more negative emotions (e.g., anger, fear, pessimism), while non-rumours evoke more positive ones. Emotions are contagious: rumours spread negativity, while non-rumours spread positivity. Causal analysis shows that surprise bridges rumours and other emotions; pessimism comes from sadness and fear, while optimism arises from joy and love.

2504.08609 2026-06-16 cs.CL cs.AI 版本更新

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

面向英文文本仇恨言论多标签分类的机器学习模型与数据集综述

Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian Reuter

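Affiliations * Technical University of Darmstadt, Science and Technology for Peace and Security (PEASEC)

AI summary This survey reviews 46 publications and analyses 28 datasets suited to training multi-label classification models, revealing heterogeneity in label set, size, meta-concept, and other properties. It identifies key issues such as inconsistent evaluation and a preference for BERT and RNN architectures, and formulates ten research recommendations.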

Comments 35 pages, 4 figures, 4 tables

Journal ref
ACM Transactions on Knowledge Discovery from Data (2026)
AI中文摘要

在线仇恨言论的传播对个人、在线社区和社会整体都有严重负面影响。鉴于此以及海量仇恨内容的规模,内容审核和执法人员及研究人员对机器学习模型自动分类仇恨言论产生了兴趣。尽管大多数科学作品将仇恨言论分类视为二元任务,但实践中往往需要区分子类型,例如根据目标、严重程度或合法性,这可能在个别内容上重叠。因此,研究者创建了数据集和机器学习模型,将文本数据中的仇恨言论分类视为多标签问题。本文首次系统全面地综述了英文文献中这一新兴研究领域的科学文献(N=46)。我们贡献了28个适合训练多标签分类模型的数据集的简要概述,揭示了标签集、大小、元概念、标注过程和标注者间一致性的显著异质性。对24篇提出合适分类模型的出版物的分析进一步揭示了评估不一致以及对双向编码表示变换器(BERT)和循环神经网络(RNN)的偏好。我们识别出训练数据不平衡、依赖众包平台、小而稀疏的数据集以及缺失方法学一致性为关键开放问题,并提出了十项研究建议。

英文摘要

The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners', i.e., in content moderation or law enforcement, and researchers' interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representations from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

2406.07277 2026-06-16 cs.CL cs.AI cs.MA 版本更新

Speaking Your Language: Spatial Relationships in Interpretable Emergent Communication

说出你的语言:可解释的涌现交流中的空间关系

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

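Affiliations * University of Southampton; The Alan Turing Institute; University of Brescia

AI summary This paper studies how agents communicate about spatial relationships, showing that they can develop a language that expresses relationships between parts of their observations with over 90% accuracy, and that this language is interpretable by humans.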

Comments Accepted at NeurIPS 2024. 18 pages, 3 figures

Journal ref
In Advances in Neural Information Processing Systems (Vol. 37, pp. 140113-140137) 2024
AI中文摘要

有效的交流需要能够参照观察中的特定部分相对于其他部分的能力。尽管涌现交流文献在开发各种语言属性方面取得成功,但尚未有研究展示出此类位置参照的出现。本文展示了智能体如何在观察中交流空间关系。结果表明,智能体可以发展出能够表达其观察部分之间关系的语言,在训练于需要此类交流的指称游戏中,准确率超过90%。使用词组测量方法,我们展示了智能体如何创建此类参照。此分析表明,智能体使用非组合性和组合性信息的混合来传达空间关系。我们还证明了涌现语言可被人类解读。通过与接收智能体交流测试翻译准确性,接收智能体使用该词典部分达到78%以上的准确率,证实了该涌现语言的解读成功。

英文摘要

Effective communication requires the ability to refer to specific parts of an observation in relation to others. While emergent communication literature shows success in developing various language properties, no research has shown the emergence of such positional references. This paper demonstrates how agents can communicate about spatial relationships within their observations. The results indicate that agents can develop a language capable of expressing the relationships between parts of their observation, achieving over 90% accuracy when trained in a referential game which requires such communication. Using a collocation measure, we demonstrate how the agents create such references. This analysis suggests that agents use a mixture of non-compositional and compositional messages to convey spatial relationships. We also show that the emergent language is interpretable by humans. The translation accuracy is tested by communicating with the receiver agent, where the receiver achieves over 78% accuracy using parts of this lexicon, confirming that the interpretation of the emergent language was successful.

2408.14892 2026-06-16 cs.CL cs.SD eess.AS 版本更新

A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm

语义与语气特征在传达讽刺中的功能权衡

Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler

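AI summary By analysing the acoustic features of utterances across sarcasm categories, the study finds that prosodic cues matter less when the sarcastic meaning is salient from the semantics and more when it is not, revealing a trade-off between semantic and prosodic cues in conveying sarcasm.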

Comments accepted at Interspeech 2024

AI中文摘要

本研究探讨了讽刺的声学特征,并分离了语句被用作讽刺的倾向与语气特征信号之间的相互作用。利用从电视节目中收集的讽刺语句数据集,我们分析了语句和关键短语的语气特征,这些短语属于三种不同的讽刺类别(嵌入式、命题式和施为式),它们在语义特征的强度上有所不同,并与中性表达进行比较。结果表明,在语义明显显示讽刺意义的短语中,语气特征不如语义不明显时重要,这表明在短语层面,讽刺的语气和语义特征之间存在权衡。这些发现突显了在语义密集的讽刺表达中对语气调节的依赖性降低,并揭示了塑造讽刺意图传达的细微互动。

英文摘要

This study investigates the acoustic features of sarcasm and disentangles the interplay between the propensity of an utterance being used sarcastically and the presence of prosodic cues signaling sarcasm. Using a dataset of sarcastic utterances compiled from television shows, we analyze the prosodic features within utterances and key phrases belonging to three distinct sarcasm categories (embedded, propositional, and illocutionary), which vary in the degree of semantic cues present, and compare them to neutral expressions. Results show that in phrases where the sarcastic meaning is salient from the semantics, the prosodic cues are less relevant than when the sarcastic meaning is not evident from the semantics, suggesting a trade-off between prosodic and semantic cues of sarcasm at the phrase level. These findings highlight a lessened reliance on prosodic modulation in semantically dense sarcastic expressions and a nuanced interaction that shapes the communication of sarcastic intent.

2402.16515 2026-06-16 cs.CL cs.CR 版本更新

LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification

基于知识蒸馏和分布导师的LLM隐私数据增强方法用于医学文本分类

Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang, Dongsheng Li

Affiliations * College of Science, National University of Defense Technology, Hunan, China; College of Computer, National University of Defense Technology, Hunan, China; The CoAI Group, DCST, BNRist, Tsinghua University, Beijing

AI summary This paper proposes a privacy-preserving data augmentation method that combines an LLM with knowledge distillation and uses a distribution tutor to control the generated distribution, improving privacy protection and performance for medical text classification.

Journal ref
Neural Networks, Vol. 199, 2026, 108668
AI中文摘要

由于充足的数据不总是公开可得,研究者利用有限数据和先进学习算法,或通过数据增强(DA)扩展数据集。在私有领域进行DA需要隐私保护方法(即匿名化和扰动),但这些方法无法提供保护保证。差分隐私(DP)学习方法理论上界定了保护,但不擅长生成大规模模型的伪文本样本。本文将DP-based伪样本生成任务转移到DP-based生成样本鉴别任务,提出一种结合LLM和DP-based鉴别器的DA方法,用于私有领域文本分类。我们构建了一个知识蒸馏模型作为DP-based鉴别器:教师模型访问私有数据,指导学生如何选择私有样本,通过校准噪声实现DP。为约束DA生成的分布,我们提出一个DP-based导师,建模噪声私有分布,并通过低隐私成本控制样本生成。我们理论分析了模型的隐私保护,并通过实验证证了模型。

英文摘要

As sufficient data are not always publicly accessible for model training, researchers exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in private domains requires privacy protection approaches (i.e., anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not skilled at generating pseudo text samples with large models. In this paper, we recast the DP-based pseudo-sample generation task as a DP-based generated-sample discrimination task, and propose a DP-based DA method with an LLM and a DP-based discriminator for text classification on private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, which access private data, teach students how to select private samples with calibrated noise to achieve DP. To constrain the distribution of DA's generation, we propose a DP-based tutor that models the noised private distribution and controls sample generation with a low privacy cost. We theoretically analyze our model's privacy protection and empirically verify our model.
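
The phrase "teach students how to select private samples with calibrated noise" suggests a noisy aggregation of teacher decisions. The sketch below uses a PATE-style noisy vote with Laplace noise as a stand-in. This mechanism is an assumption chosen for illustration, since the abstract does not describe the paper's exact procedure, and the function names and the `scale` parameter are hypothetical.

```python
import random
from typing import List


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_keep(votes: List[bool], scale: float, rng: random.Random) -> bool:
    """Decide whether to keep a generated sample from noisy teacher vote counts.

    votes[i] is True when teacher i (trained on private data) accepts the sample.
    Noise is added to both counts, so the student only sees a perturbed decision.
    """
    keep_count = sum(votes) + laplace(scale, rng)
    drop_count = (len(votes) - sum(votes)) + laplace(scale, rng)
    return keep_count > drop_count


rng = random.Random(0)
clear_accept = noisy_keep([True] * 20, scale=0.01, rng=rng)
clear_reject = noisy_keep([False] * 20, scale=0.01, rng=rng)
```

A larger `scale` means more noise and a stronger privacy guarantee, at the cost of more flipped decisions when the teachers are split. The tiny scale used here only keeps the demonstration stable and would offer little privacy in practice.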