arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27306 2026-05-27 cs.LG

Normal Guidance is what Attention Needs

Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes

AI summary: Proposes Normal Guidance, a regularization technique that lets attention-based multiple instance learning methods surpass existing approaches at slice-level localization in 3D medical images while preserving whole-scan classification performance.

Abstract

We consider training classifiers for 3D medical images using only one binary label for the entire volume rather than a label for each 2D slice. In such weakly supervised settings, can we learn accurate classifiers for slice-level predictions? Attention-based multiple instance learning (MIL) can produce an attention score for every slice. Yet recent work demonstrates that a simple center-focused baseline that ignores image content can outperform attention-based and transformer-based MIL at slice-level classification of 3D brain scans. We show this baseline also outperforms existing MIL at slice-level classification of thoracic and abdominal CT scans. Motivated by this baseline, we propose Normal Guidance, a regularization technique that encourages the learned attention distribution to follow a bell-shaped curve. Across three medical imaging datasets totaling over 4 million 2D slices, we show our Normal Guidance enables attention-based and transformer-based MIL methods to deliver significantly better slice-level localization than the state-of-the-art while remaining competitive at whole-scan classification.
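
One way to picture a regularizer that pulls attention toward a bell-shaped curve is as a divergence between the attention weights over slice positions and a discretized Gaussian. The sketch below is an illustration under assumptions, not the authors' formulation: the KL form, the default centre `mu`, the default width `sigma`, and all function names are hypothetical.

```python
import math

def gaussian_prior(num_slices, mu=None, sigma=None):
    """Discretized bell curve over slice indices, normalized to sum to 1.
    The defaults (centre slice, one sixth of the depth) are hypothetical."""
    if mu is None:
        mu = (num_slices - 1) / 2.0
    if sigma is None:
        sigma = max(num_slices / 6.0, 1e-6)
    weights = [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(num_slices)]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(scores):
    """Turn raw per-slice attention scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bell_penalty(attention, prior, eps=1e-12):
    """KL(attention || prior): zero when attention equals the bell curve."""
    return sum(a * math.log((a + eps) / (p + eps)) for a, p in zip(attention, prior))

prior = gaussian_prior(9)
flat = softmax([0.0] * 9)                 # uniform attention over 9 slices
peaked_edge = softmax([8.0] + [0.0] * 8)  # attention piled on the first slice

print("matches prior:", bell_penalty(prior, prior))
print("flat:", bell_penalty(flat, prior))
print("edge-peaked:", bell_penalty(peaked_edge, prior))
```

In a training loop, a term like this would presumably be added to the scan-level classification loss with a weighting coefficient; that coefficient and how the curve's centre and width are chosen are not specified in the abstract.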

2605.27304 2026-05-27 cs.CV

PlayClass: Automated Play Behaviour Classification in Poultry

Prince Ravi Leow, Neil Scheidwasser, Rebecca Oscarsson, Per Jensen, Samir Bhatt, David Alejandro Duchêne

AI summary: Proposes PlayClass, a pipeline that uses SAM 3 long-duration tracking and the V-JEPA 2.1 foundation model to classify poultry play behaviour automatically from top-down video, reaching a macro-averaged F1 of 77.0.

Comments
Accepted at CV4Animals Workshop @ CVPR 2026

Abstract

Automated monitoring of animal welfare has largely targeted negative indicators, leaving positive welfare behaviours such as play underexplored. To address this gap, we present PlayClass, a pipeline for play-behaviour classification in poultry from top-down pen video. The pipeline leverages long-duration tracking with SAM 3 via YOLO-guided chunk boundaries to minimise identity errors in point-based prompting, and frozen embeddings from image and video foundation models for play action classification. Although handcrafted motion features from tracked masks alone achieved competitive accuracy, V-JEPA 2.1 consistently outperformed all other backbones across model scales, reaching 77.0 macro-averaged F$_1$ when combined with handcrafted features. Despite this result, the dataset remains challenging because of inter-bird occlusion and because play sub-types share similar kinematic profiles with non-play behaviour. Overall, our work provides encouraging evidence towards automated frameworks for play behaviour classification in poultry.

2605.27299 2026-05-27 cs.CR cs.AI cs.HC cs.LG cs.SY eess.SY

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models

Murat Moran

AI summary: Proposes an alert prioritization framework based on subnormal Gaussian fuzzy numbers that models three sources of uncertainty (threat severity, detection confidence, and organizational risk attitude) and uses ranking indices to make security posture tunable; experiments show greater robustness than baselines under detector degradation.

Abstract

Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events. We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.
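
The alert representation described above can be sketched as a Gaussian membership function whose peak is scaled below 1. The membership form follows the usual definition of a subnormal fuzzy number; the ranking index `rank_score`, its `risk_attitude` parameter, and the sample alerts are hypothetical stand-ins, since the abstract does not state which ranking indices the paper uses.

```python
import math
from dataclasses import dataclass

@dataclass
class FuzzyAlert:
    name: str
    core: float    # severity estimate (location of the peak)
    spread: float  # uncertainty about the severity
    height: float  # detection reliability in (0, 1]; below 1 means subnormal

    def membership(self, x):
        """Gaussian membership scaled so that its peak equals `height`."""
        return self.height * math.exp(-0.5 * ((x - self.core) / self.spread) ** 2)

def rank_score(alert, risk_attitude):
    """Hypothetical ranking index, not taken from the paper.
    A positive risk_attitude treats uncertain alerts as potentially worse."""
    return alert.height * (alert.core + risk_attitude * alert.spread)

alerts = [
    FuzzyAlert("port-scan", core=7.0, spread=0.5, height=0.9),
    FuzzyAlert("odd-login", core=6.0, spread=2.0, height=0.9),
]

def prioritize(items, risk_attitude):
    """Alert names ordered from highest to lowest score."""
    ranked = sorted(items, key=lambda a: rank_score(a, risk_attitude), reverse=True)
    return [a.name for a in ranked]

print(prioritize(alerts, risk_attitude=0.0))
print(prioritize(alerts, risk_attitude=1.0))
```

With these made-up numbers, raising `risk_attitude` moves the more uncertain alert ahead of the more certain one, which is the kind of tunable posture the abstract describes.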

2605.27298 2026-05-27 cs.CL

Self-Ensembling Vision-Language Models for Chart Data Extraction

Thomas Berkane, Qianyi Wang, Maimuna S. Majumder

AI summary: Proposes a self-ensembling method that samples the same VLM's output multiple times and aggregates at the table-cell level to improve chart data extraction accuracy, and introduces a new benchmark, WB-ChartExtract.

Abstract

Charts effectively convey quantitative information, but the underlying data are often locked in image form, hindering reuse and analysis. Manually digitizing charts is time-consuming and error-prone, motivating automatic chart-to-table extraction. Recent approaches use specialized vision-language models (VLMs), yet performance still lags on charts with many datapoints or substantial stylistic variation. We propose a VLM self-ensembling method that repeatedly samples multiple tabular outputs from the same VLM for a fixed chart image and aggregates them at the level of individual table cells. We align candidate tables and take per-cell medians over numerical values to produce a more accurate consensus table. Our method also includes convergence detection to stop sampling once the aggregated table stabilizes, and uncertainty estimation based on dispersion across samples to help users assess extraction reliability. Because existing chart extraction benchmarks contain relatively simple plots with limited room for improvement, we introduce WB-ChartExtract, a new benchmark built from World Bank data with more complex and stylistically diverse charts; on average, its charts contain 7 times more datapoints than those in the ChartQA benchmark. Across both ChartQA and WB-ChartExtract, our approach improves extraction accuracy over single-pass VLM outputs, yielding up to 23% relative improvement on WB-ChartExtract after ensembling. More broadly, our method helps unlock tabular data previously siloed in chart images, enabling downstream analysis and reuse.
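
The aggregation loop described above (per-cell medians, a stopping rule, and a dispersion estimate) can be sketched as follows. This is a minimal illustration under assumptions: tables are taken to be already aligned and keyed by `(row, column)`, the stopping rule with `patience` and `tol` is a hypothetical choice, and `sample_fn` stands in for one sampled VLM extraction; none of these names come from the paper.

```python
import statistics

def consensus(samples):
    """Per-cell median and population standard deviation over aligned tables.
    Each sample maps a (row, column) key to a numeric value."""
    cells = {}
    for table in samples:
        for key, value in table.items():
            cells.setdefault(key, []).append(value)
    medians = {k: statistics.median(v) for k, v in cells.items()}
    spread = {k: statistics.pstdev(v) for k, v in cells.items()}
    return medians, spread

def ensemble(sample_fn, max_samples=16, patience=2, tol=1e-9):
    """Draw samples until the consensus table is unchanged for `patience` rounds."""
    samples, previous, stable = [], None, 0
    medians, spread = {}, {}
    for _ in range(max_samples):
        samples.append(sample_fn())
        medians, spread = consensus(samples)
        unchanged = (
            previous is not None
            and medians.keys() == previous.keys()
            and all(abs(medians[k] - previous[k]) <= tol for k in medians)
        )
        stable = stable + 1 if unchanged else 0
        if stable >= patience:
            break
        previous = medians
    return medians, spread, len(samples)

# Canned draws standing in for repeated VLM outputs; one draw misreads 12 as 21.
good = {("2020", "gdp"): 10.0, ("2021", "gdp"): 12.0}
bad = {("2020", "gdp"): 10.0, ("2021", "gdp"): 21.0}
draws = iter([good, bad, good, good, good, good])

table, dispersion, used = ensemble(lambda: next(draws))
print(table, dispersion, used)
```

The median absorbs the single misread cell, while the non-zero dispersion on that cell flags it as less reliable than the cell every draw agreed on.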

2605.27296 2026-05-27 cs.CL

Probing Cultural Awareness in LLMs: A Case Study of Cross-Culture Aesthetic Stylistics

Jiashuo Wang, Fenggang Yu, Jian Wang, Chak Tou Leong, Xiaoyu Shen, Chunpu Xu, Jiawen Duan, Wenjie Li, Johan F. Hoorn

AI summary: Builds the C4STYLI benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland to evaluate how well LLMs recognize and generate cross-cultural aesthetic stylistics, finding that models rely on surface-level linguistic information rather than stylistic structure and show limited sensitivity to Hong Kong-specific stylistic structure.

Comments
IJCAI 2026 Human-Centred AI track

Abstract

Large Language Models (LLMs) are increasingly deployed in diverse cultural contexts, yet their ability to master aesthetic stylistics, i.e., the strategic use of language to evoke cultural resonance, remains underexplored. We curate C4STYLI, a benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland, to evaluate LLMs via the lens of behavioral recognition and productive competence. Extensive evaluations show that LLMs differ from humans in stylistic recognition, and this recognition ability varies across text domains. In addition, stylistic recognition and generation performance in LLMs are not consistently aligned. To further examine whether LLMs genuinely capture stylistic information in stylistic recognition, we conduct structural ablation with logistic regression probes. We find that, in the Hong Kong setting, stylistic recognition in LLMs relies primarily on surface-level linguistic information rather than stylistic structure. This suggests limited sensitivity to Hong Kong-specific stylistic structure.

2605.27295 2026-05-27 cs.CV

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini

AI summary: Proposes Gemini Embedding 2, a native multimodal embedding model that unifies the representation space for video, audio, image, and text through multi-task, multi-stage contrastive learning, reaching state-of-the-art performance on unimodal, cross-modal, and multimodal retrieval tasks.

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities that generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model demonstrates strong performance (with a score of 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual and 84.0 on MTEB Code) across a variety of tasks surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation and search. Furthermore, its robust zero-shot performance across distinct fields - from astronomy and bioscience to fine arts and the culinary arts - establishes it as a highly reliable, out-of-the-box representation even for specialized domains.

2605.27294 2026-05-27 cs.CL cs.IR

Separating Semantic Competition from Context Length in RAG Reading

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar, Cien Zhang, Svetlana Karslioglu, Akash Vishwakarma

AI summary: Uses a matched-control experiment to isolate the semantic competition effect on readers in retrieval-augmented generation, showing that part of the performance drop stems from competition rather than context length alone.

Comments
4 pages, 1 figure, 2 tables

Abstract

Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. This passage-reading model is called the reader. Does it fail simply because the context is longer or because the other passages genuinely compete with the correct one? We introduce and demonstrate a matched-control protocol for RAG reading: we keep the number and length of passages fixed, but replace hard competitors with less competitive real passages. We apply this control across two compact open models on SQuAD. This replacement partially restores performance, with the strongest effects on F1 and answer inclusion. For Phi-2, this recovers +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1. For Qwen2.5-1.5B, it recovers +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1. To track how performance changes as competitors accumulate, we also report retention curves and summarize them with a right-censored half-life when the curves do not cross half-retention. Together, these results show the protocol isolates a competition effect distinct from context length, though the effect is clearer for F1 and answer inclusion than for exact match, and also varies with snippet length.
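
The retention-curve summary mentioned above can be sketched as follows. The definitions here are assumptions for illustration: retention is taken as the score relative to the no-competitor score, the half-life is found by linear interpolation between measured points, and a curve that never reaches half-retention is reported as right-censored at the largest competitor count tested. The scores are made up and are not results from the paper.

```python
def retention_curve(scores):
    """Scores relative to the score with no competitors (index 0)."""
    base = scores[0]
    return [s / base for s in scores]

def half_life(competitor_counts, scores):
    """Return (half_life, censored).
    censored=True means retention stayed above 0.5 over the tested range,
    so the true half-life is only known to exceed the returned value."""
    curve = retention_curve(scores)
    points = list(zip(competitor_counts, curve))
    for (k0, r0), (k1, r1) in zip(points, points[1:]):
        if r0 > 0.5 >= r1:
            # Linear interpolation between the two points that bracket 0.5.
            return k0 + (r0 - 0.5) * (k1 - k0) / (r0 - r1), False
    return competitor_counts[-1], True

counts = [0, 1, 2, 4, 8]
steep = [0.80, 0.70, 0.60, 0.36, 0.20]    # hypothetical F1 with hard competitors
shallow = [0.80, 0.76, 0.72, 0.68, 0.60]  # hypothetical F1 with matched controls

print(half_life(counts, steep))
print(half_life(counts, shallow))
```

Reporting the censoring flag alongside the number keeps a curve that never crosses half-retention from being mistaken for one that crosses exactly at the last tested point.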

2605.27293 2026-05-27 cs.LG stat.ML

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi

AI summary: Proposes BASIS, an algorithm that improves value function estimation through single-rollout sampling and within-batch information sharing, improving policy optimization while reducing computational overhead.

Comments
17 pages, 7 figures

Abstract

Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.

2605.27288 2026-05-27 cs.CL cs.AI cs.LG

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

AI summary: Proposes the MUSE framework, which distinguishes sycophantic conformity from uncertainty-driven conformity to explain why LLMs change their stance under user pushback, and finds that both kinds of conformity grow with the user's perceived expertise and the plausibility of the user's suggestions.

Abstract

Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.

2605.27287 2026-05-27 cs.CV

A Dynamic Programming Framework for Discovering Count and Values of Multilevel Image Thresholding

Eslam Hegazy, Mohamed Gabr

AI summary: Proposes an automatic multilevel thresholding method based on dynamic programming and a modified Minimum Error Thresholding criterion that determines the number of thresholds automatically; it is faster than traditional dynamic programming methods but yields slightly lower SSIM and PSNR.

Abstract

Multilevel image thresholding is an important preprocessing step in computer vision applications. Since most common thresholding methods take the desired count of thresholds as user input, thresholding methods that automatically determine a suitable count of thresholds from the input image itself are advantageous. This article presents a novel thresholding method based on a dynamic programming algorithm and a modification of the Minimum Error Thresholding (MET) criterion. An empirical statistical study is performed to pinpoint why the proposed method is superior. Moreover, an extended comparison between the proposed method and other state-of-the-art methods is performed on a comprehensive set of natural, satellite, and medical test images. The numerical results show that the proposed MET-DP method takes much less time than traditional dynamic programming thresholding methods when the number of thresholds is high. The proposed method can detect a suitable count of thresholds for most of the tested images of different types. However, traditional methods that take the count of thresholds as input produce thresholded images with higher structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) values than MET-DP. Source code is available at https://w3id.org/met-dp/article1-code
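
For readers unfamiliar with dynamic programming over a histogram, the sketch below shows the traditional variant that the abstract compares against: the class count is given as input, and the cost is the within-class sum of squared deviations (an Otsu-style criterion). It is not MET-DP; the modified MET criterion and the automatic choice of the count are not reproduced here, and the function name and sample histogram are hypothetical.

```python
def dp_thresholds(hist, num_classes):
    """Split gray levels 0..L-1 into contiguous classes with minimum total
    within-class squared deviation. Returns the last level of every class
    except the final one."""
    levels = len(hist)
    if not 1 <= num_classes <= levels:
        raise ValueError("num_classes must be between 1 and len(hist)")
    # Prefix sums of counts, count*level, and count*level^2.
    p0, p1, p2 = [0.0], [0.0], [0.0]
    for g, c in enumerate(hist):
        p0.append(p0[-1] + c)
        p1.append(p1[-1] + c * g)
        p2.append(p2[-1] + c * g * g)

    def cost(i, j):
        """Squared deviation of the class covering levels i..j inclusive."""
        n = p0[j + 1] - p0[i]
        if n == 0:
            return 0.0
        s = p1[j + 1] - p1[i]
        return (p2[j + 1] - p2[i]) - s * s / n

    inf = float("inf")
    # best[k][j]: minimum cost of covering levels 0..j with k classes.
    best = [[inf] * levels for _ in range(num_classes + 1)]
    start = [[-1] * levels for _ in range(num_classes + 1)]
    for j in range(levels):
        best[1][j] = cost(0, j)
    for k in range(2, num_classes + 1):
        for j in range(k - 1, levels):
            for i in range(k - 1, j + 1):  # class k covers levels i..j
                candidate = best[k - 1][i - 1] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    start[k][j] = i
    # Walk back from the last class to recover the class boundaries.
    thresholds, j = [], levels - 1
    for k in range(num_classes, 1, -1):
        i = start[k][j]
        thresholds.append(i - 1)
        j = i - 1
    return sorted(thresholds)

# Toy histogram with three well-separated groups of gray levels.
hist = [4, 4, 0, 0, 6, 6, 0, 0, 0, 3, 3, 0]
cuts = dp_thresholds(hist, 3)
print(cuts)
```

The triple loop costs O(K L^2) for K classes and L gray levels, which is why running time grows with the threshold count in this traditional formulation.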

2605.27284 2026-05-27 cs.RO cs.AI

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

AI summary: Proposes the FineVLA framework, which builds a fine-grained dataset and training strategy to achieve fine-grained controllability of robot actions while maintaining task success rates.

Comments
26 pages, 7 figures, 25 tables

Abstract

Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 10,816 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18)--factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/

2605.27281 2026-05-27 cs.LG stat.ML

Causal Risk Minimization for High-Dimensional Treatments

Nikita Dhawan, Arnav Paruthi, Andrew Kim, Lovedeep Gondara, Jekaterina Novikova, Chris J. Maddison

AI summary: For causal inference over high-dimensional treatment spaces such as text, proposes decomposing causal error into a series of moment-balancing errors and optimizing higher-order balance objectives, together with a method for projecting high-dimensional treatments onto lower-dimensional attributes, enabling causal estimation without attribute-specific training.

Comments
18 pages, 4 figures

Abstract

Predicting the effect of interventions with many possible variations, e.g., therapeutic content that affects mental health outcomes or an earnings call transcript that drives movement in share price, is useful across several domains. However, classical causal estimators tend to assume that all possible interventions are observed, which is infeasible when interventions vary widely, for instance, in the space of all text strings. We adapt a well-known approach of recasting causal inference as a learning problem, to address high-dimensional treatment spaces. Specifically, under standard assumptions like no unobserved confounding, we show that causal error decomposes into a series of moment-balancing errors of increasing order, and design objectives that directly improve causal estimation. We also show how to project the effect of a high-dimensional treatment onto lower-dimensional treatment attributes, which allows a single model to answer several causal questions without additional attribute-specific training. We empirically evaluate our estimators in settings with high-dimensional continuous, discrete, and text treatments, the last of which used a semi-synthetic dataset of Amazon Reviews. Our experiments demonstrate the benefit of higher-order balance error optimization and competitive performance of projected causal estimates with attribute-specific estimators.

2605.27269 2026-05-27 cs.LG stat.AP

Transfer Learning using 66 Diseases for Disease Forecasting Applications

Lauren J Beesley, Alexander C Murph, Dave Osthus, Lauren A Castro

AI summary: Integrates 66 infectious diseases and multiple data streams through transfer learning, finding that adding other data streams improves forecasting performance in most cases but that data quality is critical, and compiles a public database.

Abstract

Disease forecasting models typically rely on a single data stream, making models brittle when histories are short or noisy. Recent top-performing models have shown that synthesizing multiple reporting systems for the same disease improves performance. Other recent work takes this idea a step further, using transfer learning to train a forecasting model for one disease using data from a different disease. We expand upon each of these approaches greatly, training machine learning models on data that span 66 infectious diseases and several data streams. We investigate the value of incorporating different data streams for forecasting 20 different disease data streams. We find that incorporating other data streams improves forecasting in the vast majority (84.9%) of time series and model structures considered. However, our work highlights that the quality of the added data matters, where adding data extremely different from the target data stream can sometimes degrade forecast performance. A major contribution of this work is in compiling a publicly-available database of data for use by the infectious disease forecasting community.

2605.27268 2026-05-27 cs.CL cs.AI

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

Samer Awad, Javier Conde, Carlos Arriaga, Tairan Fu, Javier Coronado-Blázquez, Pedro Reviriego

AI summary: Proposes the Word Coverage Score (WCS), a metric that quantifies how standard sampling filters (e.g., Top-p, Top-k, Min-p) suppress the survival rate of low-frequency, high-information words, revealing the effect of decoding mechanics on linguistic diversity.

Comments
15 pages, 6 figures

Abstract

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
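
The pruning effect described above can be illustrated by applying the three filters to a next-token distribution and checking whether the word a human actually wrote survives. The filter definitions follow common usage (Top-k keeps the k most probable tokens, Top-p keeps the smallest top-ranked set whose cumulative probability reaches p, Min-p keeps tokens at or above a fraction of the top probability). The `coverage` function is a simplified, hypothetical stand-in for the WCS, and the distributions are invented.

```python
def top_k_keep(probs, k):
    """Tokens that survive Top-k filtering."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])

def top_p_keep(probs, p):
    """Smallest set of top-ranked tokens whose cumulative probability reaches p."""
    kept, cumulative = set(), 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.add(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept

def min_p_keep(probs, min_p):
    """Tokens whose probability is at least min_p times the top probability."""
    floor = min_p * max(probs.values())
    return {t for t, pr in probs.items() if pr >= floor}

def coverage(steps, keep_fn):
    """Fraction of (distribution, human_token) pairs where the human token
    is still reachable after filtering. Hypothetical simplification of WCS."""
    hits = sum(1 for probs, human in steps if human in keep_fn(probs))
    return hits / len(steps)

# Invented next-token distributions paired with the word a human chose.
steps = [
    ({"the": 0.50, "a": 0.30, "this": 0.16, "yon": 0.04}, "yon"),
    ({"bright": 0.40, "dark": 0.35, "luminous": 0.25}, "luminous"),
    ({"ran": 0.60, "walked": 0.25, "sauntered": 0.15}, "ran"),
]

print("top-k=2  :", coverage(steps, lambda pr: top_k_keep(pr, 2)))
print("top-p=0.9:", coverage(steps, lambda pr: top_p_keep(pr, 0.9)))
print("min-p=0.1:", coverage(steps, lambda pr: min_p_keep(pr, 0.1)))
```

A token removed by the filter has zero probability of being sampled regardless of temperature, which is the sense in which a word can sit inside the model's probability space yet be unreachable.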

2605.27259 2026-05-27 cs.LG

Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning

Sridhar Mahadevan

AI summary: Proposes Kan Extension Transformers (KETs) as a unifying categorical framework for a range of Transformer implementations, viewing a Transformer layer as a weighted structured extension operator and obtaining a valid self-conditioning mechanism through the predict-detach regime; experiments show that the predict-detach regime yields larger gains than changing the neighborhood family.

Comments
30 pages

Abstract

We propose Kan Extension Transformers (KETs) as a unifying categorical framework for a diverse group of Transformer implementations. The core claim is that a Transformer layer can be viewed as a weighted structured extension operator: standard attention is the singleton-neighborhood case, Geometric Transformer style incidence mixing is a sparse edge-restricted case, and KET is the higher-order simplicial case. This lens also clarifies a bridge to diffusion-style completion. When the extension operator acts on detached predictive carriers instead of teacher-forced hidden states, it becomes a valid self-conditioning mechanism that exposes noncausal structure without leaking gold future tokens. We include a comprehensive experimental validation of 12 different Transformer implementations varying across strict-causal and predict-detach regimes on Penn Treebank, WikiText-2, and WikiText-103. In the strict-causal setting, quadratic KET is the strongest model among the compared causal architectures on WikiText-2 and WikiText-103. Across all datasets, however, the largest gains come from the predict-detach regime rather than from changing the neighborhood family alone.

2605.27254 2026-05-27 cs.LG cs.AI

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, Isaac Triguero

AI summary: Addresses context selection for tabular foundation models in low-label settings by proposing LUCoS, which uses the latent geometry of an unsupervised Prior-Fitted Network (PFN) to select representative medoids as context, outperforming random selection and original-space methods on 67 datasets.

Comments
18 pages, 4 figures, supplementary appendices included

Abstract

Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, context selection directly determines predictive performance. Supervised oracle experiments show that carefully chosen labeled context sets can strongly outperform random selection under the same labeling budget. However, the cold-start setting, where instances must be selected before any labels are available, has received little attention in the TFM literature. This problem is fundamentally geometric. In vision and language, foundation models induce embedding spaces where simple geometric selection methods are effective. In contrast, tabular instance selection has so far been performed predominantly in the original tabular space, which lacks a natural metric; heterogeneous types, mixed scales, and nonlinear interactions make raw-space distances unreliable for context construction, and original-space selection falls below random on the majority of datasets as the budget grows. We propose LUCoS (Latent Unsupervised Context Selection), which replaces raw-feature geometry with the latent geometry induced by embeddings from an unsupervised Prior-Fitted Network (PFN) and selects representative medoids as context. Evaluated on 67 OpenML-CC18 datasets across six low-label budgets, LUCoS ranks first under mean AUC, ACC, and F1, with conclusions stable across metrics and dataset-level robustness checks. A gain decomposition reveals a simple mechanism: at the smallest budgets, the main benefit comes from enforcing coverage; as the budget increases, the decisive factor becomes the representation space in which coverage is measured. LUCoS mitigates failures of original feature space selection, showing that reliable unsupervised context selection depends less on selector sophistication than on defining representativeness in a meaningful representation geometry.
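
To make the idea of selecting representative medoids concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes latent embeddings are already available as a NumPy array (how LUCoS obtains them from the unsupervised PFN is not shown), and it uses a plain farthest-first initialisation followed by alternating k-medoids refinement, which is one common way to pick medoids. The function name `select_medoids` and its parameters are hypothetical.

```python
import numpy as np

def select_medoids(embeddings: np.ndarray, budget: int, n_iter: int = 20) -> np.ndarray:
    """Return sorted indices of `budget` medoids chosen in the embedding space."""
    # Pairwise Euclidean distances between all embedded instances.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Farthest-first initialisation: start from the most central point,
    # then repeatedly add the point farthest from the current selection.
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < budget:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(nearest.argmax()))
    medoids = np.array(medoids)

    # Alternating refinement: assign points to the nearest medoid, then move
    # each medoid to the member minimising total within-cluster distance.
    for _ in range(n_iter):
        assign = dist[:, medoids].argmin(axis=1)
        updated = medoids.copy()
        for c in range(budget):
            members = np.flatnonzero(assign == c)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=1)
                updated[c] = members[within.argmin()]
        if np.array_equal(updated, medoids):
            break
        medoids = updated
    return np.sort(medoids)
```

The returned indices would be the instances sent for labeling and then used as context. Swapping `embeddings` for raw features gives the original-space variant that the abstract reports as unreliable.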

2605.27249 2026-05-27 cs.AI cs.CL

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

Hunter McNichols, Alexander Scarlatos, Mihai Dascalu, Danielle McNamara, Andrew Lan

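AI summary: Proposes the Gumbel Machine, a modular method that uses the β-Hindsight control decoding algorithm to generate counterfactual texts that are both rubric-consistent and similar to the student's original text.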

Comments
preprint
Abstract

An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $β$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.
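
The abstract does not spell out β-Hindsight control, so the sketch below illustrates only the general building block that the title points to: the Gumbel-max trick, where a token is sampled as the argmax of logits plus Gumbel noise. It also shows why reusing the same noise under slightly different logits tends to keep outputs similar. All function names are hypothetical, and how the paper's β parameter enters is not modeled here.

```python
import numpy as np

def gumbel_noise(rng: np.random.Generator, size) -> np.ndarray:
    """Draw standard Gumbel(0, 1) noise by inverse-transforming uniforms."""
    u = rng.uniform(low=1e-12, high=1.0, size=size)
    return -np.log(-np.log(u))

def gumbel_argmax(logits: np.ndarray, noise: np.ndarray) -> int:
    """Gumbel-max trick: argmax(logits + noise) is a sample from softmax(logits)."""
    return int(np.argmax(logits + noise))

def agreement_rate(logits_a, logits_b, rng, n_draws: int, share_noise: bool) -> float:
    """Fraction of draws where both distributions yield the same token."""
    hits = 0
    for _ in range(n_draws):
        noise_a = gumbel_noise(rng, logits_a.shape)
        # Sharing the noise couples the two samples; fresh noise decouples them.
        noise_b = noise_a if share_noise else gumbel_noise(rng, logits_b.shape)
        hits += gumbel_argmax(logits_a, noise_a) == gumbel_argmax(logits_b, noise_b)
    return hits / n_draws
```

With shared noise, the "counterfactual" logits change the sampled token only when the shift is large enough to overturn the noisy ranking. This is one plausible reading of how latent randomness can act as a similarity control.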

2605.27246 2026-05-27 cs.LO cs.AI math.LO

Many Logics, One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)

Christoph Benzmüller, Daniel Kirchner, Luca Pasetto

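AI summary: Building on the LogiKEy logic-pluralistic knowledge representation and reasoning methodology, argues for supporting logical pluralism at the object-logic level within a unifying meta-logical framework, and warns that logical imperialism impedes interdisciplinary reuse.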

Comments
21 pages, 6 figures; to appear (preprint)
Abstract

This position statement looks back on two decades of work on shallow embeddings of non-classical logics in classical higher-order logic (HOL), a line of research that expanded into a range of logic embeddings in HOL and inspired the LogiKEy logic-pluralistic knowledge representation and reasoning methodology. This paper advances the case for logical pluralism at object-logic level within a unifying meta-logical framework such as LogiKEy, grounding the argument in computational metaphysics. More broadly, it advocates principled support for logical pluralism in modern proof assistants, and cautions against logical imperialism -- the rigid adoption of a single foundational logic for large-scale theory developments -- which impedes the interdisciplinary reuse that LogiKEy is designed to enable.

2605.27245 2026-05-27 cs.LG

Symbolic Regression via Latent Iterative Refinement

Xieting Chu, Sriram Vishwanath, Vijay Ganesh

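AI summary: Proposes the Latent Equation Embedding (LEE) framework, which narrows the inference gap in symbolic regression through iterative inference in a functionally grounded latent space, producing simpler yet accurate expressions.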

Comments
Preprint. 21 pages, 11 figures
Abstract

Symbolic regression (SR) seeks closed-form mathematical expressions that fit observed data. Neural SR methods amortize the search by training an encoder to map observations directly to expressions in a single pass, but this amortized inference leaves a residual amortization gap between its one-shot prediction and the true posterior. We propose Latent Equation Embedding (LEE), a framework that closes this gap through iterative amortized inference in a functionally grounded latent space. LEE learns a shared latent space Z equipped with three components: an encoder f_theta that jointly embeds symbolic tokens and numerical observations into a single latent vector z; an expression decoder g_expr that reconstructs formulas from z; and an evaluation decoder g_eval that predicts function values from z, explicitly grounding the latent space in functional behavior. At inference, LEE performs iterative refinement by re-encoding decoded expressions jointly with observations, progressively improving the latent estimate. LEE uses the encoder itself as a learned inference optimizer: each re-encoding step implicitly computes the mismatch between the candidate and the data. Because g_eval is differentiable in z, we additionally interleave continuous gradient descent with discrete re-encoding, yielding a hybrid iterative and gradient refinement procedure. On SRBench across three noise levels, against 19 baselines spanning genetic programming, symbolic-neural hybrids, and pre-trained Transformers, LEE produces expressions 2-10x simpler than the strongest accuracy-oriented baselines, including Operon, GP-GOMEA, TPSR, RAG-SR, and GenSR, with complexity 8-11 versus 20-90. These results advance the low-complexity region of the accuracy-complexity Pareto frontier and show graceful degradation as noise increases.

2605.27243 2026-05-27 cs.CV

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

Aaron Branson Cigres Li, Zhaowei Wang, Yu Zhao, Yiming Du, Haobo Li, Xiyu Ren, Ginny Wong, Simon See, Lishu Luo, Haodong Duan, Pasquale Minervini, Yangqiu Song

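AI summary: Proposes a multimodal retrieval head detection method and finds that only 4.4-10.2% of attention heads in vision-language models account for 50% of the positive retrieval score; these heads are critical for long-context reasoning and can be used directly for document retrieval to improve performance.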

Comments
Work in Progress
Abstract

Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.
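
The sparsity statistic quoted above (a small share of heads accounting for 50% of the positive retrieval-score mass) can be computed from per-head scores as in the sketch below. It assumes one retrieval score per head is already available; the paper's scoring method is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def fraction_of_heads_for_mass(scores, mass: float = 0.5) -> float:
    """Smallest fraction of heads whose positive scores cover `mass` of the total."""
    # Only positive scores contribute to the mass; negatives are clipped to zero.
    positive = np.clip(np.asarray(scores, dtype=float).ravel(), 0.0, None)
    total = positive.sum()
    if total == 0:
        raise ValueError("no positive retrieval score to accumulate")
    # Sort heads from strongest to weakest and accumulate their scores.
    cumulative = np.cumsum(np.sort(positive)[::-1])
    # First position where the running sum reaches the target mass.
    needed = int(np.searchsorted(cumulative, mass * total)) + 1
    return needed / positive.size
```

For example, `fraction_of_heads_for_mass([8, 1, 1, 0, -3, 0, 0, 0, 0, 0])` returns 0.1, because the single strongest head already covers more than half of the positive mass.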

2605.27240 2026-05-27 cs.CL

ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents

Xing Fu, Yulin Hu, Mengtong Ji, Haozhen Li, Yixin Sun, Weixiang Zhao, Yanyan Zhao, Bing Qin

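AI summary: Proposes ENPMR-Bench, a benchmark grounded in Maslow's hierarchy of needs that evaluates the ability of emotional support agents to proactively infer users' latent emotional needs and retrieve appropriate memories; experiments show that current retrieval paradigms have substantial deficiencies.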

Abstract

Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users' latent emotional needs is critical. However, existing research often treats memory as a tool for factual retrieval, overlooking its role in shaping users' emotional experiences. In this work, we introduce ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval (ENPMR), a core capability that enables agents to infer users' latent emotional needs and proactively retrieve appropriate memories to support empathetic interaction. Grounded in Maslow's hierarchy of needs, ENPMR-Bench includes over 1,800 memory-augmented dialogues and defines structured mappings between emotional needs and supportive memory types. Experimental results demonstrate that current retrieval paradigms, including both embedding-based and LLM-driven approaches, exhibit substantial deficiencies, with empathy scores significantly lagging behind golden memory conditions. While chain-of-thought prompting improves the alignment between inferred emotional needs and retrieved memories to some extent, a notable performance gap remains. Together, these findings reveal critical limitations in current agents and outline directions for advancing personalized emotional support through need-sensitive memory retrieval.

2605.27239 2026-05-27 cs.CL

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, Botshelo Kondowe, Letlhogonolo Mohleleng, Hareaipha Nkopo Letsoalo, Shamsuddeen Hassan Muhammad, Vukosi Marivate

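AI summary: Analyzing a Setswana sentiment dataset, finds that the main factor behind the decline in inter-annotator agreement over time is temporal simultaneity: items labeled at nearly the same time show high agreement, while items labeled far apart in time show low agreement.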

Abstract

Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Although the aggregate Randolph's free-marginal kappa is $κ = 0.76$, a level described as "excellent," per-batch $κ$ falls by more than 32 points over the course of the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $κ$ is temporal simultaneity: tweets labeled within one minute achieve $κ = 0.98$, while those labeled more than a day apart reach only $κ = 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $κ$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.
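
For readers unfamiliar with the agreement statistic, Randolph's free-marginal kappa compares the observed pairwise agreement P_o with a fixed chance level of 1/k for k categories: κ = (P_o - 1/k) / (1 - 1/k). The sketch below is a small illustration of that standard formula, not the authors' released analysis code; the function name and input layout are assumptions.

```python
import numpy as np

def randolph_kappa(labels, n_categories: int) -> float:
    """Free-marginal multirater kappa.

    labels: integer array of shape (n_items, n_raters) with category ids
            in the range [0, n_categories).
    """
    labels = np.asarray(labels)
    n_items, n_raters = labels.shape
    # counts[i, c] = number of raters who assigned category c to item i.
    counts = np.stack(
        [(labels == c).sum(axis=1) for c in range(n_categories)], axis=1
    )
    # Share of agreeing rater pairs per item, then averaged over items.
    per_item = (counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))
    observed = per_item.mean()
    # Free-marginal chance agreement: every category is equally likely.
    expected = 1.0 / n_categories
    return float((observed - expected) / (1.0 - expected))
```

With three raters and three classes, an item labeled identically by all raters contributes full agreement, while an item with a two-to-one split contributes one third. Computing the statistic per batch or per time-gap bucket would be one way to reproduce the kind of breakdown the abstract describes.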

2605.27236 2026-05-27 cs.LG physics.ao-ph

Explainable Comparison of Feature-Based and Deep Learning Models for TROPOMI Methane Plume Screening

Solomiia Kurchaba, Joannes D. Maasakkers, Berend J. Schuit, Ilse Aben

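AI summary: Compares feature-based (SVC, Random Forest, XGBoost) and image-based (ResNet-18, ResNet-34) models for methane plume-artifact classification, and uses SHAP explainability analysis to guide operational screening.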

Abstract

Continuous and global detection of large methane emissions is a crucial step for global warming mitigation. Satellite observations, such as those from S5P/TROPOMI, combined with plume detection algorithms, can play a key role in this effort. However, not all TROPOMI plume detections that look like methane emission plumes are the result of actual emissions. A significant share of the plume-like features in the data are retrieval artifacts. Such artifacts can result from variations in elevation or albedo gradients, high aerosol concentrations, coastlines, water bodies, and similar factors. Previous work addressed plume-artifact classification with a Support Vector Machine Classifier (SVC) trained on an extensive set of observation-based scalar features designed by domain experts. However, such an approach restricts the information available to the algorithm to what the experts deem important, breaks the spatial relationship between pixels, and loses information during statistical aggregation. In this study, we compare feature-based (SVC, Random Forest, XGBoost) and image-based (ResNet-18, ResNet-34) models for methane plume-artifact classification under balanced and imbalanced evaluation settings. To interpret the results, we apply SHAP-based explainability to both model families. Our findings provide practical guidance for model selection in operational methane-screening workflows such as the CAMS Methane Hotspot Explorer.

2605.27235 2026-05-27 cs.CV

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

Zhicong Tang, Zhao Zhang, Jingye Chen, Mohan Zhou, Yifan Pu, Yuchi Liu, Yalong Bai, Ethan Smith, Yuhui Yuan

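AI summary: Proposes MRT, a 20B-parameter masked region diffusion model that unifies text-to-layers, image-to-layers, and layers-to-layers tasks and introduces an overflow-aware canvas layer for efficient multi-layer transparent image generation and editing.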

Comments
CVPR 2026
Abstract

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks including text-to-layers, image-to-layers, and layers-to-layers within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100x faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layer inference.

2605.27220 2026-05-27 cs.CL cs.IR

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

Zafar Hussain, Kristoffer Nielbo

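AI summary: Through a case study of the Danish National Encyclopedia, finds that synthetic queries overestimate the need for LLM augmentation (the Coverage Illusion) and proposes a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when no results are returned, improving quality and reducing latency without training overhead.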

Abstract

In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.
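
The cascade described above can be sketched in a few lines. This is an illustrative outline under stated assumptions rather than the production system: each workflow is modeled as a hypothetical callable that returns a list of document ids, and the caller supplies the workflows already sorted from cheapest to most expensive.

```python
from typing import Callable, List, Sequence, Tuple

# A workflow is a (name, retrieve) pair; `retrieve` maps a query to document ids.
Workflow = Tuple[str, Callable[[str], List[str]]]

def cascade_retrieve(query: str, workflows: Sequence[Workflow]) -> Tuple[str, List[str]]:
    """Run workflows in cheapest-first order and stop at the first non-empty result."""
    for name, retrieve in workflows:
        documents = retrieve(query)
        if documents:
            # A cheaper step succeeded, so costlier steps (for example,
            # LLM-augmented retrieval) are never invoked for this query.
            return name, documents
    # Every step returned nothing; the caller decides how to handle this.
    return "none", []
```

Because the escalation signal is simply an empty result list, no router has to be trained or served, which matches the abstract's point that the need for augmentation only becomes visible after the index has been searched.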

2605.27219 2026-05-27 cs.LG stat.ML

Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis

Yamato Suetake, Yuta Kawakami, Shunnosuke Ikeda, Yuichi Takano

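AI summary: To address the high reconstruction risk of linear integration in collaborative analysis of decentralized confidential data and its inability to align nonlinear transformations, proposes nonlinear kernel integration (NKI), which obtains a globally optimal solution via kernel ridge regression and an eigenvalue problem and adds graph regularization and a centering constraint to capture geometric and target-variable information, improving accuracy and reducing reconstruction risk on image classification tasks.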

Comments
50 pages, 7 figures
Abstract

Collaborative analysis of decentralized confidential datasets is important, but direct sharing of original datasets is often restricted by privacy and institutional constraints. Data collaboration (DC) analysis transforms each dataset into privacy-preserving intermediate representations via party-specific obfuscation functions and integrates them into common collaboration representations using an anchor dataset. However, many existing DC analysis methods rely on linear transformations for data obfuscation and integration, which may increase reconstruction risk. Although nonlinear dimensionality reduction can mitigate this risk, conventional linear integration methods cannot accurately align intermediate representations produced by nonlinear transformations. Moreover, existing integration methods mainly minimize discrepancies among parties and do not explicitly incorporate geometric or target-variable information useful for downstream analysis. To overcome these limitations, we first formulate linear kernel integration (LKI) as a linear integration method and then kernelize it to obtain nonlinear kernel integration (NKI). NKI admits a globally optimal solution via kernel ridge regression and an eigenvalue problem. We also introduce graph regularization and a centering constraint so that the target representation can capture geometric and target-variable information useful for downstream analysis. Experiments on image classification tasks demonstrate that NKI improves classification accuracy over existing linear integration methods under nonlinear dimensionality reduction, with further gains from target-variable-aware graph regularization and centering. The results also show that dimensionality reduction choices substantially affect both classification accuracy and reconstruction risk.
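
As background for the kernel ridge regression component mentioned above, the sketch below fits a closed-form kernel ridge map from one party's intermediate representation of anchor data to a given target representation. It is a generic illustration, not the NKI algorithm: the RBF kernel choice, the function names, and the assumption that a target representation is already available are all hypothetical, and the eigenvalue problem, graph regularization, and centering constraint are not shown.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(x_anchor: np.ndarray, y_target: np.ndarray,
                     gamma: float = 1.0, ridge: float = 1e-3):
    """Return a function mapping new intermediate representations to the target space."""
    k = rbf_kernel(x_anchor, x_anchor, gamma)
    # Closed-form dual solution: alpha = (K + ridge * I)^{-1} Y.
    alpha = np.linalg.solve(k + ridge * np.eye(len(x_anchor)), y_target)

    def transform(x_new: np.ndarray) -> np.ndarray:
        return rbf_kernel(x_new, x_anchor, gamma) @ alpha

    return transform
```

Each party would fit its own map on the shared anchor data and then apply it to its private intermediate representations. Because the map is nonlinear in the inputs, it can align representations that a single linear transform cannot.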

2605.27210 2026-05-27 quant-ph cs.AI

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

Juan Cruz-Benito, Ismael Faro

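AI summary: Ports Microsoft's QuantumKatas quantum computing curriculum from Q# to Qiskit and builds an evaluation framework for systematically assessing large language models on quantum computing tasks.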

Abstract

We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover's, Simon's, Deutsch-Jozsa), error correction, key distribution, and quantum games. Each task includes a natural language prompt, canonical solution, and deterministic test verification via classical circuit simulation. By building on the QuantumKatas' proven pedagogical design rather than creating tasks from scratch, we inherit a principled difficulty progression and comprehensive concept coverage while contributing the framework adaptation, evaluation infrastructure, and empirical analysis. We evaluate 16 LLMs across 7 prompting configurations -- a total of 39,200 model runs -- to demonstrate the benchmark's utility. Three key findings emerge: (1) the benchmark effectively differentiates model capabilities, with best-configuration pass rates ranging from 32.3% to 83.1% and a 26.1 pp average gap between frontier and open-source models; (2) models perform well at implementing known algorithms (SimonsAlgorithm 82.1%, BasicGates 81.6%) but struggle with problem encoding (SolveSATWithGrover 34.4%, DistinguishUnitaries 40.0%); and (3) chain-of-thought prompting shows a modestly bimodal effect -- it is the best strategy for three models (two of them explicitly reasoning-tuned per vendor documentation) but degrades performance for the rest, leaving it mid-pack in aggregate (56.3% mean) behind few-shot-5 (57.8%). We release the benchmark, evaluation framework, and baseline results to support research on LLM capabilities in quantum computing.

2605.27209 2026-05-27 cs.AI

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

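AI summary: Proposes the NoisyAgent framework, which introduces user noise and tool noise during training to improve agent robustness and generalization in noisy real-world environments.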

Abstract

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.

2605.27205 2026-05-27 eess.IV cs.AI

TWIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins

Sige Liu, Kezhi Wang

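AI summary: Proposes the TWIST framework, which uses closed-loop token synchronization and mode-conditioned unequal error protection to achieve application-aware state synchronization for wireless digital twins under limited communication resources, improving traffic-state inference and reducing synchronization cost.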

Abstract

Wireless digital twins require repeated synchronization between a time-evolving physical scene and its digital counterpart under limited and time-varying communication resources. For perception-centric twins, pixel-domain transmission or uniformly protected bitstreams can be mismatched to the semantic state consumed by twin-side applications. This paper proposes TWIST, a closed-loop token synchronization framework for application-aware wireless digital twins. TWIST represents each physical observation as a token and synchronizes this state over a wireless link, rather than optimizing visual reconstruction. Token positions are grouped by task relevance and protected through mode-conditioned unequal error protection under low-, medium-, and high-synchronization modes. At the twin side, decoding confidence converts unreliable hard token decisions into erasures, which are restored by a completion model before updating the semantic twin state. The recovered state supports traffic-state inference and generates compact feedback statistics, including channel quality, receiver uncertainty, semantic drift, and application priority, for subsequent mode adaptation. Experiments on a dynamic road-scene digital-twin scenario show that TWIST improves traffic-state inference and semantic twin-state synchronization compared with fixed-mode and channel-only adaptation strategies, while reducing the average synchronization cost relative to always-high transmission.
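
The step in which decoding confidence turns unreliable hard token decisions into erasures can be illustrated with a simple threshold rule. This is a minimal sketch under stated assumptions: the sentinel value `ERASED`, the fixed threshold, and the function name are hypothetical, and the completion model that later restores erased positions is not shown.

```python
import numpy as np

ERASED = -1  # hypothetical sentinel marking a token position as erased

def mark_erasures(token_ids, confidence, threshold: float) -> np.ndarray:
    """Keep confidently decoded tokens; replace low-confidence ones with ERASED."""
    token_ids = np.asarray(token_ids)
    confidence = np.asarray(confidence, dtype=float)
    return np.where(confidence >= threshold, token_ids, ERASED)
```

For instance, `mark_erasures([5, 7, 9], [0.9, 0.2, 0.6], threshold=0.5)` keeps the first and third tokens and erases the second. Marking a doubtful token as missing, rather than trusting a possibly wrong hard decision, lets a downstream model fill it in from context.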

2605.27204 2026-05-27 cs.CL cs.IR

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

GraphReview: 基于LLM的图消息传递的科学论文评估

Pujun Zheng, Wanying Ren, Jiacheng Yao, Guoxiu He, Star X. Zhao

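AI summary: Proposes the GraphReview framework, which uses graph message passing to integrate a paper's intrinsic quality with its synchronic and diachronic links. LLMs generate node priors and edge-level comparative evidence, and Personalized PageRank combines them for quality ranking, decision prediction, and review generation, with an average improvement of 29.7% on decision and ranking metrics.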

AI Chinese abstract

科学论文评估通常不仅涉及评估稿件本身,还需要将其与同期研究和先前文献联系起来。然而,现有的基于LLM的方法通常分别建模这些信号,缺乏跨论文传播审稿证据的统一机制。我们提出$\textbf{GraphReview}$,一个基于图的LLM框架,将论文评估形式化为在语义论文图上进行审稿信号的消息传递。该图联合捕捉内在质量、同期论文之间的同步链接以及指向先前工作的历时链接。LLM用于估计节点级质量先验,并通过成对论文比较生成边级比较证据,而个性化PageRank整合审稿信号用于质量排序、决策预测和审稿生成。为了生成更高质量的图证据,我们提出了奖励诱导的最大似然目标来训练LLM骨干网络。实验表明,GraphReview始终优于最强基线,在决策和排序指标上平均提升29.7%,包括准确率提升23.7%,Spearman's $ρ$提升57.6%。它还生成更高质量的审稿文本,并在不同时间段和会议场所中有效泛化。代码可在 https://github.com/ECNU-Text-Computing/GraphReview 获取。

English abstract

Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's $ρ$. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at https://github.com/ECNU-Text-Computing/GraphReview.
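
To make the Personalized PageRank step concrete, here is a minimal, generic power-iteration sketch. It is not GraphReview's code: the abstract does not say how the graph is built or how edges are weighted, so the edge list, the edge direction (score flows from `src` to `dst`), the use of node priors as the restart distribution, and the damping value are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def personalized_pagerank(
    n: int,
    edges: Sequence[Tuple[int, int, float]],
    prior: Sequence[float],
    damping: float = 0.85,
    iters: int = 100,
) -> List[float]:
    """Power iteration for Personalized PageRank on a weighted directed graph.

    `edges` holds (src, dst, weight) triples; `prior` holds non-negative
    node scores that are normalized into the restart distribution.
    """
    total = float(sum(prior))
    restart = [x / total for x in prior]
    # Ignore non-positive weights so every used edge has a valid share.
    edges = [(s, d, w) for s, d, w in edges if w > 0]
    out_weight = [0.0] * n
    for s, _, w in edges:
        out_weight[s] += w

    rank = restart[:]
    for _ in range(iters):
        # Restart term: jump back to the prior with probability 1 - damping.
        nxt = [(1.0 - damping) * restart[i] for i in range(n)]
        # Propagate each node's score along its weighted outgoing edges.
        for s, d, w in edges:
            nxt[d] += damping * rank[s] * w / out_weight[s]
        # Nodes without outgoing edges hand their mass back to the prior.
        dangling = sum(rank[i] for i in range(n) if out_weight[i] == 0.0)
        for i in range(n):
            nxt[i] += damping * dangling * restart[i]
        rank = nxt
    return rank


# Hypothetical toy graph: papers 0 and 2 both lose a comparison to paper 1.
scores = personalized_pagerank(3, [(0, 1, 1.0), (2, 1, 1.0)], [1.0, 1.0, 1.0])
```

With uniform priors, paper 1 ends up with the highest score because both comparison edges point to it; non-uniform priors would shift the restart mass toward papers the node-level estimate already favors.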