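arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.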

2605.29952 2026-05-29 cs.LG

From Short Histories to Long Futures: Horizon-Aware Graph Neural Networks for Long Horizon Forecasting

Zesheng Liu, Maryam Rahnemoonfar

Affiliations: Department of Computer Science and Engineering, Lehigh University; Department of Civil and Environmental Engineering, Lehigh University

AI summary: Proposes a multi-horizon graph neural network emulator that uses a shared graph backbone and an increment-prediction strategy to jointly optimize multi-step-ahead forecasts, yielding stable and accurate long-horizon emulation of geophysical systems.

Comments Accepted for International Conference on Pattern Recognition (ICPR) 2026

Abstract

Accurate long-range prediction of geophysical systems is difficult due to strongly nonlinear dynamics, the high computational cost of full-physics simulations, and the error accumulation that arises when one-step autoregressive surrogates are rolled out over decades. Deep neural networks can serve as efficient emulators, but most are trained only for next-step prediction and often drift or become unstable as the forecast horizon grows. We propose a multi-horizon graph neural network emulator that learns state-to-state transitions from a single current time to multiple future lead times within one unified model. The physical domain is represented as a graph, where nodes correspond to spatial locations with time-varying geophysical attributes and edges encode local spatial interactions. Given the current graph state, the model predicts the future evolution of key fields (ice thickness and ice velocities) at all nodes, using a shared graph backbone with separate output branches for each target variable. To improve stability, the network predicts state increments relative to the current state, which are then added back to reconstruct future states. Training jointly optimizes all lead times with a unified regression objective, and inference uses a coarse-to-fine rollout that advances with larger jumps and selectively refines with shorter jumps to reduce drift and avoid redundant computation. Experiments on multi-decadal Pine Island Glacier simulations show that our approach achieves higher long-range accuracy and improved stability than both (i) an initial-state baseline that predicts each future time directly from the starting state and (ii) a standard single-step autoregressive rollout, producing a more reliable emulator for downstream climate and sea-level studies.
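The increment prediction and coarse-to-fine rollout described in this abstract can be illustrated with a small sketch. It is only one possible reading, not the authors' implementation: the greedy "largest jump first" schedule, the function names `plan_jumps` and `rollout`, and the toy `toy_predict_delta` (a stand-in for the trained graph network) are all assumptions made for illustration.

```python
from typing import Callable, List


def plan_jumps(horizon: int, lead_times: List[int]) -> List[int]:
    """Cover `horizon` steps greedily, using the largest lead times first."""
    jumps = []
    remaining = horizon
    for lead in sorted(lead_times, reverse=True):
        while remaining >= lead:
            jumps.append(lead)
            remaining -= lead
    if remaining != 0:
        raise ValueError("horizon cannot be reached with the given lead times")
    return jumps


def rollout(state: List[float], horizon: int, lead_times: List[int],
            predict_delta: Callable[[List[float], int], List[float]]) -> List[float]:
    """Advance `state` by `horizon` steps; each jump adds a predicted increment."""
    for lead in plan_jumps(horizon, lead_times):
        delta = predict_delta(state, lead)  # the model outputs a change, not a state
        state = [s + d for s, d in zip(state, delta)]
    return state


# Hypothetical stand-in for the trained network: every node thins at a fixed rate.
def toy_predict_delta(state: List[float], lead: int) -> List[float]:
    return [-0.5 * lead for _ in state]


thickness = [100.0, 80.0, 60.0]
print(plan_jumps(13, [1, 2, 4, 8]))                           # [8, 4, 1]
print(rollout(thickness, 13, [1, 2, 4, 8], toy_predict_delta))  # [93.5, 73.5, 53.5]
```

With lead times of 1, 2, 4, and 8 steps, a 13-step forecast needs three model calls instead of the thirteen a single-step autoregressive rollout would use, which is the kind of saving the abstract attributes to larger jumps.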

2605.29951 2026-05-29 cs.AI cs.CL cs.LG cs.MM

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales, Vera Demberg

Affiliations: Max Planck Institute for Informatics; Saarland Informatics Campus; Saarland University; The University of Edinburgh; Samsung AI Center, Cambridge

AI summary: Targets the weakness of vision-language models at reasoning about implicit, cross-modal harmful semantics. Proposes the MuPHI dataset and the MuPHIRM training framework, which learns joint semantics by optimizing multi-perspective rewards, improving harm detection, reasoning quality, and out-of-distribution robustness.

Abstract

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

2605.29940 2026-05-29 cs.AI

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

Zhenlin Hu, Yan Wang, Zhen Bi, Zihao Xue, Bingyu Zhu, Longtao Huang, Xiongtao Zhang, Zeyu Yang, Zhixuan Chu, Jungang Lou

Affiliations: Huzhou Normal University; Alibaba Group; Zhejiang University; Zhejiang Key Laboratory of Intelligent Education Technology and Application

AI summary: Proposes the StreamSynth setting and the SynLearner framework, in which a model accumulates experience over a task stream and uses feedback to improve synthetic data generation.

Abstract

Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most existing studies treat synthesis as a set of isolated tasks and overlook a more fundamental question: whether a model can learn to synthesize by accumulating experience from past tasks and transferring it to future ones. In this work, we introduce StreamSynth, a new setting in which synthesis tasks arrive sequentially and experience from historical tasks provides informative signals for future synthesis. To address this setting, we propose SynLearner, a general framework that enables synthesis models to acquire reusable synthesis experience over a task stream. Instead of generating data independently for each task, SynLearner encourages the model to explore diverse synthesis patterns, learn from feedback, and balance sample quality with set-level diversity as tasks evolve. Extensive experiments across multiple benchmarks show that SynLearner effectively leverages experience from earlier tasks to improve synthesis performance on later ones, exhibiting consistent cross-task transferability. These findings provide evidence for the feasibility of StreamSynth and highlight synthetic data generation as an experience-driven process that can benefit from task streams.
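The trade-off between sample quality and set-level diversity mentioned in this abstract can be made concrete with a generic selection heuristic. This is not SynLearner, which learns from feedback over a task stream. The sketch below only shows a common greedy quality-plus-distance rule, and the names `select_diverse`, `lam`, and the toy samples are assumptions.

```python
import math
from typing import List, Tuple

# (name, quality score, embedding); all three fields are illustrative placeholders.
Sample = Tuple[str, float, Tuple[float, ...]]


def select_diverse(samples: List[Sample], k: int, lam: float = 1.0) -> List[str]:
    """Greedily pick k samples, scoring each by quality plus distance to the chosen set."""
    chosen: List[Sample] = []
    pool = list(samples)
    while pool and len(chosen) < k:
        def gain(s: Sample) -> float:
            if not chosen:
                return s[1]
            nearest = min(math.dist(s[2], c[2]) for c in chosen)
            return s[1] + lam * nearest  # lam weights diversity against quality
        best = max(pool, key=gain)
        chosen.append(best)
        pool.remove(best)
    return [s[0] for s in chosen]


samples = [
    ("a", 0.9, (0.0, 0.0)),
    ("b", 0.8, (0.1, 0.0)),  # near-duplicate of "a"
    ("c", 0.6, (1.0, 1.0)),  # lower quality, but far from "a"
]
print(select_diverse(samples, k=2))           # ['a', 'c']
print(select_diverse(samples, k=2, lam=0.0))  # ['a', 'b']
```

Setting `lam` to zero reduces the rule to pure quality ranking, which keeps the near-duplicate. A positive `lam` swaps it for the more distant sample.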

2605.29937 2026-05-29 cs.RO cs.LG

Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control

Hao Ren, Zetong Bi, Yiming Zeng, Le Zheng, Zhi Li, Zhaoliang Wan, Lu Qi, Hui Cheng

Affiliations: Sun Yat-sen University, Guangzhou, China; Insta360 Research, Shenzhen, China

AI summary: Proposes a training-free Fisher-preserving guidance method that computes Fisher-preserving updates via a low-rank Jacobian factorization and uses Truncated Fisher Denoising Sensitivity as an uncertainty signal, giving reliable and efficient trajectory prediction in visual navigation.

Comments ICML2026

Abstract

Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test time guidance can produce unreliable or inefficient trajectories when updates drift off the training manifold. We propose Fisher Preserving Guidance with Outer Product Span Projection, a training-free inference method that avoids large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.

2605.29935 2026-05-29 cs.CV cs.AI

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

Zezhong Qian, Zhao Yang, Lu Tan, Zhihao Yan, Weiyi Hong, Haizhuang Liu, Yawei Jueluo

Affiliations: Jiangsu Cytoderm Intelligent Technology Co., Ltd., China; Xi'an Jiaotong University, Xi'an, China; Tsinghua University, Beijing, China; University of Science and Technology of China, Hefei, China

AI summary: Proposes CityGen, a diffusion-based generative framework that achieves zero-label city adaptation through HD-map conditioning and city-level visual prompts, improving the cross-city robustness of autonomous driving on perception, segmentation, and planning tasks.

Abstract

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.

2605.29933 2026-05-29 cs.LG

CLUBench: A Clustering Benchmark

Feng Xiao, Dazhi Fu, Chris Ding, Jicong Fan

Affiliations: The Chinese University of Hong Kong (Shenzhen)

AI summary: Presents CLUBench, a comprehensive clustering benchmark of 24 algorithms on 131 datasets. Large-scale experiments analyze hyperparameter tuning, data types, pretrained embeddings, LLM-based clustering, and more, showing that conventional algorithms remain competitive and that pairing them with pretrained embeddings is effective and efficient.

Abstract

Clustering is a fundamental problem in data science with a long-standing research history, yielding numerous insightful algorithms. Despite this progress, a systematic and large-scale empirical evaluation that jointly considers conventional algorithms, deep learning-based methods, and recent foundation model-based clustering remains largely absent, leading to limited guidance on algorithm selection and deployment. To address this gap, we introduce CLUBench, a comprehensive clustering benchmark comprising 24 algorithms of diverse principles evaluated on 131 datasets across tabular, text, and image data, involving 178,815 experiments. Importantly, our analyses of (i) the impact of hyperparameter tuning, (ii) the impact of data types and characteristics, (iii) the impact of pretrained embeddings, (iv) large language model-based clustering, (v) the similarity of algorithms, and (vi) the low-rank structures of performance matrices yield meaningful insights and promising pathways for clustering research. For instance, our study reveals that: 1) none of the evaluated deep clustering methods exhibits a significant advantage over the top-performing conventional clustering algorithms (e.g., KMeans, SpeClu) in terms of average performance; 2) for image and text clustering tasks, combining pretrained embeddings with conventional clustering algorithms (e.g., KMeans, SpeClu) offers effective and efficient clustering; 3) clustering remains a challenging and nontrivial problem, even in the era of increasingly dominant foundation models. Moreover, we propose to use the low-rank structure in cross-model performance matrices to efficiently approximate the overall performance evaluation in practical applications. We further demonstrate the feasibility of model selection based on the performance matrices across all hyperparameter configurations.
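The idea of exploiting low-rank structure in a performance matrix can be sketched as follows: if an algorithm-by-dataset score matrix is close to low rank, a score that was never measured can be estimated from the measured ones. The abstract does not say how CLUBench does this, so the sketch is an assumption: a rank-1 fit by alternating least squares, with an illustrative function name `rank1_complete` and a toy matrix.

```python
from typing import List, Optional

Matrix = List[List[Optional[float]]]  # None marks an experiment that was not run


def rank1_complete(scores: Matrix, iters: int = 100) -> List[List[float]]:
    """Fit scores[i][j] ~ u[i] * v[j] on observed entries by alternating least squares.

    Assumes every row and every column has at least one observed entry.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        for i in range(n_rows):  # update row factors with column factors fixed
            obs = [j for j in range(n_cols) if scores[i][j] is not None]
            u[i] = sum(scores[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        for j in range(n_cols):  # update column factors with row factors fixed
            obs = [i for i in range(n_rows) if scores[i][j] is not None]
            v[j] = sum(scores[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Toy matrix: 3 algorithms x 4 datasets, exactly rank 1, with one run skipped.
scores: Matrix = [
    [0.20, 0.30, 0.40, 0.45],
    [0.30, 0.45, 0.60, 0.675],
    [0.40, 0.60, 0.80, None],  # the hidden true value is 0.9
]
completed = rank1_complete(scores)
print(round(completed[2][3], 4))
```

Real performance matrices are only approximately low rank, so a practical version would use a higher rank and regularization. The toy case is exactly rank 1 so that the recovered entry can be checked against the known value.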

2605.29932 2026-05-29 cs.LG cs.CV

Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression

Danylo Boiko, Viktoriia Mishkurova

Affiliations: Innoloft Inc.; Bogomolets National Medical University

AI summary: Proposes a treatment-conditioned diffusion framework that conditions the generative process on a patient's screening DaTscan images and levodopa equivalent daily dose over one year to predict high-fidelity future brain states, significantly outperforming the baseline in clinical fidelity.

Comments 9 pages, 5 figures, 1 table

Abstract

Forecasting the progression of neurodegenerative diseases, such as Parkinson's disease, is essential for effective long-term planning and personalized therapeutic intervention. Existing systems typically produce scalar clinical scores that ignore the rich structure of longitudinal neuroimaging, while traditional generative approaches suffer from loss of anatomical detail and blurring of subtle progression patterns. To address this, we introduce a novel treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year. The pipeline uses a Transformer-based encoder to represent non-linear, time-dependent pharmacological dynamics and optimizes generation through a multi-weight region-of-interest mask that focuses on biologically critical areas. Experimental evaluation shows that our framework maintains sharp anatomical boundaries and significantly improves clinical fidelity relative to the baseline, achieving 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM.
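A multi-weight region-of-interest mask can be read as a per-pixel weighting of the reconstruction error. The abstract does not give the exact loss, so the following is a minimal sketch under that assumption. The function name `roi_weighted_mse` and the toy 2x2 images are illustrative only.

```python
from typing import List

Image = List[List[float]]


def roi_weighted_mse(pred: Image, target: Image, weights: Image) -> float:
    """Mean squared error in which each pixel's error is scaled by its mask weight."""
    weighted_error = 0.0
    total_weight = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            weighted_error += w * (p - t) ** 2
            total_weight += w
    return weighted_error / total_weight


pred = [[0.0, 1.0], [1.0, 0.0]]
target = [[0.0, 0.0], [1.0, 1.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
roi_mask = [[1.0, 1.0], [1.0, 4.0]]  # bottom-right pixel marked as critical

print(roi_weighted_mse(pred, target, uniform))   # 0.5
print(roi_weighted_mse(pred, target, roi_mask))  # 5/7, about 0.714
```

The same two pixel errors cost more under the ROI mask because one of them falls in the up-weighted region, which is how such a mask steers training toward the areas that matter clinically.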

2605.29931 2026-05-29 cs.AI eess.AS

It's All About Speed: AI's Impact on Workflow in Music Production

Finn McClellan, Fabio Morreale

Affiliations: Waipapa Taumata Rau - University of Auckland, Auckland (Aotearoa - New Zealand); Sony AI, Barcelona (Spain)

AI summary: An ethnographic study of how AI and automation tools affect music production workflows, focusing on the experiences and attitudes of recording engineers, mixers, and producers, and analyzing the tension between speed, controllability, and creative autonomy along with ways to ease it.

Comments Audio Engineering Society Conference Paper - Presented at the AES International Conference on Machine Learning and Artificial Intelligence for Audio 2025 - September 8-10, London, UK

2605.29927 2026-05-29 cs.CL cs.AI cs.LG

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

Affiliations: Concordia University; Mila - Quebec AI Institute; University of Copenhagen; Universite Claude Bernard Lyon; McGill University

AI summary: Proposes the PlanAhead framework. Using automatic difficulty classification and a comparison of four plan representations (sequential subgoals, narrative, pseudocode, checklist), it finds that both the plan representation and the LLM that generates the plan significantly affect web-agent robustness and task success.

Comments Extended version of paper submitted to EMNLP, waiting for acceptance

Abstract

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. We then systematically evaluate 4 different plan representations (sequential subgoals, narrative, pseudocode, and checklist) on the tasks categorized as hard, across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.

2605.29926 2026-05-29 cs.LG

A Triple-Modal Contrastive Learning Framework with Sequence, Graph, and 3D Features for Drug-Target Interaction Prediction

Le Xu, Xi Zhang, Dan Luo, Ting Wang, Xuan Lin

Affiliations: School of Computer Science, Xiangtan University, Xiangtan 411105, China

AI summary: Proposes the TriMod-DTI framework, which fuses 1D sequences, 2D graphs, and 3D structures of drugs and proteins and aligns their latent representations with a triple-modal contrastive learning strategy to improve drug-target interaction prediction.

Comments 12 pages, 5 figures, ISBRA 2026

Abstract

Accurate prediction of drug-target interactions (DTI) is critical for drug discovery. Existing methods often rely on single-modal representations (e.g., sequences or graphs) or combine only two modalities, overlooking 3D structural features. To address this challenge, we propose TriMod-DTI, a triple-modal contrastive learning framework that incorporates 1D sequences, 2D graphs, and 3D structures of drugs and proteins, obtaining the universal and complementary feature representations for DTI prediction. We design a Feature Extractor to capture drug and target features across the three modalities, thereby enriching their representations. We further propose a triple-modal contrastive learning strategy to align different modal representations of the same drug or protein in the latent space. By constructing cross-modal positive and negative sample pairs, this approach enhances the model's discriminative ability. Experiments on three benchmark datasets demonstrate that TriMod-DTI outperforms state-of-the-art methods. The ablation studies validate the contributions of each modality. Moreover, case studies highlight its practical potential for DTI prediction and drug discovery.
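Aligning three modality representations of the same drug or protein with cross-modal positive and negative pairs can be sketched with a standard InfoNCE loss applied to each pair of modalities. The abstract does not specify TriMod-DTI's exact objective, so this is an assumed, simplified form (one direction per pair, cosine similarity, a hypothetical temperature of 0.1), and the toy embeddings are made up for illustration.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def cosine(a: Vec, b: Vec) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchors: List[Vec], positives: List[Vec], temperature: float = 0.1) -> float:
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j != i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def triple_modal_loss(seq: List[Vec], graph: List[Vec], struct3d: List[Vec],
                      temperature: float = 0.1) -> float:
    """Average the pairwise losses over the three modality pairs."""
    pairs = [(seq, graph), (seq, struct3d), (graph, struct3d)]
    return sum(info_nce(a, b, temperature) for a, b in pairs) / len(pairs)


# Two molecules, each embedded from its sequence, graph, and 3D structure.
seq = [(1.0, 0.0), (0.0, 1.0)]
graph = [(0.9, 0.1), (0.1, 0.9)]
struct3d = [(1.0, 0.2), (0.2, 1.0)]

aligned_loss = triple_modal_loss(seq, graph, struct3d)
shuffled_loss = triple_modal_loss(seq, graph[::-1], struct3d)  # graph views mismatched
print(aligned_loss < shuffled_loss)  # True
```

When the graph embeddings are paired with the wrong molecules, two of the three pairwise terms become large, so minimizing the loss pulls the views of the same molecule together in the latent space.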

2605.29919 2026-05-29 cs.AI cs.MA

On the Geometry of Games and their Solvers

Yaqi Sun, Julian Ma, David Mguni

Affiliations: Queen Mary University of London; University College London

AI summary: Proposes a structure-aware solver synthesis framework that learns a continuous, solver-aligned geometric representation of games, enabling adaptive equilibrium computation and revealing continuous regions of solver behaviour.

Abstract

A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.

2605.29911 2026-05-29 cs.LG cs.CV

Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation

Adam T. Müller, Philipp J. Teuffel, Konstantin Manassis, Nicolaj C. Stache

Affiliations: Heilbronn University of Applied Sciences; Center for Machine Learning; Max-Planck-Str. 39; German Aerospace Center (DLR); Institute of Space Propulsion

AI summary: Proposes a machine learning method, based on a lightweight feed-forward neural network with positional encoding, for image regression from sparse experimental measurements, reducing the physical testing needed in film cooling studies for propulsion systems.

Comments Presented at the 11th European Conference for Aeronautics and Aerospace Sciences (EUCASS), 2025, DOI: 10.13009/EUCASS2025-285

DOI: 10.13009/EUCASS2025-285

Abstract

We propose a machine learning approach for image regression from sparse experimental measurements. We show the application of the proposed method to film cooling studies in propulsion system development, aiming to reduce the need for extensive physical testing. Our method employs a lightweight feed-forward neural network with positional encoding to generate images conditioned on input parameters. Validated on real and synthetic data, it achieves high image similarity (RMSE < 8%, SSIM > 93%) while maintaining accuracy with a 30% reduction in measurements. We further propose a knowledge-informed extension for local adaptability of the generated images. This approach significantly reduces required tests while preserving high-quality data, enabling efficient optimization of coolant injector configurations with applications beyond aerospace.
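Pixelwise generation with positional encoding usually means that a small network is queried once per pixel, with the pixel coordinates expanded into sine and cosine features. The abstract does not give the encoding used, so the sketch below assumes the common Fourier-feature form. The names `positional_encoding` and `encode_query`, the number of frequencies, and the placeholder parameter values are all assumptions.

```python
import math
from typing import List


def positional_encoding(x: float, num_frequencies: int) -> List[float]:
    """Map a coordinate in [0, 1] to sin/cos features at frequencies 2^k * pi."""
    features = []
    for k in range(num_frequencies):
        angle = (2 ** k) * math.pi * x
        features.extend([math.sin(angle), math.cos(angle)])
    return features


def encode_query(row: float, col: float, params: List[float],
                 num_frequencies: int = 4) -> List[float]:
    """Build the network input for one pixel: encoded coordinates plus test parameters."""
    return (positional_encoding(row, num_frequencies)
            + positional_encoding(col, num_frequencies)
            + list(params))


# One query: pixel at normalized position (0.25, 0.5) with two placeholder parameters.
query = encode_query(0.25, 0.5, [1.2, 3.4])
print(len(query))                     # 18 = 2 coordinates * 4 frequencies * 2 + 2 params
print(positional_encoding(0.0, 2))    # [0.0, 1.0, 0.0, 1.0]
```

A feed-forward network would map each such vector to one pixel value, so a full image is produced by evaluating every pixel position for a given parameter setting, including settings that were never measured.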

2605.29900 2026-05-29 cs.LG cs.IT math.IT

OVA-IB: One vs All Information Bottleneck for Multi-Modal Alignment

Tianchao Li, Shujian Yu, Xinrui Zu, Zhaolong Wei, Jeremy Gummeson, Jack C. P. Cheng, Robert Jenssen

Affiliations: Hong Kong University of Science and Technology; Vrije Universiteit Amsterdam; UiT – The Arctic University of Norway; University of Copenhagen; Norwegian Computing Center; University of Massachusetts Amherst

AI summary: Proposes OVA-IB, a One-vs-All alignment framework based on the Information Bottleneck, which aligns an arbitrary number of modalities through a contrastive lower bound for sufficiency and a regularizer for minimality, and performs robustly on classification, regression, and cross-modal retrieval tasks.

Abstract

Contrastive learning is effective for aligning paired views or modalities, but alignment beyond two modalities remains non-trivial and comparatively underexplored. Pairwise CLIP-style losses decompose multi-modal alignment into independent two-way comparisons and therefore do not explicitly model higher-order dependencies among multiple modalities. Recent beyond-pairwise objectives approach this problem from statistical or geometric perspectives, but arbitrary-modality alignment still lacks a principled criterion for defining what each modality should preserve and compress relative to the others. We revisit arbitrary-modality alignment through the Information Bottleneck principle. In multi-modal learning, sufficiency should preserve information predictable from the remaining modalities, while minimality should compress modality-specific information not supported by them. This naturally leads to a One-vs-All view, where each modality is characterized with respect to the remaining modalities. We propose OVA-IB, an Information Bottleneck framework for arbitrary-modality alignment. OVA-IB optimizes a tractable One-vs-All contrastive lower bound for sufficiency connected to a Dual Total Correlation-style objective, uses a parameter-free geometry-aware projection score, and derives a tractable upper-bound regularizer for minimality by bounding each representation's dependence on its own input with representation distributions induced by the remaining modalities. Experiments on classification, regression, modality-agnostic evaluation, and cross-modal retrieval benchmarks demonstrate strong and robust performance.

2605.29897 2026-05-29 cs.CL

ExCAM: Explainable Cultural Awareness Metrics

Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei, Hideki Tanaka, Masao Utiyama, Steffen Eger

Affiliations: University of Mannheim; University of Technology Nuremberg; National Institute of Information and Communications Technology

AI summary: Proposes ExCAM, described as the first dedicated evaluation metric that identifies, rates, and explains cultural errors in instruction-output pairs, reaching up to 80% accuracy on a balanced test set.

Comments preprint

Abstract

Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods like food or values like behavior in stressful situations through the lens of question answering or text generation tasks. However, creating these benchmarks requires time-intensive and costly human annotations. Also, benchmarks that evaluate cultural awareness in free text are scarce and often rely on dated evaluation mechanisms. To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset comprised of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set. Therefore, ExCAM opens the pathway towards fine-grained and explainable cultural evaluation of free text.

2605.29894 2026-05-29 cs.CV

Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

Affiliations: Sun Yat-sen University; HKUST; Harbin Institute of Technology

AI summary: Proposes VisHarness, a trainable visual agent that decouples high-level perception and reasoning from low-level task execution and learns to harness heterogeneous visual expert models, solving general visual tasks through multi-turn interaction with only lightweight training.

Abstract

Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.

2605.29893 2026-05-29 cs.AI

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

冗余还是必要？检测智能体轨迹中冗余步骤的基准

Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han

发表机构 * Huawei Technologies（华为技术有限公司）； Noah's Ark Lab（诺亚方舟实验室）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）

AI总结针对LLM智能体轨迹中的冗余步骤检测问题，提出RedundancyBench基准，包含标注轨迹的数据集，并评估三种方法，发现最佳方法仅达到24.88%的检测分数。

AI中文摘要

基于LLM的智能体通过多步推理和工具使用在解决复杂任务方面表现出强大的能力。然而，现有的评估协议主要关注任务成功，忽略了智能体行为的一个关键方面：执行效率。在实践中，智能体轨迹通常包含冗余步骤，这些步骤消耗大量资源但对任务完成贡献甚微。在这项工作中，我们提出并定义了一个新的研究领域：智能体轨迹的冗余步骤检测。为了支持这一倡议，我们引入了RedundancyBench，这是一个新的基准，包含多样化的任务和精心标注的轨迹，其中每个步骤根据其对任务完成的贡献进行标记。利用RedundancyBench，我们开发并评估了3种代表性方法，以回答轨迹中的步骤是冗余还是必要的问题。我们的结果表明，即使是最优方法在检测冗余步骤方面也仅达到24.88%的分数，而有些方法的表现甚至不如随机猜测。这些结果突显了该任务的复杂性以及在该领域进一步研究的必要性。本文的代码和数据集均可在 https://anonymous.4open.science/r/RedundancyBench 获取。

英文摘要

LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate 3 representative methods to answer whether a step within a trajectory is redundant or necessary. Our results show that even the best-performing method achieves a score of only 24.88% in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://anonymous.4open.science/r/RedundancyBench.

2605.29891 2026-05-29 cs.CV

DVSM: Decoder-only View Synthesis Model Done Right

DVSM: 正确的仅解码器视图合成模型

Cheng Sun, Jaesung Choe, Min-Hung Chen, Ryo Hachiuma, Yu-Chiang Frank Wang

发表机构 * NVIDIA ； National Taiwan University（国立台湾大学）

AI总结提出仅解码器架构DVSM，通过隐式KV-cache表示场景，在相同渲染复杂度下以更少参数超越编码器-解码器变体，并利用共享权重、基础模型先验和分阶段块大小优化效率与质量，在多个基准上实现新视点合成的最优结果。

Comments Code at https://github.com/NVLabs/dvsm

AI中文摘要

近期的大型视图合成模型（LVSMs）倡导一种编码器-解码器架构，将重建和渲染分离到不同的网络中。我们重新审视了这种设计。通过控制实验，我们表明仅解码器架构（将场景隐式表示为KV-cache）在相同渲染复杂度下使用更少参数，性能优于编码器-解码器变体。进一步分析表明，在颜色输入重建网络和仅相机渲染网络之间共享权重，能更好地对齐同一视点下的特征，从而促进图像合成。基于这一发现，我们的模型DVSM进一步结合了基础模型先验和分阶段块大小调整，以改进效率与质量的权衡。我们的结果在多个基准上为新颖视图合成设立了新的最先进水平，在某些情况下，甚至在密集输入视图下优于每场景优化的3DGS。

英文摘要

Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.

2605.29889 2026-05-29 cs.CL cs.AI

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

内部表示，而非临床知识：明显的大语言模型分诊失败源于何处

David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky

发表机构 * Macquarie University（麦考瑞大学）； Politecnico di Bari（巴里理工大学）； NSW Health（新南威尔士州卫生部）； Independent Researcher（独立研究者）

AI总结本研究通过稀疏自编码器特征分析，发现大语言模型在分诊任务中表现不佳源于输出格式限制，而非临床知识表示缺陷。

Comments 9 pages main text, 27 pages total including appendices; 7 figures, 25 tables

AI中文摘要

患者语音临床分诊基准报告显示，在受限的多选输出中，消费级大语言模型存在较高的分诊不足率，但同样的案例在自由文本中得分不同。我们探究输出格式是否改变了模型的临床表示，还是仅改变了从保留表示到答案的映射。使用Gemma 3 4B/12B IT和Qwen3-8B中的稀疏自编码器（SAE）特征，我们发现相同的医学特征在两种格式下对共享临床叙述激活，但在所有模型的每个案例的多选决策标记处变得沉默。三种独立方法（自然语言自编码器言语化、决策标记logit归因和顶部特征表征）一致认为，驱动决策logit的是支架和格式特征，而非医学特征。行为上，多选惩罚在结构化和自然语言输入下均反转，选项顺序洗牌排除了位置偏差，且差距主要由偏差一个决策（模型选择与黄金答案相邻的敏锐度字母）主导，而非知识失败。因此，失败源于输出格式，而非临床表示。

英文摘要

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs under constrained multiple-choice output, yet the same cases score differently with free text. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find that the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in every case for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, option-order shuffle rules out positional bias, and the gap is dominated by off-by-one decisions (the model picks an acuity letter adjacent to the gold answer) rather than knowledge failure. Thus, the failure originates in the output format and not in the clinical representation.

2605.29888 2026-05-29 cs.LG cs.AI

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

LaRA: 面向RL后训练中数据污染的逐层表示分析

Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter, Jaehyung Kim

发表机构 * Yonsei University（延世大学）； Seoul National University（首尔国立大学）； Georgia Institute of Technology（佐治亚理工学院）

AI总结提出LaRA框架，通过逐层表示分析检测强化学习后训练中的污染数据，利用扰动敏感性、方向坍缩和局部表示刚性三个指标，优于现有输出级方法。

Comments Work in Progress

AI中文摘要

强化学习（RL）后训练已被证明能提升大型语言模型（LLMs）的推理能力。然而，关于RL后训练中数据污染问题的探索很少，这可能损害训练过程本身的泛化能力和评估可靠性。现有的检测方法主要依赖于输出级信号，如似然或熵，这对于RL训练的模型变得不可靠，因为RL通过轨迹级奖励而非token似然来塑造行为。我们提出LaRA，一个用于检测RL后训练LLMs中数据污染的逐层表示分析框架。LaRA引入了三个互补指标，测量受控扰动下的扰动敏感性、方向坍缩和局部表示刚性。我们发现污染会在各层产生渐进式的几何偏差，包括放大的扰动敏感性、更强的方向坍缩和增强的局部刚性。基于我们的发现，我们还开发了一个污染检测协议，聚合跨层和跨指标的表示级偏差。在RL训练推理模型上的实验表明，我们的协议在污染检测方面优于现有的输出级基线。

英文摘要

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of data contamination in RL post-training, which can undermine generalization and the evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics, measuring perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Based on our findings, we also develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.

2605.29886 2026-05-29 cs.CL cs.AI

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

CRITIC-R1: 学习结构化评论用于检索增强生成

Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun, Runhua Xu, Jianxin Li

发表机构 * Nankai University（南开大学）； Beihang University（北京航空航天大学）； Guangxi Normal University（广西师范大学）

AI总结提出CRITIC-R1框架，通过强化学习将RAG评论建模为结构化错误诊断问题，设计保守判断对齐和诊断质量对齐奖励函数，提升检索增强生成的答案质量。

Comments 17 pages,13 figures

AI中文摘要

检索增强生成（RAG）通过引入外部证据改进了知识密集型问答。然而，现有的RAG方法仍然存在幻觉和细微推理错误。最近的研究引入外部评论来优化RAG输出，但它们通常提供粗粒度且结构薄弱的反馈，表现出过度激进的干预，导致噪声大且不可靠的优化，限制了其纠正效果。为解决这些问题，我们提出了CRITIC-R1，一个结构化评论框架，将RAG评论制定并学习为使用强化学习（RL）的显式错误诊断问题。我们的框架将常见的RAG错误分类为多个诊断维度，包括判定、错误位置、推理分析和修复生成。为了学习这些能力，我们设计了两个奖励函数：保守判断对齐（CJA）首先鼓励校准的高层判断，同时减轻过度激进现象；而诊断质量对齐（DQA）通过门控奖励进一步改进细粒度诊断反馈。我们使用基于GRPO的RL训练评论模型，并从外部LLM教师模型收集过程级监督。在五个QA基准上的实验表明，CRITIC-R1在强RAG基线上持续提高了答案质量。我们的源代码可在 https://anonymous.4open.science/r/critic-r1-FCB0 获取。

英文摘要

Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0

2605.29885 2026-05-29 cs.LG cond-mat.dis-nn math.OC math.RT stat.ML

Open Problem: Separating Geometric and Algorithmic Compression via Cayley-Table Completion

开放问题：通过凯莱表完成分离几何压缩与算法压缩

Dongsung Huh

发表机构 * Dongsung Huh

AI总结提出凯莱表完成作为测试缺失的算法复杂度最小化归纳偏置的规范问题，并挑战社区将连续平坦性先验推广以自主发现离散算法公理。

Comments 6 pages. Submitted to the Conference on Learning Theory (COLT) 2026 Open Problem track

AI中文摘要

现代统计学习理论和深度学习主要从连续容量控制（如基于范数的正则化、间隔最大化、低秩偏置）的角度来表征泛化。虽然在连续领域非常成功，但深度学习始终无法外推精确的算法或离散代数规则，这反映出缺失了向算法复杂度最小化的归纳偏置。我们提出凯莱表完成作为这一缺失偏置的规范测试平台，作为矩阵完成的离散代数对应物。正如矩阵分解结合权重衰减产生对低线性秩的隐式几何偏置，最近的结果表明，算子值张量分解结合平坦性先验产生对精确离散结合性的隐式算法偏置。我们提出了为凯莱表建立形式化精确恢复界限的开放问题，并挑战社区将连续平坦性先验推广，以自主发现更广泛的离散算法公理，而无需组合搜索。

英文摘要

Modern statistical learning theory and deep learning characterize generalization primarily in terms of continuous capacity control (e.g., norm-based regularization, margin maximization, low-rank bias). While highly successful in continuous domains, deep learning consistently fails to extrapolate exact algorithmic or discrete algebraic rules, reflecting a missing inductive bias toward algorithmic complexity minimization. We propose the Cayley-table completion as the canonical testbed for this missing bias, serving as the discrete algebraic counterpart to matrix completion. Just as matrix factorization combined with weight decay yields an implicit geometric bias toward low linear rank, recent results demonstrate that operator-valued tensor factorizations paired with a flatness prior yield an implicit algorithmic bias toward exact discrete associativity. We pose the open problem of establishing formal exact recovery bounds for Cayley-table completion, and challenge the community to generalize continuous flatness priors to autonomously discover broader discrete algorithmic axioms without combinatorial search.
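
To make the testbed concrete, the following is a minimal sketch of how a Cayley-table completion instance could be set up. It is not taken from the paper: the choice of the cyclic group Z_5, the helper names, and the use of `None` for hidden entries are all assumptions made for illustration. It builds a group multiplication table, hides some entries, and checks the associativity axiom that a successful completion should satisfy.

```python
from itertools import product


def cyclic_cayley_table(n):
    """Cayley table of the cyclic group Z_n: entry [a][b] is (a + b) mod n."""
    return [[(a + b) % n for b in range(n)] for a in range(n)]


def is_associative(table):
    """Check (a*b)*c == a*(b*c) for every triple of a fully filled table."""
    n = len(table)
    return all(
        table[table[a][b]][c] == table[a][table[b][c]]
        for a, b, c in product(range(n), repeat=3)
    )


def mask_entries(table, hidden):
    """Return a copy with the (row, col) pairs in `hidden` replaced by None."""
    n = len(table)
    return [
        [None if (a, b) in hidden else table[a][b] for b in range(n)]
        for a in range(n)
    ]


# Illustrative instance: hide two products and keep the rest as observations.
table = cyclic_cayley_table(5)
observed = mask_entries(table, {(1, 2), (3, 4)})
```

A learner would receive `observed` and be scored on whether it recovers the hidden products exactly. `is_associative` is the kind of discrete axiom that exact recovery implies for a group table.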

2605.29881 2026-05-29 cs.CV cs.AI

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

通过屏障调控自适应闭式引导缓解视觉语言模型中的幻觉

Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

发表机构 * Indian Institute of Technology Guwahati（印度理工学院古瓦哈提分校）

AI总结提出BRACS框架，通过监测视觉注意力并仅在接地退化时进行闭式修正，无需训练即可有效减少LVLM中的物体幻觉。

AI中文摘要

大型视觉语言模型（LVLMs）经常幻觉出输入图像中不存在的物体，这主要是因为随着解码进行，视觉接地减弱。现有的推理时缓解方法在生成过程中修改logits或隐藏状态，但它们存在三个关键限制：缺乏明确的接地目标，即使在模型已经良好接地时也进行干预，以及使用固定的修正强度，无法适应接地失败的严重程度。我们提出BRACS（屏障调控自适应闭式引导），一种无需训练的引导框架，通过屏障调控自适应闭式引导解决这些问题。BRACS监测模型自身的注意力以衡量视觉接地，并仅在接地恶化时对隐藏状态进行修正。修正更新以闭式解析计算，无需训练辅助网络或重新训练模型。在LLaVA-1.5-7B和Qwen-VL-Chat上的实验表明，BRACS在幻觉基准上持续优于先前方法，将CHAIR$_s$降低9.4个点，将POPE F1提高2.7个点，同时在四个通用多模态基准上匹配或提升性能。BRACS还保持高效，运行速度为贪心解码吞吐量的80%，平均速度比基线快1.3倍。

英文摘要

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.

2605.29873 2026-05-29 cs.AI

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Moment-KV: 基于动量的解码时KV缓存压缩用于长文本生成

Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

发表机构 * Indian Institute of Technology Guwahati（印度理工学院古瓦哈提分校）

AI总结提出Moment-KV方法，利用动量驱动的时序注意力聚合在解码阶段压缩KV缓存，以提升长文本生成质量并保持解码延迟。

AI中文摘要

键值（KV）缓存仍然是大型语言模型（LLM）在长文本生成任务中部署的主要瓶颈。先前的工作通常对预填充和解码缓存应用均匀压缩，但压缩预填充缓存会破坏关键上下文从而降低性能。虽然保留预填充缓存至关重要，但解码阶段的压缩仍未被充分探索，现有方法依赖于固定的近期窗口或瞬时注意力。我们对注意力动态的分析揭示了强时间模式：关键标记在长时间范围内获得持续注意力，而局部推理涉及短暂的爆发。静态启发式方法无法捕捉这种行为，导致重要标记被过早驱逐或陈旧标记被保留。我们提出Moment-KV，一种基于动量驱动的时序注意力聚合的解码时KV缓存压缩方法。我们的方法将标记重要性建模为连续演化的状态，其中注意力通过衰减进行聚合，捕捉长期影响和近期相关性。实验表明，Moment-KV在长文本生成任务中显著提高了生成保真度（2.3-3.2%），同时保持了解码延迟。

英文摘要

Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, with existing methods relying on rigid recency windows or instantaneous attention. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation. Our method models token importance as a continuously evolving state, where attention is aggregated with decay, capturing both long-term influence and recent relevance. Experiments show that Moment-KV significantly improves generation fidelity in long-generation tasks (2.3-3.2%) while maintaining decoding latency.
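
The idea of aggregating attention with decay can be illustrated with a small sketch. The abstract does not give the update rule, so the exponential-moving-average form, the `decay` value, the fixed decode-token `budget`, and both function names below are assumptions for illustration only, not the authors' implementation.

```python
def update_scores(scores, attention, decay=0.9):
    """Momentum update: blend each running score with the newest attention weight."""
    return [decay * s + (1.0 - decay) * a for s, a in zip(scores, attention)]


def evict_decode_tokens(scores, num_prefill, budget):
    """Return the sorted indices of the tokens to keep.

    Prefill tokens (indices below num_prefill) are always kept. Among the
    decode-phase tokens, only the `budget` highest-scoring ones survive.
    """
    decode = list(range(num_prefill, len(scores)))
    kept = sorted(decode, key=lambda i: scores[i], reverse=True)[:budget]
    return list(range(num_prefill)) + sorted(kept)


# Toy run: 2 prefill tokens, 3 decode tokens, two decoding steps of attention.
scores = [0.0] * 5
for attention in ([0.1, 0.1, 0.6, 0.1, 0.1], [0.1, 0.1, 0.5, 0.0, 0.3]):
    scores = update_scores(scores, attention)
keep = evict_decode_tokens(scores, num_prefill=2, budget=2)
```

In this toy run, token 2 receives sustained attention and token 4 a recent burst, so both outrank token 3, which is evicted. A pure recency window or a single-step attention snapshot would not weigh these two signals together.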

2605.29864 2026-05-29 cs.RO

LLM-Guided Future Hypotheses for Horizon-Aware Exploration in Multi-Step Robot Manipulation

LLM引导的未来假设用于多步机器人操作中的视野感知探索

Mohammad Khoshnazar, Andrew Melnik, Michael Beetz

发表机构 * Institute of Artificial Intelligence, University of Bremen（人工智能研究所，不莱梅大学）

AI总结提出未来经验条件化（FEC）框架，利用LLM生成短期未来视频作为结构化先验，结合行为克隆和强化学习微调，提升多步机器人操作中的探索和策略适应能力。

AI中文摘要

多步机器人操作需要在场景如何演化的不确定性下行动，这使得探索和策略适应具有挑战性。我们研究了短期、任务一致的未来视频能否为控制和强化学习微调提供有用的结构化先验。我们通过未来经验条件化（FEC）形式化这一思想，这是一种简单的接口，将闭环策略条件化于短期未来视频的潜在表示上。在我们的模拟设置中，未来片段通过三个阶段生成：一个基于当前场景状态初始化的任务本体上运行的LLM推理器，一个无机器人的数字孪生展开预期物体运动，以及一个无需推理时分割的掩码自由视频扩散模型，用于合成机器人一致的未来片段。我们主要使用BC和BC+RL实例化这一未来条件化接口，并在RoboCasa和CALVIN上与无未来、GT未来、生成未来和错误未来条件下的未来条件化流式流策略（SFP）基线进行比较。生成的未来比无未来条件化提高了性能，而不匹配的未来则降低了性能，我们的BC+RL实例化实现了最强的整体结果。对CALVIN的8个任务的平均BC+RL学习曲线分析进一步表明，GT未来改进最快，生成未来比无未来更早且更高水平地改进，而错误未来在整个训练过程中保持为零。这些结果表明，短期未来视频可以在不完美的未来预测下作为探索和策略适应的有用结构化先验。https://enact2026.github.io/

英文摘要

Multi-step robot manipulation requires acting under uncertainty about how the scene will evolve, making exploration and policy adaptation challenging. We study whether short-horizon, task-consistent future videos can provide useful structured priors for control and reinforcement-learning fine-tuning. We formalize this idea through Future-Experience Conditioning (FEC), a simple interface that conditions closed-loop policies on a latent representation of a short future video. In our simulation setup, future clips are generated in three stages: an LLM reasoner operating over a task ontology initialized from the current scene state, a robot-free digital-twin rollout of the intended object motion, and a mask-free video diffusion model that synthesizes a robot-consistent future clip without requiring segmentation at inference. We instantiate this future-conditioning interface primarily with BC and BC+RL, and compare against a future-conditioned Streaming Flow Policy (SFP) baseline on RoboCasa and CALVIN under NoFuture, GTFuture, GenFuture, and WrongFuture. Generated futures improve performance over no-future conditioning, while mismatched futures degrade it, and our BC+RL instantiation achieves the strongest overall results. An average BC+RL learning-curve analysis across 8 CALVIN tasks further shows that GTFuture improves fastest, GenFuture improves earlier and to a higher level than NoFuture, and WrongFuture remains at zero throughout training. These results suggest that short-horizon future videos can serve as useful structured priors for exploration and policy adaptation under imperfect future predictions. https://enact2026.github.io/

2605.29863 2026-05-29 cs.LG

STAP: A Shuffle-Tokenized App Predictor with Ultra Long Context for Vocabulary-Free Mobile App Prediction

STAP: 一种基于洗牌令牌化的超长上下文无词汇表移动应用预测器

Chengyu Fan, Hang Liu

发表机构 * School of Nuclear Science and Technology, University of Science and Technology of China（中国科学技术大学核科学技术学院）； Department of Statistics and Finance, University of Science and Technology of China（中国科学技术大学统计与金融系）

AI总结提出STAP模型，通过洗牌机制将应用身份替换为虚拟索引，并利用超长上下文处理行为序列，实现无固定词汇表的跨数据集零样本移动应用预测。

Comments 15 pages, 9 figures, 5 tables. Preprint submitted to Expert Systems with Applications

AI中文摘要

预测用户将启动的下一个移动应用对于智能设备资源管理和主动辅助至关重要。现有模型依赖于固定的应用词汇表，这阻碍了它们在不同应用生态系统中的泛化能力。许多模型还依赖于用户特定知识，这使冷启动场景下的部署复杂化。我们提出STAP，一种基于Transformer的模型，消除了对固定词汇表的需求。STAP通过洗牌机制将真实应用身份替换为随机重新分配的虚拟索引，并通过超长上下文设计处理行为序列来补偿丢弃的语义信息。理论分析表明，在给定足够长的上下文的情况下，尽管映射是匿名的，预测分布仍收敛到正确分布。在两个来自不同大陆的数据集上的实验表明，STAP实现了强大的跨数据集零样本预测准确性——这是所有现有固定词汇表方法本质上不适用的情况——同时其在每个数据集内的冷启动性能与领先模型保持竞争力。此外，我们引入了一种部署策略，使模型在连续推理期间能够保持足够长的上下文，同时将延迟控制在可接受范围内。

英文摘要

Predicting the next mobile application a user will launch is essential for intelligent device resource management and proactive assistance. Existing models rely on fixed app vocabularies, which prevents them from generalizing across different app ecosystems. Many also depend on user-specific knowledge, which complicates deployment in cold start scenarios. We propose STAP, a Transformer-based model that eliminates the need for a fixed vocabulary. STAP replaces true app identities with randomly reassigned virtual indices via a shuffle mechanism, and compensates for discarded semantic information by processing behavioral sequences with an ultra-long context design. A theoretical analysis shows that, given a sufficiently long context, the predicted distribution converges to the correct one despite the anonymity of the mapping. Experiments on two datasets from different continents demonstrate that STAP achieves strong cross-dataset zero-shot prediction accuracy -- a setting where all existing fixed-vocabulary methods are inherently inapplicable -- while its cold start performance within each dataset remains competitive with leading models. Furthermore, we introduce a deployment strategy that enables the model to retain a sufficiently long context during continuous inference while keeping latency within acceptable bounds.
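
The shuffle mechanism described above can be sketched in a few lines. This is a hedged illustration only: the function name `shuffle_tokenize`, the `num_virtual` parameter, and the per-sequence redraw of the mapping are assumptions, since the abstract does not specify how or when indices are reassigned.

```python
import random


def shuffle_tokenize(app_sequence, num_virtual, rng):
    """Map each distinct app in one sequence to a random virtual index.

    The mapping is redrawn on every call, so an index carries no app identity
    across sequences; only the usage pattern inside the sequence survives.
    """
    apps = list(dict.fromkeys(app_sequence))  # distinct apps, first-seen order
    if len(apps) > num_virtual:
        raise ValueError("more distinct apps than virtual indices")
    indices = rng.sample(range(num_virtual), len(apps))
    mapping = dict(zip(apps, indices))
    return [mapping[app] for app in app_sequence], mapping


# Hypothetical launch history; the app names are placeholders.
rng = random.Random(0)
tokens, mapping = shuffle_tokenize(
    ["mail", "maps", "mail", "chat"], num_virtual=8, rng=rng
)
```

Because the model only ever sees virtual indices, the same tokenizer applies unchanged to a dataset with a completely different app catalogue, which is what makes the cross-dataset zero-shot setting possible in principle.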

2605.29860 2026-05-29 cs.LG cs.AI

ESPO: Early-Stopping Proximal Policy Optimization

ESPO：早期停止的近端策略优化

Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu, Zhewen Tan, Zixiang Liu, Zeming Li, Binhua Li, Yongbin Li, Tong Yang, Jieping Ye

发表机构 * Tongyi Lab（通义实验室）； Alibaba Group（阿里巴巴集团）； Peking University（北京大学）

AI总结提出ESPO算法，通过在强化学习训练大语言模型时在线检测轨迹失败并提前终止，节省计算资源并提升数学推理性能。

AI中文摘要

当大语言模型在强化学习过程中，在轨迹早期出现错误的推理步骤时，标准算法会强制其继续生成直到最大步长，从而在从未获得正奖励的令牌上浪费计算资源，并用失败后的噪声污染优势估计。我们提出ESPO（早期停止的近端策略优化），该算法能够在线检测轨迹失败并提前终止轨迹生成。在每个生成步骤中，ESPO仅利用采样过程中已计算出的logits计算一个替代遗憾值，并在平滑累积遗憾值显著超过其估计值时终止。截断轨迹被视为具有终止奖励的吸收失败状态，将负的时间差分误差集中在检测到的失败步骤附近，无需任何额外的奖励模型或人工标注。在基于DeepSeek-R1-Distill-Qwen-7B训练的数学推理任务上，ESPO在AIME 2024（46.28% vs. 45.25%）、AMC 2023（85.83% vs. 82.94%）和MATH-500（87.42% vs. 85.43%）上超越了PPO，同时累计节省了超过20%的轨迹生成令牌。

英文摘要

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% rollout tokens cumulatively.
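
A rough sketch of a logit-based early-stopping rule follows. The abstract does not define the surrogate regret or the stopping criterion, so everything here is an assumption made for illustration: the regret is taken as the gap between the top logit and the sampled token's logit, the smoothing is an exponential moving average, and a fixed `threshold` stands in for the estimated value that the paper compares against.

```python
def surrogate_regret(logits, sampled_index):
    """Hypothetical per-step regret: top logit minus the sampled token's logit."""
    return max(logits) - logits[sampled_index]


def early_stop_step(step_logits, sampled, threshold, smoothing=0.5):
    """Return the first step whose smoothed regret exceeds `threshold`.

    Returns None when the rollout is never stopped early.
    """
    smoothed = 0.0
    for t, (logits, index) in enumerate(zip(step_logits, sampled)):
        regret = surrogate_regret(logits, index)
        smoothed = smoothing * smoothed + (1.0 - smoothing) * regret
        if smoothed > threshold:
            return t
    return None


# Toy rollout over a 3-token vocabulary: the last two samples are low-logit picks.
step_logits = [[2.0, 1.0, 0.0], [3.0, 0.0, 1.0], [0.5, 4.0, 0.0], [0.0, 5.0, 1.0]]
sampled = [0, 0, 2, 0]
stop_at = early_stop_step(step_logits, sampled, threshold=2.5)
```

The rule reuses logits that sampling already produced, so it adds no extra forward pass. A single unlikely token (step 2 here) is not enough to trigger termination; the smoothed value crosses the threshold only once the poor choices persist.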

2605.29858 2026-05-29 cs.CV

Masked Diffusion Vision-Language Models for Temporal Action Localization

用于时序动作定位的掩码扩散视觉语言模型

Fengshun Wang, Zhengbo Zhang, Zhigang Tu

发表机构 * Wuhan University（武汉大学）； Singapore University of Technology and Design（新加坡科技设计大学）

AI总结提出掩码扩散视觉语言模型（MDVLM）用于时序动作定位，通过双向注意力迭代去噪联合优化语义和边界，并引入边界感知掩码和步级IoU奖励解决训练不匹配问题。

AI中文摘要

时序动作定位（TAL）需要在未修剪视频中识别目标事件并精确定位其开始和结束时间。最近的视觉语言公式改进了语义推理并支持语言条件输出，但其自回归解码器仍然从左到右生成令牌，阻止了后续语义证据修正早期时间戳预测。我们将掩码扩散视觉语言模型（MDVLM）适配到TAL，使得语义令牌和边界令牌在具有双向注意力的迭代去噪过程中保持可编辑，从而允许时间边界和语义内容共同细化。然而，直接适配会产生两个TAL特定的不匹配：标准掩码扩散训练随机均匀地破坏所有位置，但时间令牌在有足够语义上下文时更可靠；令牌级交叉熵不反映时序IoU。为了解决这些不匹配，我们引入了一个计划训练目标，该目标使用边界感知掩码和步加权重构来排练时间令牌的后期恢复，同时引入步级IoU奖励，在去噪过程中提供重叠感知监督。标准序列级交叉熵项提供基础重构信号。在ActivityNet-RTL、ActivityNet-1.3和THUMOS-14上的实验表明，MDVLM-TAL在时序推理和边界定位方面均优于自回归视觉语言基线，在更严格的时序IoU标准下尤其显著。

英文摘要

Temporal action localization (TAL) requires recognizing the target event and localizing its start and end times precisely in untrimmed videos. Recent vision-language formulations improve semantic reasoning and support language-conditioned outputs, but their autoregressive decoders still generate tokens from left to right, preventing later semantic evidence from revising earlier timestamp predictions. We adapt masked diffusion vision-language models (MDVLMs) to TAL so that semantic tokens and boundary tokens remain editable throughout iterative denoising with bidirectional attention, allowing temporal boundaries and semantic content to be refined jointly. Direct adaptation, however, creates two TAL-specific mismatches: standard masked diffusion training corrupts all positions uniformly at random, but the time tokens are more reliable when enough semantic context is available; and token-level cross-entropy does not reflect temporal IoU. To address these mismatches, we introduce a Planned Training Objective that uses boundary-aware masking and step-weighted reconstruction to rehearse the late recovery of time tokens, together with a Step-Level IoU Reward that provides overlap-aware supervision during denoising. A standard sequence-level cross-entropy term provides the base reconstruction signal. Experiments on ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14 show that MDVLM-TAL improves both temporal reasoning and boundary localization over autoregressive vision-language baselines, with especially strong gains under stricter temporal IoU criteria.
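
For readers unfamiliar with the metric, temporal IoU is the standard overlap measure between a predicted and a ground-truth time segment. The short sketch below shows the usual definition; the function name and the `(start, end)` tuple convention are illustrative choices, and it does not reproduce the paper's Step-Level IoU Reward.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds; 0.0 when they do not overlap."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Prediction 2-6 s against ground truth 4-8 s: 2 s overlap over a 6 s union.
iou = temporal_iou((2.0, 6.0), (4.0, 8.0))
```

A token-level cross-entropy loss treats a timestamp that is off by one digit the same regardless of how much the resulting segment overlaps the target, which is the mismatch the abstract points out.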

2605.29857 2026-05-29 cs.LG

Feedback-to-Rubrics: Can We Learn Expert Criteria from Inline Comments?

从反馈到评分标准：我们能从内联评论中学习专家标准吗？

Kotaro Yoshida, So Kuroki, Yuki Imajuku, Taishi Nakamura, Ryunosuke Iwai, Haruki Goda, Takuya Akiba

发表机构 * Sakana AI ； Institute of Science Tokyo（东京科学大学）

AI总结提出从内联评论中学习可复用的自然语言评分标准的方法，通过迭代优化评分标准来预测评论并支持自动修订。

2605.29856 2026-05-29 cs.CV

Building and Road Recognition in Dense Urban Informal Settlements: A Dataset and Benchmark

密集城市非正规住区中的建筑与道路识别：数据集与基准

Hongyu Long, Jiaxuan Liu, Rui Cao

发表机构 * Guangdong Provincial Project（广东省项目）； Guangzhou-HKUST(GZ) Joint Funding Program（广州-香港科技大学联合基金计划）； AI Research and Learning Base of Urban Culture Project（城市文化AI研究与学习基地项目）

AI总结针对城中村等高密度非正规住区缺乏精细标注数据的问题，构建了首个高分辨率遥感数据集DenseUIS，并评估了现有深度学习模型，揭示了其在处理密集非正规形态上的局限性，为复杂高密度环境下的精细城市制图提供了基准。

Comments 5 pages, 4 figures

AI中文摘要

作为一种普遍存在的非正规住区形式，城中村对可持续城市发展和治理提出了重大挑战。精确绘制其基础设施至关重要，然而，现有的遥感数据集主要关注正规城市环境，缺乏针对城中村典型的高密度建筑模式和狭窄道路网络的精细标注数据。为填补这一空白，我们引入了DenseUIS数据集，这是首个专门用于极度密集城市非正规住区中建筑和道路提取的高分辨率遥感数据集，覆盖了中国深圳和广州的126个城中村。此外，我们对该数据集上的最先进深度学习模型进行了全面评估。实验结果表明，现有方法在处理密集非正规住区的独特形态模式方面存在局限性，凸显了对专门方法的需求。因此，DenseUIS为推进复杂高密度非正规环境中的精细城市制图提供了一个稳健的基准。该数据集公开于 https://github.com/rui-research/DenseUIS。

英文摘要

As a widespread form of informal settlement, urban villages present significant challenges for sustainable urban development and governance. Precise mapping of their infrastructure is essential; however, existing remote sensing datasets primarily focus on formal urban environments, lacking fine-grained annotated data for the high-density building patterns and narrow road networks typical of urban villages. To address this gap, we introduce the DenseUIS dataset, the first high-resolution remote sensing dataset specifically designed for building and road extraction in extremely dense urban informal settlements, covering 126 urban villages across Shenzhen and Guangzhou in China. Furthermore, we conduct a comprehensive evaluation of state-of-the-art deep learning models on this dataset. Experimental results reveal the limitations of existing methods in handling the unique morphological patterns of dense informal settlements, underscoring the need for specialized approaches. DenseUIS therefore provides a robust benchmark for advancing fine-grained urban mapping in complex and high-density informal environments. The dataset is publicly available at https://github.com/rui-research/DenseUIS.

2605.29850 2026-05-29 cs.LG

MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding

MIRAGE：用于全脑fMRI编码的自适应多模态门控

Abdulkadir Gokce, Badr AlKhamissi, Martin Schrimpf

发表机构 * Qwen3-Omni-30B-A3B-Thinking（通义千问3- Omni-30B-A3B-Thinking）

AI总结提出MIRAGE框架，通过原生多模态骨干网络和自适应特征门控，实现全脑fMRI对自然视听刺激的高精度编码，并证明原生多模态特征优于后期融合的单模态特征。

Comments Preprint. The first two authors contributed equally

AI中文摘要

近期任务优化神经网络的进展已将编码模型确立为预测大脑对自然刺激反应的有力工具，然而现有方法大多依赖单模态表示。全模态基础模型和丰富的多模态神经数据集的出现，使得能够联合整合跨被试的视觉、听觉和语言信息的编码模型成为可能。我们提出MIRAGE，一个用于预测全脑fMRI对自然视听刺激反应的脑编码框架。MIRAGE通过原生多模态骨干网络和跨层自适应特征门控实现了最先进的性能。这些表示随后与基于transformer的脑编码器和跨皮层分区的被试特定线性头相结合。控制比较表明，原生多模态特征在架构层次和骨干网络上始终优于独立单模态特征的事后聚合。除了预测准确性，学习的注意力权重可直接检查以解释骨干网络上的模态特定门控分布，每种模态在皮层上描绘出不同的解剖模式。综合这些结果，提出了原生多模态特征的自适应逐层聚合作为全脑编码的一种可泛化、可解释且准确的方法。

英文摘要

Recent progress in task-optimized neural networks has established encoding models as a powerful tool for predicting brain responses to naturalistic stimuli, yet most existing approaches rely on unimodal representations. The emergence of omni-modal foundation models and rich multimodal neural datasets enables encoding models that jointly integrate visual, auditory, and linguistic information across subjects. We introduce MIRAGE, a brain encoding framework for predicting whole-brain fMRI responses to naturalistic audiovisual stimuli. MIRAGE achieves state-of-the-art performance via a native multimodal backbone and adaptive feature gating across layers. These representations are then combined with a transformer-based brain encoder and a subject-specific linear head over the cortical parcels. Controlled comparisons show that natively multimodal features consistently outperform post-hoc aggregation of independent unimodal features, across architectural levels and backbones. Beyond predictive accuracy, the learned attention weights are directly inspectable to interpret the modality-specific gating profile over the backbone, and each modality traces a distinct anatomical pattern across cortex. Together, these results propose adaptive layer-wise aggregation of natively multimodal features as a generalizable, interpretable, and accurate approach for whole-brain encoding.
