arXivDaily arXiv每日学术速递 周一至周五更新
2606.00656 2026-06-02 cs.LG cs.AI

Demystifying the Optimal Fair Classifier in Multi-Class Classification

揭秘多类分类中的最优公平分类器

Li Zhang, Yuyuan Li, XiaoHua Feng, Jiaming Zhang, Fengyuan Yu, Chaochao Chen

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) School of Communication Engineering, Hangzhou Dianzi University(杭州电子科技大学通信工程学院)

AI总结 本文针对多类分类中的公平性问题,提出了一种在公平约束下最优分类器的概率公式,并设计了两种属性盲算法(处理中与处理后)以逼近最优精度-公平帕累托前沿。

Comments Accepted to ICML 2026

AI中文摘要

确保不同群体之间的公平公正对待,特别是在多类分类任务中,由于机器学习模型中固有的持续偏差,构成了重大挑战。大多数现有的偏差缓解技术针对二元设置,而多维输出和复杂公平机制的存在使得它们扩展到多类场景既不直接也不有效。在本文中,我们研究了公平分类中两个基本且未解决的挑战:(i)刻画多类设置中的最优精度-公平前沿,以及(ii)设计在不同训练阶段达到此最优值的实用算法。为应对这些挑战,我们首先指定了公平约束下最优分类器的解析可处理概率公式。在此基础上,我们提出了两种属性盲算法以在实践中实施公平要求:一种是通过约简方法在训练期间进行公平干预的处理中方法,以及一种通过插件估计微调输出概率的处理后方法。理论分析表明,两种方法都收敛到最优精度-公平帕累托前沿。在多个数据集上进行的实验证明了我们的方法在平衡精度和公平性方面的优越性能。

英文摘要

Ensuring fair and equitable treatment across diverse groups, particularly in multi-class classification tasks, poses a significant challenge due to the persistent biases inherent in machine learning models. Most existing bias mitigation techniques are tailored to binary settings, and the presence of multi-dimensional outputs and complex fairness mechanisms makes their extension to multi-class scenarios neither straightforward nor effective. In this paper, we investigate two fundamental, unresolved challenges in fair classification: (i) characterizing the optimal accuracy-fairness frontier in multi-class settings, and (ii) designing practical algorithms that attain this optimum in different training phases. To tackle these challenges, we first specify an analytically tractable probabilistic formulation of the optimal classifier under fairness constraints. Building upon this, we propose two attribute-blind algorithms to enforce fairness requirements in practice: an in-processing approach for fairness intervention during training via the reduction approach, and a post-processing approach for fine-tuning output probabilities with plug-in estimation. Theoretical analysis reveals that both methods converge to the optimal accuracy-fairness Pareto frontier. Experiments conducted on multiple datasets demonstrate the superior performance of our methods in balancing accuracy and fairness.

2606.00651 2026-06-02 cs.LG cs.AI cs.CL

MESA: Improving MoE Safety Alignment via Decentralized Expertise

MESA: 通过去中心化专家提升MoE安全对齐

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对MoE架构中安全能力集中于少数专家导致的脆弱性,提出MESA框架,通过最优传输理论实现专家安全职责去中心化分配与路由细化,在保持实用性的同时提升防御性能。

Comments 18 pages, 8 figures, accepted by ICML 2026

AI中文摘要

混合专家(MoE)架构高效扩展大型语言模型(LLM),通过动态路由将输入分配给相关专家,以降低计算成本的同时增强容量,但引入了一个关键漏洞:安全稀疏性,即安全能力集中在少数专家中,使其容易受到对抗性绕过。同时,传统的对齐方法统一调整所有参数,忽略了它们的功能差异,并无意中降低了性能。为了解决这些挑战,我们提出了MESA(MoE安全对齐),一个针对基于MoE的LLM的定向对齐框架,策略性地去中心化安全责任以最大化覆盖范围,同时最小化对实用性的干扰。基于最优传输(OT)理论,MESA通过两种机制运作:(1)专家容量重新分配使用传输成本矩阵将安全职责分配给最具成本效益的专家,以及(2)动态路由细化约束路由器精确激活这些去中心化模块。实验表明,MESA在保持有用性的同时,对各种有害基准实现了稳健的防御性能。代码可在https://github.com/lorraine021/MESA获取。

英文摘要

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in a few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performance. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.
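To make the "transport cost matrix" idea concrete, the sketch below solves a small entropic optimal transport problem with Sinkhorn iterations. This is a generic OT solver, not MESA's method: the abstract does not say which solver or cost definition is used, and the duty categories, expert count, cost values, and the `reg` parameter are all hypothetical.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.5, n_iters=200):
    """Entropic OT: return a plan whose rows sum to a and columns to b."""
    K = np.exp(-cost / reg)          # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Hypothetical toy setup: 2 safety "duties" (rows) spread over 3 experts (columns).
cost = np.array([[0.1, 0.5, 0.9],
                 [0.8, 0.2, 0.4]])   # assumed cost of assigning a duty to an expert
duty_mass = np.array([0.5, 0.5])     # how much of each duty must be placed
expert_capacity = np.full(3, 1 / 3)  # equal capacity forces a spread across experts

plan = sinkhorn(cost, duty_mass, expert_capacity)
print(plan.round(3))
```

Fixing equal column marginals is what forces the mass to spread over all experts instead of concentrating on the cheapest one, which is the intuition behind decentralizing a capability.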

2606.00647 2026-06-02 cs.CL cs.AI

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

LinguIUTics 在 PsyDefDetect 中的研究:用于心理防御机制分类的迭代不平衡感知微调 Qwen3-8B

Shefayat E Shams Adib, Ahmed Alfey Sani, Md Hasibur Rahman Alif, Ajwad Abrar

发表机构 * Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh(计算机科学与工程系,伊斯兰技术大学,达卡,孟加拉国)

AI总结 针对对话文本中心理防御机制检测的类别不平衡问题,提出基于 QLoRA 微调 Qwen3-8B 的迭代不平衡感知方法,通过分组分层交叉验证、少数类轮询词汇增强和后处理流水线,在 PsyDefDetect 2026 共享任务中达到宏 F1 0.3917,排名第4。

Comments Accepted at PsyDefDetect, a shared task at the 25th BioNLP Workshop (BioNLP 2026), co-located with ACL 2026 in San Diego, CA, USA

AI中文摘要

检测对话文本中的心理防御机制仍然是一个具有挑战性的临床自然语言处理问题。针对 PsyDefDetect 2026 共享任务(九类话语分类,通过宏 F1 评估),我们的团队 LinguIUTics 在官方正类排行榜上取得了 0.3917 的宏 F1 分数,在 21 个注册团队中排名第 4,比 Ministral-8B 任务基线(宏 F1 0.3148)提高了 7.7 个绝对点(相对提升 24.4%)。由于严重的类别不平衡,BERT 系列编码器和零样本 LLM 在稀有类别上被证明无效,因此我们转向对 Qwen3-8B 进行 QLoRA 微调。我们利用三个关键策略:分组分层交叉验证(防止泄漏)、少数类轮询词汇增强,以及包含 logit 偏置调整和集成混合的后处理流水线。这些组件共同缩小了验证集与排行榜之间的差距,并显著提高了少数类的召回率,将关键的“Unclear”类别(第8级)从接近零的性能提升到 F1 分数 0.797。

英文摘要

Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (nine-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (macro F1 0.3148) by 7.7 absolute points (24.4 percent relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit bias tuning and ensemble blending. Together, these components close much of the validation-to-leaderboard gap and substantially improve minority-class recall, driving the critical "Unclear" class (Level 8) from near-zero performance to an F1 score of 0.797.
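The reported gain can be checked directly from the two macro F1 values in the abstract. The helper name `improvement` is ours, introduced only for this check.

```python
def improvement(baseline: float, score: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute_points = (score - baseline) * 100
    relative_percent = (score - baseline) / baseline * 100
    return absolute_points, relative_percent

abs_pts, rel_pct = improvement(baseline=0.3148, score=0.3917)
print(f"{abs_pts:.1f} points, {rel_pct:.1f}% relative")  # 7.7 points, 24.4% relative
```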

2606.00642 2026-06-02 cs.AI cs.CR

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

隐藏的思考并非秘密:大型语言模型中的推理痕迹暴露

Yu-An Lu, Ci-Yang Tsai, Yu-Lin Tsai, Raluca Ada Popa, Chia-Mu Yu

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学) UC Berkeley(加州大学伯克利分校)

AI总结 本文提出推理暴露提示(REP)方法,通过影子模型生成的示范以辅助代码格式包装,从受害者模型中引出用户可见的推理痕迹,显著提高暴露痕迹与内部痕迹的相似性并保留有用推理信号。

AI中文摘要

推理痕迹已成为改进和转移大型语言模型能力的有价值学习信号。特别是,详细痕迹有助于将推理行为从更强的教师模型蒸馏到较弱的学生模型。能力转移的价值促使许多部署了推理模型的系统隐藏原始内部痕迹,最多向用户暴露摘要和答案。因此,我们提出这样的问题:这种接口级别的痕迹隐藏是否能防止用户通过提示获得有用的推理监督?我们通过推理暴露提示(REP)研究这个问题,这是一种轻量级的上下文引出方法,使用影子模型生成的示范以辅助代码格式包装,从受害者模型中引出用户可见的推理痕迹。在常见的推理数据集、不同的受害者模型和不同的学生模型蒸馏中,REP显著提高了暴露痕迹与REP条件内部痕迹之间的相似性,同时保留了有用的推理信号。

英文摘要

Reasoning traces have become a valuable form of learning signal for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of capability transfer has motivated many deployed systems with reasoning models to hide raw internal traces and expose at most summaries and answers to users. As a result, we ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to elicit user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.

2606.00640 2026-06-02 cs.CV

An Attribute-Based Measure of Video Complexity

基于属性的视频复杂度度量

Aditya Sarkar, Yi Li, Zihao Wang, Jiacheng Cheng, Sai Vidyaranya Nuthalapati, Aashu Singh, Shlok Kumar Mishra, David Jacobs, Nuno Vasconcelos

发表机构 * UMIACS-University of Maryland College Park(马里兰大学帕克分校 UMIACS) University of California San Diego(加州大学圣地亚哥分校) Yale University(耶鲁大学) Meta AI

AI总结 提出VideoABC框架,通过属性空间量化估计视频-问题对在视频大语言模型上的失败概率,实现非参数复杂度度量。

AI中文摘要

提出了一种新的框架,用于估计视频-问题对给视频大语言模型带来的复杂度,即基于属性的视频复杂度(VideoABC)。视频复杂度定义为视频大语言模型在给定视频-问题对上的失败概率。VideoABC是一种非参数复杂度度量,使用参考视频数据集和预定义的视频属性词汇表(这些属性对复杂度有信息量,例如场景复杂度或与问题相关的视频事件速度)。在训练阶段,参考视频被投影到这些属性空间中,然后进行量化。计算每个量化单元的期望ABC。给定一个新视频及其在属性空间中的投影,通过关联量化单元的期望ABC来估计复杂度。为了能够使用小规模参考视频数据集,结合了两种量化器:k-means量化器(能对参考数据集分布内的样本进行准确复杂度估计)和通用格点量化器(保证对分布外样本的泛化)。受心理物理学研究中目标-干扰物操纵的启发,提出了一种合成视频生成程序,用于在训练期间填充格点量化器的单元,从而计算其期望ABC。实验结果表明,即使使用非常低维的属性表示,VideoABC也有效,其性能大大优于“视频大语言模型作为评判者”等方法,且复杂度更低。最后,VideoABC分数在定义良好的属性方面的可解释性,揭示了基准测试的属性组成如何影响其复杂度。

英文摘要

A new framework for the estimation of the complexity posed by video-question pairs to video-LLMs, Video Attribute-Based Complexity (VideoABC), is proposed. Video complexity is defined as the probability of failure of a video-LLM for a given video-question pair. VideoABC is a non-parametric complexity measure, using a reference video dataset and a pre-defined vocabulary of video attributes informative of complexity, e.g., the scene complexity or the speed of the video event informative of the question. In a training phase, reference videos are projected into the space of these attributes, which is then quantized. The expected ABC of each quantization cell is then computed. Given a new video and its projection into the attribute space, complexity is estimated by the expected ABC of the associated quantization cell. To enable the use of VideoABC with small reference video datasets, two quantizers are combined: a k-means quantizer that enables accurate complexity estimates for samples in the distribution of the reference dataset and a universal lattice quantizer that guarantees generalization to out-of-distribution samples. A synthetic video generation procedure, inspired by target-distractor manipulations of psychophysics studies, is proposed to populate the cells of the lattice quantizer during training, enabling the computation of their expected ABCs. Experimental results show that VideoABC is effective even with very low-dimensional attribute representations, substantially outperforming approaches like "video-LLM as judge" with much less complexity. Finally, the explainable nature of the VideoABC score, in terms of well-defined attributes, is shown to provide insights on how the attribute composition of benchmarks affects their complexity.
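The quantize-then-average idea can be sketched in a few lines, assuming attributes are already available as numeric vectors. Everything specific here is hypothetical: the toy data, the fixed centroids (standing in for a fitted k-means quantizer), the lattice step, and especially the rule that falls back to the lattice cell when a sample is far from every centroid. The abstract does not state how the two quantizers are actually combined.

```python
import numpy as np

def cell_means(cells, failures):
    """Average failure indicator (0 or 1) per quantization cell."""
    sums, counts = {}, {}
    for cell, f in zip(cells, failures):
        sums[cell] = sums.get(cell, 0.0) + float(f)
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

def lattice_cell(x, step=0.5):
    """Index of the cubic lattice cell containing x."""
    return tuple(int(i) for i in np.floor(np.asarray(x) / step))

def nearest_centroid(x, centroids):
    dists = np.linalg.norm(centroids - np.asarray(x), axis=1)
    j = int(np.argmin(dists))
    return j, float(dists[j])

# Toy reference set: 2 attributes per video, 1 = the video-LLM failed.
ref_x = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
ref_fail = [0, 1, 1, 1]
# Toy "synthetic" sample that fills a lattice cell the reference set misses.
syn_x = np.array([[0.9, 0.1]])
syn_fail = [0]
centroids = np.array([[0.15, 0.15], [0.85, 0.85]])  # assumed k-means output

kmeans_table = cell_means(
    [nearest_centroid(x, centroids)[0] for x in ref_x], ref_fail
)
lattice_table = cell_means(
    [lattice_cell(x) for x in np.vstack([ref_x, syn_x])], ref_fail + syn_fail
)

def estimate_complexity(x, radius=0.3, default=0.5):
    """Assumed combination rule: k-means cell in-distribution, lattice otherwise."""
    j, dist = nearest_centroid(x, centroids)
    if dist <= radius:
        return kmeans_table[j]
    return lattice_table.get(lattice_cell(x), default)

print(estimate_complexity([0.12, 0.18]), estimate_complexity([0.95, 0.05]))
```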

2606.00637 2026-06-02 cs.RO

Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

全局-局部注意力分解用于人形感知运动中的地形编码

Shengcheng Fu, Yang Zhang, Zhanxiang Cao, Liyun Yan, Yizhi Chen, Yunpeng Yin, Yue Gao

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) Humanoid Robot (Shanghai) Co., Ltd.(人形机器人(上海)有限公司)

AI总结 提出全局-局部注意力分解(GLAD)方法,通过粗到细编码器分离全局地形感知和局部立足点选择,实现人形机器人在稀疏立足点和受限环境中的鲁棒运动。

AI中文摘要

尽管强化学习显著推进了人形运动,感知策略在稀疏立足点地形和受限环境中仍然存在困难。在这些场景中成功需要广泛的地形感知和精确的立足点选择,而传统编码器常常纠缠这两种感知角色。为了解决这一挑战,我们提出了用于人形运动地形编码的全局-局部注意力分解(GLAD)。通过基于机器人中心高程图的粗到细编码器实现,GLAD明确分离了这些目标:全局注意力分支利用注意力池化总结周围地形上下文,而状态条件局部注意力分支稀疏化并编码精确的立足点相关几何。这种显式注意力分解防止了细粒度空间线索的稀释,同时减少了训练开销。实验表明,GLAD能够在具有挑战性的间隙、踏脚石和楼梯上实现可靠运动。此外,学习到的策略表现出涌现的地形响应行为,在简单速度指令下自主跟随狭窄路径并避开障碍物,无需显式导航规划器。在搭载机载LiDAR的Unitree G1人形机器人上的实际部署中,所提方法在多种稀疏立足点和障碍物丰富领域实现了鲁棒的零样本仿真到现实迁移。

英文摘要

Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and constrained environments. Success in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this challenge, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Realized by a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these objectives: a global attention branch utilizes attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit attention decomposition prevents the dilution of fine-grained spatial cues while reducing training overhead. Experiments demonstrate that GLAD enables reliable locomotion over challenging gaps, stepping stones, and stairs. Furthermore, the learned policy exhibits emergent terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands without explicit navigation planners. In real-world deployment on a Unitree G1 humanoid robot using onboard LiDAR, the proposed method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.

2606.00635 2026-06-02 cs.LG

How Neural Losses Shape VAE Latents

神经损失如何塑造VAE潜在变量

Giorgio Strano, Luca Cerovaz, Michele Mancusi, Tommaso Mencattini, Emanuele Rodolà

发表机构 * Sapienza University of Rome(罗马萨皮恩扎大学) Paradigma, Inc.(Paradigma公司) Moises Systems, Inc.(Moises系统公司) EPFL(洛桑联邦理工学院)

AI总结 本文研究感知损失和对抗损失等神经重建损失如何改变VAE的率失真问题,证明其减少潜在表示信息量并改变潜在空间几何结构,使表示更各向同性且不确定性分布更均匀。

AI中文摘要

现代VAE很少使用标准$β$-VAE目标隐含的逐点似然进行训练。在实践中,逐点重建常与感知损失和对抗损失结合使用,尽管人们尚不清楚这会如何改变模型的潜在动态。我们表明,重建损失的选择重塑了率失真问题本身,改变了所学潜在空间的信息内容和几何结构,而这些变化可能仅从重建结果中无法察觉。首先,我们证明并实证验证,用神经项(如感知和对抗目标)增强逐点重建会减少存储在潜在表示中的信息量。其次,我们展示神经重建损失系统地改变了潜在空间的几何结构:它们使表示更各向同性,并更均匀地将不确定性分布在潜在维度上,产生不同的后验方差分布。这些发现表明,率失真权衡并非理解VAE行为的全面视角,我们提出一种更偏机制性的方法来研究失真度量的选择如何重塑优化问题。

英文摘要

Modern VAEs are rarely trained with the pointwise likelihood implied by the standard $β$-VAE objective. In practice, pointwise reconstruction is often combined with perceptual and adversarial losses, despite a lack of understanding of how this changes the latent dynamics of the model. We show that the choice of reconstruction loss reshapes the rate-distortion problem itself, altering both the information content and the geometry of the learned latent space in ways that may be invisible from reconstructions alone. First, we prove and verify empirically that augmenting pointwise reconstruction with neural terms, such as perceptual and adversarial objectives, reduces the amount of information stored in the latent representations. Second, we show that neural reconstruction losses systematically change the geometry of the latent space: they make representations more isotropic and distribute uncertainty more evenly across latent dimensions, producing different posterior variance profiles. These findings highlight how the rate-distortion tradeoff is not a comprehensive lens to understand the behavior of VAEs, and we propose a more mechanistic approach to investigate how the choice of a distortion metric reshapes the optimization problem.
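For reference, the rate-distortion reading of the objective can be written out as follows. The first line is the standard $\beta$-VAE objective. The second is only an assumed generic form of a reconstruction loss augmented with neural terms: the weights $\lambda_{\mathrm{perc}}$ and $\lambda_{\mathrm{adv}}$ and the exact loss definitions are illustrative and not taken from the paper.

```latex
% Standard beta-VAE objective: distortion term plus beta times the rate term
\mathcal{L}_{\beta}(\theta,\phi)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right]}_{\text{distortion } D}
  + \beta \, \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{rate } R}

% Assumed composite distortion when neural losses are added to the pointwise term
d(x, \hat{x})
  = d_{\mathrm{pix}}(x, \hat{x})
  + \lambda_{\mathrm{perc}} \, d_{\mathrm{perc}}(x, \hat{x})
  + \lambda_{\mathrm{adv}} \, d_{\mathrm{adv}}(\hat{x})
```

Under this reading, changing $d$ changes which pairs $(R, D)$ are achievable, so the rate, i.e. the information kept in $z$, can shift even when $\beta$ is held fixed.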

2606.00634 2026-06-02 cs.CL cs.LG

French parsing enhanced with a word clustering method based on a syntactic lexicon

基于句法词典的词聚类方法增强的法语解析

Anthony Sigogne, Matthieu Constant, Eric Laporte

发表机构 * Université Paris-Est(巴黎东部大学) LIGM(加斯帕尔·蒙日信息学实验室)

AI总结 本文通过将法语句法词典(Lexicon-Grammar)的数据整合到概率解析器中,并应用聚类方法于法语树库的动词,提高了基于概率上下文无关文法的解析性能。

Journal ref Second Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), 2011, Dublin, Ireland, pp.22-27

AI中文摘要

本文评估了从法语句法词典(Lexicon-Grammar, Gross, 1994)中提取的数据整合到概率解析器中的效果。我们表明,通过对法语树库(Abeillé et al., 2003)中的动词应用聚类方法,基于概率上下文无关文法(Petrov et al., 2006)的解析器在法语上获得了准确的性能。

英文摘要

This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods on verbs of the French Treebank (Abeillé et al., 2003), we obtain accurate performances on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006).

2606.00630 2026-06-02 cs.CV stat.ML

A Systematic Benchmark of Intraoperative Ultrasound-to-MR Synthesis for Brain Tumour Surgery

脑肿瘤手术中术中超声到MR合成的系统基准测试

Olga Esteban-Sinovas, Santiago Cepeda, Ignacio Arrese, Rosario Sarabia

发表机构 * Department of Neurosurgery, Neurovascular Unit, Río Hortega University Hospital(里奥奥尔特加大学医院神经外科神经血管单元) Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC)(生物医学成像与计算分析专项组(GEIBAC)) Instituto de Investigación Biosanitaria de Valladolid (IBioVALL)(巴利亚多利德生物卫生研究所(IBioVALL))

AI总结 针对脑肿瘤手术中术中超声(ioUS)到MR图像合成问题,本研究在公共ReMIND数据集上系统比较了6种生成器、4种推理模式和2种目标,结合图像保真度指标和下游分割评估,发现感知质量(LPIPS)与下游效用最相关,而SSIM与效用负相关,SynDiff-2.5D在下游分割中表现最佳。

AI中文摘要

术中超声(ioUS)在脑肿瘤手术中是一种多功能、成本效益高的模态,但其解释困难:采集平面非标准,伪影具有模态特异性,且其外观与术前MRI(手术规划工具、分割模型和外科医生经验所依赖的)显著不同。从ioUS合成类似MRI的图像可以使基于MRI的基础设施在术中无需额外扫描即可重复使用。大多数先前的工作孤立地评估单一架构;据我们所知,没有基准测试在共同协议下涵盖架构范式、推理机制和下游任务端点。我们在公共ReMIND数据集(76名患者;153对ioUS/T2w和104对ioUS/FLAIR研究;60/16患者级训练/保留测试集划分)上填补了这一空白。六个生成器(四个GAN基线:Pix2Pix、SwinPix2Pix、CycleGAN、CUT;Transformer增强的ResViT;以及少步扩散模型SynDiff)分别在四种推理机制(2D、2.5D、2D+3D细化、全3D)和两种目标(仅T2w;T2w+FLAIR多任务)下训练,共产生48个实验。图像保真度指标(SSIM、PSNR、MAE、LPIPS)辅以nnU-Net v2下游分割评估(肿瘤和切除腔)以及按组织学分级和再次手术的亚组分析。没有一种架构在所有轴上占优,而且关键的是,感知质量与下游效用最密切相关(LPIPS,r=-0.66,p<0.001),而更高的SSIM与更差的效用相关(r=-0.64,p<0.001);SynDiff-2.5D最好地保留了下游分割(U_Dice=0.55)。因此,应报告或优先考虑感知和下游任务指标而非全局SSIM,并且架构选择应取决于手术阶段、患者病史和临床目标。

英文摘要

Intraoperative ultrasound (ioUS) is a versatile, cost-effective modality in brain tumour surgery, but its interpretation is difficult: acquisition planes are non-standard, artefacts are modality-specific, and its appearance differs markedly from the preoperative MRI on which surgical-planning tools, segmentation models and the surgeon's experience rely. Synthesising MRI-like images from ioUS could let this MRI-based infrastructure be reused intraoperatively without an extra scan. Most prior work evaluates a single architecture in isolation; to our knowledge, no benchmark has spanned architectural paradigms, inference regimes and downstream-task endpoints under a common protocol. We address this gap on the public ReMIND data set (76 patients; 153 paired ioUS/T2w and 104 paired ioUS/FLAIR studies; 60/16 patient-level train/held-out split). Six generators (four GAN baselines: Pix2Pix, SwinPix2Pix, CycleGAN, CUT; the transformer-augmented ResViT; and the few-step diffusion model SynDiff) were each trained under four inference regimes (2D, 2.5D, 2D + 3D-refinement, full-3D) and two targets (T2w only; T2w + FLAIR multi-task), yielding 48 experiments. Image-fidelity metrics (SSIM, PSNR, MAE, LPIPS) were complemented by an nnU-Net v2 downstream segmentation evaluation (tumour and resection cavity) and by subgroup analyses by histological grade and reoperation. No architecture dominated every axis, and, critically, perceptual quality tracked downstream utility most closely (LPIPS, r=-0.66, p<0.001), whereas higher SSIM was associated with worse utility (r=-0.64, p<0.001); SynDiff-2.5D best preserved downstream segmentation (U_Dice=0.55). Perceptual and downstream-task metrics should therefore be reported alongside or in preference to global SSIM, and architecture choice conditioned on surgical phase, patient history and clinical objective.
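The count of 48 experiments follows from the full factorial design described in the abstract: 6 generators, 4 inference regimes, and 2 targets. A short enumeration confirms the arithmetic; the string labels are copied from the abstract, and the variable names are ours.

```python
from itertools import product

generators = ["Pix2Pix", "SwinPix2Pix", "CycleGAN", "CUT", "ResViT", "SynDiff"]
regimes = ["2D", "2.5D", "2D + 3D-refinement", "full-3D"]
targets = ["T2w only", "T2w + FLAIR multi-task"]

# One experiment per (generator, regime, target) combination.
experiments = list(product(generators, regimes, targets))
print(len(experiments))  # 48
```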

2606.00629 2026-06-02 cs.SD cs.HC cs.LG eess.AS

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

质量音频原型:统一声音检索与程序化生成的系统原型

Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams, Israel Mason-Williams, Emmanouil Benetos, Joshua Reiss

发表机构 * GitHub

AI总结 提出QuAP系统,通过统一基于内容的音频检索和实时程序化生成,并集成规则辅助参数指导,降低声音设计中的操作距离,经主观评估和用户测试验证了其有效性和实用性。

Comments DAFx 2026

AI中文摘要

声音设计工作流经常在耗时的库搜索和复杂的程序化合成之间摇摆,从业者通常依赖独立的工具分别应对每个挑战。本文介绍了质量音频原型(QuAP),一个工作原型,它在单一界面中统一了基于内容的音频检索和程序化声音生成,减少了叙事概念与其声音实现之间的操作距离。QuAP集成了基于相似性的检索引擎与实时程序化音频模型,并辅以基于规则的助手,提供基于感知的参数指导,给出源自经验优化的定义和建议,而不需要先验的合成知识。初步评估证实了这种方法的可行性:主观评估显示六个嵌入合成模型中有五个在质量上具有统计显著性的提升,编码器消融研究在音效数据集上确立了首选的检索架构。与16名从业者的用户评估证实了该工具的工作流实用性,所有参与者一致认为参数助手在保持创作自主性的同时降低了程序化交互的门槛。

英文摘要

Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.

2606.00628 2026-06-02 cs.CL

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

通过动态令牌选择实现分布对齐的自蒸馏的鲁棒推理

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng

发表机构 * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University(北京未来区块链与隐私计算先进创新中心,北京航空航天大学) School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院)

AI总结 针对自蒸馏中参考答案引入风格偏差导致模型模仿表面形式而非学习推理模式的问题,提出分布对齐自蒸馏(DASD),通过动态过滤高困惑度令牌来保留逻辑修正并抑制风格噪声,在数学、代码和常识推理任务上提升鲁棒性。

Comments 12 pages, 13 figures

AI中文摘要

自蒸馏通过将参考答案重写为更符合模型自身分布的训练数据来提高学习效率。然而,参考答案也引入了强烈的风格偏差,导致生成模型模仿表面形式而非学习有用的推理模式。我们观察到重写数据包含大量高困惑度(PPL)令牌,这些令牌来自两个不同的来源:有益的知识增强逻辑修正,以及由参考模仿引起的有害风格漂移。平等对待所有此类令牌会破坏基础模型的原始分布并降低性能,尤其是在困难推理任务上。为了解决这个问题,我们提出了分布对齐自蒸馏(DASD),它使用答案感知的参考模型生成候选令牌,并根据基础模型的置信度动态过滤它们。DASD 保留编码有用逻辑知识的令牌,同时抑制分布不对齐的风格噪声。在数学、代码和常识推理基准上的实验表明,DASD 始终优于竞争基线,减少了高 PPL 令牌,并提高了不同难度任务的鲁棒性。

英文摘要

Self-distillation improves learning efficiency by rewriting reference answers as training data that better matches the model's own distribution. However, reference answers also introduce strong stylistic biases, causing the generative model to imitate surface forms rather than learn useful reasoning patterns. We observe that the rewriting data contains a large number of high-perplexity (PPL) tokens, coming from two distinct sources: beneficial knowledge-enhancing logical corrections, and harmful stylistic drift induced by reference imitation. Treating all such tokens equally can disrupt the base model's original distribution and degrade performance, especially on difficult reasoning tasks. To address this, we propose Distribution-Aligned Self-Distillation (DASD), which uses an answer-aware reference model to generate candidate tokens and dynamically filters them according to the base model's confidence. DASD preserves tokens that encode useful logical knowledge while suppressing distributionally misaligned style noise. Experiments on math, code, and commonsense reasoning benchmarks show that DASD consistently outperforms competitive baselines, reduces high-PPL tokens, and improves robustness across tasks of varying difficulty.
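To pin down what a "high-perplexity token" means here, the sketch below computes sequence perplexity and flags individual tokens from the probabilities a base model assigns to them. It illustrates only the measurement, not DASD's filter: the abstract does not give the rule that separates useful corrections from style drift, and the probabilities and the `max_token_ppl` threshold are made up.

```python
import math

def token_surprisals(probs):
    """Negative log-probability of each token under the base model."""
    return [-math.log(p) for p in probs]

def perplexity(probs):
    """Sequence perplexity: exp of the mean token surprisal."""
    s = token_surprisals(probs)
    return math.exp(sum(s) / len(s))

def high_ppl_mask(probs, max_token_ppl=10.0):
    """Flag tokens whose per-token perplexity 1/p exceeds the threshold."""
    return [1.0 / p > max_token_ppl for p in probs]

# Hypothetical base-model probabilities for four rewritten tokens.
base_probs = [0.9, 0.5, 0.02, 0.8]
mask = high_ppl_mask(base_probs)
print(mask)  # [False, False, True, False]
```

A filter in the spirit of the abstract would then have to decide, for each flagged token, whether to keep it as a training target; that decision rule is the part this sketch leaves out.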

2606.00622 2026-06-02 cs.CV

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

MM-Snowball:多模态多轮对话中的幻觉雪崩评估与缓解

Yue Jiang, Xue Jiang, Lihua Zhang, Zhiqiang Wang, Yuhang Lu, Peng Wang, Bo Han, Feng Zheng, Dingkang Yang

发表机构 * College of Intelligent Robotics and Advanced Manufacturing, Fudan University(复旦大学智能机器人与先进制造学院) Southern University of Science and Technology(南方科技大学) TMLR Group, Hong Kong Baptist University(香港浸会大学 TMLR 研究组) MM Lab, CUHK(香港中文大学多媒体实验室 MM Lab) RAMS Lab, Huawei Technologies Co., Ltd.(华为技术有限公司 RAMS 实验室)

AI总结 针对多模态大模型在多轮对话中因初始错误累积导致幻觉雪崩的问题,提出首个细粒度诊断基准MM-Snowball,并设计无训练的冲突感知视觉校正方法CAVR,通过表示级刷新视觉锚定和logit级修正输出分布来缓解雪崩效应。

Comments Accepted by The International Conference on Machine Learning (ICML 2026)

AI中文摘要

多模态大语言模型(MLLMs)展现出显著的视觉理解能力,但在交互环境中的可靠性受到幻觉雪崩的严重破坏:一种初始错误在对话轮次间放大,导致连贯性崩溃的现象。这种失败揭示了一个根本性的脆弱性,即模型逐渐忽视视觉锚定,转而过度依赖受污染的文本历史。现有基准主要局限于单轮VQA,无法捕捉长程交互中错误传播的复杂动态。为解决这一问题,我们引入了MM-Snowball,这是首个用于细粒度诊断对话中幻觉雪崩的基准。广泛评估表明,我们的基准对即使是先进的MLLMs也构成了重大挑战,并揭示了现有为单轮VQA设计的缓解方法的无效性。为对抗这种退化,我们提出了冲突感知视觉校正(CAVR)。这种无训练方法通过协同双机制缓解雪崩:在表示级刷新视觉锚定,并在logit级修正输出分布,有效地将模型重新锚定到视觉事实。实验表明,CAVR达到了最先进的性能,为更可靠的交互式AI提供了一条有希望的路径。数据和代码可在 https://frenkie-chiang.github.io/MM-Snowball 获取。

英文摘要

Multimodal large language models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon where initial errors amplify across conversational turns, leading to a collapse in coherence. This failure reveals a fundamental vulnerability where models progressively neglect visual grounding in favor of over-relying on polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA and thus fail to capture the complex dynamics of error propagation in long-horizon interactions. To address this, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing within dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs and reveals the inefficacy of existing mitigation methods designed for single-turn VQA. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR). This training-free method mitigates snowballing through a synergistic dual mechanism that refreshes visual grounding at the representation level and rectifies output distributions at the logit level, effectively re-anchoring the model to visual facts. Experiments demonstrate that CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI. Data and code are available at: https://frenkie-chiang.github.io/MM-Snowball

2606.00620 2026-06-02 cs.CV

FlowNar: Scalable Streaming Narration for Long-Form Videos

FlowNar: 面向长视频的可扩展流式叙述

Zeyun Zhong, Manuel Martin, Chengzhi Wu, David Schneider, Frederik Diederichs, Juergen Gall, Juergen Beyerer

发表机构 * Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院) Lamarr Institute for Machine Learning(拉马尔机器学习研究所) University of Bonn(波恩大学)

AI总结 提出FlowNar框架,通过动态上下文管理和CLAM模块实现有界视觉记忆与计算复杂度,在流式视频叙述中兼顾高质量与高效率。

Comments Accepted to ICML 2026

AI中文摘要

近期的大型多模态模型(LMMs)主要针对离线场景设计,难以适应流式视频的动态需求。虽然最近的在线适配改进了实时处理,但仍面临关键的可扩展性挑战,资源需求通常随视频时长至少线性增长。为突破这一瓶颈,我们提出FlowNar,一种用于可扩展流式视频叙述的新型框架。FlowNar的核心是一种用于历史视觉上下文移除的动态上下文管理策略,结合我们的CLAM(跨线性注意力记忆)模块用于流式视觉历史保留,确保有界的视觉内存使用和计算复杂度,这对高效流式处理至关重要。我们还引入了一个现实的自条件评估协议和补充评估指标,以在类似部署的条件下评估流式叙述模型。在Ego4D、EgoExo4D和EpicKitchens100数据集上的实验表明,FlowNar在强基线上显著提高了叙述质量,同时保持高效,支持处理10倍长的视频,并实现3倍更高的吞吐量(FPS)。代码可在https://github.com/zeyun-zhong/FlowNar获取。

英文摘要

Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy for historical visual context removal, combined with our CLAM (Cross Linear Attentive Memory) module for streaming visual history retention, ensuring bounded visual memory usage and computational complexity, crucial for efficient streaming. We also introduce a realistic self-conditioned evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting processing of 10× longer videos and achieving 3× higher throughput (FPS). The code is available at https://github.com/zeyun-zhong/FlowNar.
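Why a linear-attention memory gives bounded state can be seen in a minimal sketch. This is the generic kernelized linear attention recurrence, not the CLAM module itself, whose design the abstract does not describe. The feature map, dimensions, and class name are assumptions.

```python
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1, a common choice in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class LinearAttentionMemory:
    """Fixed-size summary of all (key, value) pairs written so far."""

    def __init__(self, d_key, d_value):
        self.S = np.zeros((d_key, d_value))  # running sum of phi(k) v^T
        self.z = np.zeros(d_key)             # running sum of phi(k)

    def write(self, k, v):
        fk = phi(k)
        self.S += np.outer(fk, v)
        self.z += fk

    def read(self, q):
        fq = phi(q)
        return fq @ self.S / (fq @ self.z + 1e-8)

rng = np.random.default_rng(0)
mem = LinearAttentionMemory(d_key=4, d_value=3)
for _ in range(1000):                        # 1000 frames, same memory footprint
    mem.write(rng.normal(size=4), rng.normal(size=3))
print(mem.S.shape, mem.read(rng.normal(size=4)).shape)  # (4, 3) (3,)
```

The state is a `d_key × d_value` matrix plus a `d_key` vector regardless of how many frames have been written, in contrast to a key-value cache that grows linearly with video length.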

2606.00619 2026-06-02 cs.CL cs.AI

MemPro: Agentic Memory Systems as Evolvable Programs

MemPro:作为可进化程序的智能体记忆系统

Qingshan Liu, Guoqing Wang, Wen Wu, Jingqi Huang, Xinqi Tao, Dejia Song, Jie Zhou, Liang He

发表机构 * East China Normal University(华东师范大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出MemPro框架,将整个记忆构建-检索管道视为可进化程序,通过故障模式引导的编辑-调试迭代优化,在多个长时任务数据集上超越静态和提示级进化基线。

Comments 20 pages, 14 figures

AI中文摘要

长时程自主智能体需要记忆系统来保留历史信息、跟踪演化状态并在有限上下文窗口之外重用相关知识。现有的智能体记忆系统通常遵循记忆构建-检索(MCR)管道,但往往主要适应记忆库,而在部署后保持周围管道固定。这种固定管道设计难以处理异构的任务特定故障模式,并且可能随着时间推移与规模和结构演化的记忆库产生错位。为解决这些限制,我们提出MemPro,一种系统级进化框架,将整个MCR管道视为可进化程序,而不仅仅是适应记忆库或提示文本。MemPro维护一个可运行记忆系统实现的版本树,其中进化智能体迭代选择有前途的版本,诊断重复出现的故障,并通过故障模式引导的编辑-调试改进创建改进的子版本。在LongMemEval、LoCoMo、HotpotQA和NarrativeQA上的实验表明,MemPro在几次迭代内持续优于强静态和提示级进化基线,随着进化持续改进,并实现了良好的性能-成本权衡。代码可在https://github.com/wanghai673/MemPro获取。

英文摘要

Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory construction-retrieval (MCR) pipeline, but often adapt mainly the memory bank while keeping the surrounding pipeline fixed after deployment. This fixed-pipeline design struggles to handle heterogeneous task-specific failure modes and can become misaligned with memory banks that evolve in scale and structure over time. To address these limitations, we propose MemPro, a system-level evolution framework that treats the entire MCR pipeline as an evolvable program rather than adapting only the memory bank or prompt text. MemPro maintains a version tree of runnable memory-system implementations, where an Evolving Agent iteratively selects promising versions, diagnoses recurring failures, and creates improved child versions through failure-mode-guided edit-debug refinement. Experiments on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA show that MemPro consistently outperforms strong static and prompt-level evolving baselines within a few iterations, continues to improve with evolution, and achieves a favorable performance-cost trade-off. Code is available at https://github.com/wanghai673/MemPro.
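The version-tree bookkeeping described above can be sketched as a small data structure. The class, field names, and scores are hypothetical, and selection here is a plain arg-max over recorded scores, which stands in for whatever selection policy the Evolving Agent actually uses.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Version:
    """One runnable memory-system implementation in the version tree."""
    name: str
    score: float                      # assumed validation score of this version
    parent: "Version | None" = None
    children: list = field(default_factory=list)

    def spawn(self, name, score):
        """Record an edited child version derived from this one."""
        child = Version(name, score, parent=self)
        self.children.append(child)
        return child

def all_versions(root):
    """Depth-first traversal of the tree."""
    stack, out = [root], []
    while stack:
        v = stack.pop()
        out.append(v)
        stack.extend(v.children)
    return out

def select_promising(root):
    """Stand-in selection rule: pick the best-scoring version anywhere in the tree."""
    return max(all_versions(root), key=lambda v: v.score)

root = Version("v0", 0.40)
v1 = root.spawn("v1", 0.55)
v2 = root.spawn("v2", 0.48)
v3 = v1.spawn("v3", 0.61)
print(select_promising(root).name)  # v3
```

Keeping the whole tree rather than only the latest version lets a later iteration branch again from an older node if a line of edits stops improving.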

2606.00618 2026-06-02 cs.AI

Efficient Test-time Inference for Generative Planning Models

生成式规划模型的高效测试时推理

Robert Gieselmann, Mihai Samson, Federico Pecora, Jeremy L. Wyatt

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出一种改进的开放-封闭列表搜索算法,结合生成模型和启发式模型,在测试时高效推理,提升生成式规划模型的解质量和计算效率。

AI中文摘要

生成式模型已成为人工智能规划的强大范式,但其性能仍受训练数据分布的限制。一种方法是通过扩展测试时计算来改进推理过程中生成的解决方案。更高效的替代方案是优化推理过程本身。在本文中,我们展示了经典开放-封闭列表(OCL)搜索的修改版本提供了这样一种高效的推理过程。我们的算法协同了两个学习组件:一个从中间状态执行快速推演的生成模型,以及一个在候选推理路径中优先排序的启发式模型。关键贡献包括新颖的探索控制机制以及将学习模型集成到OCL框架中。在多个组合规划领域中,我们的方法在计算效率和解质量上均优于神经符号搜索基线和经典求解器。

英文摘要

Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more efficient alternative is to optimize the inference process itself. In this paper, we show that a modified version of a classical Open-Closed List (OCL) search provides just such an efficient inference procedure. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from intermediate states and a heuristic model that prioritizes among candidate reasoning paths. Key contributions include novel exploration control mechanisms and integration of learned models within the OCL framework. Across multiple combinatorial planning domains, our approach outperforms both neurosymbolic search baselines and classical solvers in computational efficiency and solution quality.
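The interplay between an open list, a closed list, a heuristic, and rollouts can be sketched on a toy problem: reach a goal integer from 1 using the moves `+1` and `×2`. Both learned components are replaced by hand-written stand-ins, and the paper's exploration control mechanisms are not modeled. All function names are ours.

```python
import heapq

def successors(state):
    return [state + 1, state * 2]

def heuristic(state, goal):
    """Stand-in for a learned heuristic model: distance to the goal."""
    return abs(goal - state)

def rollout(state, goal, max_steps=8):
    """Stand-in for a generative model: a cheap greedy rollout from `state`."""
    path = []
    for _ in range(max_steps):
        if state == goal:
            return path
        state = state * 2 if state * 2 <= goal else state + 1
        path.append(state)
    return path if state == goal else None

def ocl_search(start, goal, max_expansions=1000):
    open_list = [(heuristic(start, goal), start, [start])]  # priority queue
    closed = set()                                           # expanded states
    for _ in range(max_expansions):
        if not open_list:
            return None
        _, state, path = heapq.heappop(open_list)
        if state in closed:
            continue
        closed.add(state)
        tail = rollout(state, goal)       # try to finish from here directly
        if tail is not None:
            return path + tail
        for nxt in successors(state):     # otherwise expand and keep searching
            if nxt not in closed and nxt <= goal:
                heapq.heappush(open_list, (heuristic(nxt, goal), nxt, path + [nxt]))
    return None

print(ocl_search(1, 10))  # [1, 2, 4, 8, 9, 10]
```

The rollout lets the search terminate as soon as any expanded state is close enough for the fast model to finish, while the open list decides which state to try next when a rollout fails.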

2606.00613 2026-06-02 cs.CL cs.AI

Linguistics-Aware Non-Distortionary LLM Watermarking

语言学感知的无失真LLM水印

Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han

发表机构 * Yonsei University(延世大学) Rensselaer Polytechnic Institute(伦斯勒理工学院)

AI总结 提出LUNA水印方法,通过语言自适应非失真二元锦标赛采样器,在保持文本质量的同时实现高检测性能,在12种设置中AUROC达0.9959且中位困惑度偏移仅0.045。

详情
AI中文摘要

水印应能识别语言模型输出而不降低质量或限制验证仅由模型提供者进行。多语言部署使这更加困难,因为形态、分词和书写系统的变化会改变水印证据自然进入的位置。我们引入LUNA,一种语言自适应水印,结合了无模型检测和标准随机密钥模型下的单词元(token)无失真。LUNA从外部语料库中的词性上下文估计归一化的下一词性标签熵,并用其设置无失真二元锦标赛采样器的深度;检测器从文本、分词器、词性标注器和密钥重建相同的调度。我们在六种类型学上多样的语言和两个领域上与八种主要基线进行了对比评估。LUNA在十二种设置中达到了0.9959的AUROC和最低的平均绝对中位困惑度偏移0.045;其95%自助法区间[0.022, 0.073]低于所有基线区间。LUNA还记录了最低的平均Self-BLEU、Distinct-1、surprisal和熵偏移。它是唯一同时在大多数设置中实现AUROC > 0.99和绝对中位困惑度偏移低于0.1的方法,在12种设置中的9种达到该状态,而没有任何基线在超过2种设置中达到。我们的代码可在https://github.com/Shinwoo-Park/luna_watermark获取。

英文摘要

Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo-Park/luna_watermark

2606.00611 2026-06-02 cs.AI

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

TRACE: 面向长程智能体安全的轨迹风险感知压缩

Zhepei Hong, Lin Wang, Liting Li, Haokai Ma, Junfeng Fang, Fei Shen, Dan Zhang, Xiang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学) South China Normal University(华南师范大学)

AI总结 提出轨迹风险感知压缩方法TRACE,通过压缩器-阅读器架构将长轨迹压缩为潜在证据状态,以聚合稀疏风险信号并提升长程安全检测准确率。

详情
AI中文摘要

长程LLM智能体在长轨迹中产生安全证据,其中稀疏、延迟和组合的风险信号常常逃脱局部审核。现有的轮次级或短上下文检测器难以在长时间跨度内可靠地保留和聚合此类证据。我们将长程智能体安全检测重新定义为轨迹级证据压缩,并提出面向长程智能体安全的轨迹风险感知压缩(TRACE)。TRACE采用压缩器-阅读器设计:压缩器在轨迹级监督下将完整轨迹编码为紧凑的潜在证据状态,阅读器以该潜在证据状态作为安全参考来判断原始轨迹。该设计有助于聚合分散的风险线索并减少过早的证据丢失。在ASSEBench、Pre-Ex-Bench和R-Judge上,TRACE在所有评估的骨干模型上均取得了最佳准确率,相比强基线最高提升12.6个百分点。在LongSafety上,TRACE随着上下文长度增加表现出更小的性能下降。注意力可视化和案例研究表明,压缩后的参考有助于阅读器聚焦于风险关键片段并恢复跨步证据。代码可在https://github.com/Peregrine123/TRACE_official获取。

英文摘要

Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn-level or short-context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long-horizon agent safety detection as trajectory-level evidence compression and propose Trajectory Risk-Aware Compression for Long-Horizon Agent Safety (TRACE). TRACE uses a Compressor-Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory-level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre-Ex-Bench, and R-Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk-critical segments and recover cross-step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.

2606.00609 2026-06-02 cs.LG cs.AI

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

CARE-RL:用于缓解跨领域冲突的能力感知强化学习

Rui Zhang, Xinle Wu, Yao Lu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出CARE-RL框架,结合协议感知奖励生成与能力感知优化,通过PA-GRM和DACSP方法缓解多领域强化学习中的奖励不可靠与能力干扰问题。

详情
AI中文摘要

具有可验证奖励的强化学习在面向推理的大语言模型中取得了显著进展,但由于非可验证任务中奖励不可靠以及跨领域能力干扰,将其扩展到多领域强化学习仍具挑战性。我们提出CARE-RL,将协议感知奖励生成与能力感知优化相结合,以缓解跨领域冲突。对于非可验证任务,协议感知生成式奖励模型(PA-GRM)在生成轨迹条件奖励之前构建提示级别的评估协议和模式,从而实现对开放式响应的任务自适应且可比较的评估。对于多领域优化,方向感知能力子空间投影(DACSP)从先前的强化学习阶段提取历史能力方向,并通过放大对齐分量、抑制冲突分量以及保留正交更新来调节后续更新。在数学、聊天和指令遵循基准上的实验表明,CARE-RL始终优于标准的多领域强化学习基线,在Qwen2.5-7B和Qwen3-4B上分别达到47.9和50.7的总平均分。

英文摘要

Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL to combine protocol-aware reward generation with capability-aware optimization for mitigating cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.
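The "amplify aligned, suppress conflicting, preserve orthogonal" rule mentioned in the abstract above can be pictured with a small vector decomposition. The sketch below is only an illustration under simplifying assumptions: it uses a single historical direction rather than a subspace, and the function name `modulate_update` and the scaling factors `amplify` and `suppress` are hypothetical, not values or APIs from the paper.

```python
def modulate_update(update, direction, amplify=1.5, suppress=0.2):
    """Rescale the part of `update` that lies along `direction`.

    The component parallel to `direction` is amplified when it points the same
    way and suppressed when it points the opposite way; the orthogonal
    remainder is left untouched.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(update)
    # Projection coefficient of the update onto the historical direction.
    coeff = sum(u * d for u, d in zip(update, direction)) / norm_sq
    parallel = [coeff * d for d in direction]
    orthogonal = [u - p for u, p in zip(update, parallel)]
    scale = amplify if coeff > 0 else suppress
    return [scale * p + o for p, o in zip(parallel, orthogonal)]


aligned = modulate_update([2.0, 3.0], [1.0, 0.0])
conflicting = modulate_update([-2.0, 3.0], [1.0, 0.0])
print(aligned, conflicting)
```

With the direction set to the first axis, only the first coordinate of each update is rescaled, and the second coordinate (the orthogonal part) passes through unchanged.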

2606.00606 2026-06-02 cs.CV

FiSeR: Fine-Grained Source Representations for Cross-Domain AI Image Detection

FiSeR:用于跨域AI图像检测的细粒度源表示

Shan Zhang, Yongxin He, Mingming Zhang, Huiwen Tian, Lei Ma

发表机构 * Shan Zhang, Yongxin He, Mingming Zhang, Huiwen Tian, Lei Ma(作者团队)

AI总结 针对合成图像检测器在域迁移下泛化能力差的问题,提出层次对比学习框架FiSeR,通过粗粒度和细粒度对比目标联合优化,在跨域评估中平均AUROC提升+10.22。

详情
AI中文摘要

现实世界的合成图像检测器在域内表现强劲,但在域迁移下通常泛化能力差。通过无监督UMAP投影,我们发现自然和合成特征在未见数据集上仍部分可分,但性能仍然下降,表明分类头过度拟合训练域伪影。因此,关键在于学习更具迁移性的表示,使决策标准对域迁移更稳定和鲁棒。基于合成图像由多种生成器生成的结构事实,我们提出一个层次对比学习框架,在保留生成器身份信息的同时提高自然和合成图像之间的可分离性。它联合优化(i)自然和合成图像之间的粗粒度对比目标和(ii)使用生成器身份的合成图像之间的细粒度对比目标。在WildFake上训练,我们的方法在跨域评估中,在与强基线DIRE相同的设置下,在Chameleon、AIGIBench、Community Forensics和GenImage上平均AUROC提升+10.22。对于少样本适应,我们冻结骨干网络,并在每类10个标记样本上拟合SVM头,在12个广泛使用的检测器上平均,AIGIBench的AUROC提升+10.64,Chameleon提升+17.41。我们的代码公开在:https://github.com/heyongxin233/FiSeR。

英文摘要

Real-world synthetic image detectors often generalize poorly under domain shift despite strong in-domain performance. Using unsupervised UMAP projections, we find that natural and synthetic features remain partially separable on unseen datasets, yet performance still drops, suggesting that the classification head overfits to training-domain artifacts. Therefore, the key is to learn more transferable representations so that the decision criterion is more stable and robust to domain shifts. Based on the structural fact that synthetic images are produced by diverse generators, we propose a hierarchical contrastive learning framework that improves the separability between natural and synthetic images while preserving generator identity information. It jointly optimizes (i) a coarse contrastive objective between natural and synthetic images and (ii) a fine contrastive objective among synthetic images using generator identities. Trained on WildFake, our method achieves an average AUROC gain of +10.22 on cross-domain evaluation over Chameleon, AIGIBench, Community Forensics, and GenImage under the same settings as the strong baseline DIRE. For few-shot adaptation, we freeze the backbone and fit an SVM head on 10 labeled samples per class, improving AUROC by +10.64 on AIGIBench and +17.41 on Chameleon, averaged over 12 widely used detectors. Our code is publicly available at: https://github.com/heyongxin233/FiSeR.

2606.00605 2026-06-02 cs.LG stat.ML

Looped Transformers with Layer Normalization Provably Learn the Power Method

带有层归一化的循环Transformer可证明地学习幂方法

Lyumin Wu, Chenyang Zhang, Yuan Cao

发表机构 * School of Computing & Data Science, The University of Hong Kong(计算与数据科学学院,香港大学)

AI总结 本文通过主成分预测任务,证明带有层归一化的循环线性Transformer在梯度下降训练下会收敛到实现幂方法的解,揭示了层归一化带来的算法隐式偏差。

Comments 70 pages, 8 figures

详情
AI中文摘要

Transformer在广泛的应用中取得了显著成功,越来越多的研究表明其部分优势来自于学习和执行算法程序的能力。然而,我们对Transformer如何学习此类算法的理解仍然有限,尤其是在存在层归一化(LN)的情况下。在这项工作中,我们研究主成分预测作为理解带有LN的Transformer训练动态的具体测试平台。我们证明,通过梯度下降训练的带有LN的循环线性Transformer收敛到实现幂方法的解,其中每个自注意力层执行一次幂迭代。值得注意的是,模型仅针对主成分预测进行训练,而非明确监督其实现幂方法。因此,我们的发现揭示了带有LN的循环Transformer的“算法隐式偏差”:主成分预测原则上可以通过多种机制实现,但梯度下降选择了实现幂方法的一种。我们进一步提供了带有和不带有LN的Transformer之间的具体比较:即使有幂迭代的逐层指导,没有LN的Transformer也无法精确学习幂方法,而带有LN的对应Transformer可以,导致主成分预测中可证明的性能差距。据我们所知,我们的结果首次对带有LN的循环和单层Transformer的训练动态进行了理论分析,并阐明了LN在Transformer模型中的作用。

英文摘要

Transformers have achieved remarkable success across a wide range of applications, and a growing body of work suggests that part of their strength comes from their ability to learn and execute algorithmic procedures. However, our understanding of how transformers learn such algorithms remains limited, especially in the presence of layer normalization (LN). In this work, we study principal component prediction as a concrete testbed for understanding the training dynamics of transformers with LN. We prove that a looped linear transformer with LN, trained by gradient descent, converges to a solution that implements the power method, with each self-attention layer performing one power iteration. Notably, the model is trained only for principal component prediction, rather than being explicitly supervised to implement the power method. Our finding thus reveals an "algorithmic implicit bias" of looped transformers with LN: principal-component prediction can in principle be achieved by many mechanisms, yet gradient descent selects one that realizes the power method. We further provide a concrete comparison between transformers with and without LN: even with layerwise guidance from power iterations, a transformer without LN cannot exactly learn the power method, whereas the corresponding transformer with LN can, leading to a provable performance gap in principal component prediction. Our results provide, to our knowledge, the first theoretical analysis of the training dynamics of looped and single-layer transformers with LN, and shed light on the role of LN in transformer models.
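For readers unfamiliar with the power method referenced in the abstract above, here is a minimal pure-Python sketch of classical power iteration on a small symmetric matrix. Reading the per-step normalization as loosely analogous to layer normalization is our illustrative framing, not the paper's construction; the starting vector and iteration count are arbitrary assumptions, and the start is assumed not to be orthogonal to the leading eigenvector.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def power_method(A, num_iters=50):
    """Approximate the leading eigenvector of A by repeated multiply-and-normalize."""
    v = [1.0] + [0.0] * (len(A) - 1)  # first basis vector as the starting point
    for _ in range(num_iters):
        v = normalize(matvec(A, v))   # one power iteration
    return v


A = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues 3 and 1; leading eigenvector along (1, 1)
v = power_method(A)
print(v)
```

Each loop iteration is one multiply-then-normalize step, which is the unit of computation the abstract attributes to a single self-attention layer in the looped model.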

2606.00602 2026-06-02 cs.CV

ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

ASAP: 基于解剖感知语义自适应预训练的医学体素表示学习

Rongsheng Wang, Fenghe Tang, Zihang Jiang, Yingtai Li, Xu Zhang, Haoran Lai, Wenxin Ma, Wei Wei, Zhiyang He, Xiaodong Tao, Rui Yan, Qingsong Yao, Shaohua Kevin Zhou

发表机构 * School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China(中国科学技术大学生命科学与医学部生物医学工程学院) Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE) Lab, YRD-RIGHT, USTC Suzhou Institute for Advanced Research(中国科学技术大学苏州高等研究院YRD-RIGHT,医学影像、机器人、分析计算与学习(MIRACLE)实验室) Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology(江苏省多模态数字孪生技术重点实验室) Biomedical Basic Research Center (BBRC) of Jiangsu Province(江苏省生物医学基础研究中心) Department of Radiology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC(中国科学技术大学生命科学与医学部,中国科学技术大学附属第一医院放射科) Anhui IFLYTEK CO., Ltd(安徽科大讯飞股份有限公司) School of Medicine, Stanford University(斯坦福大学医学院) State Key Laboratory of Precision and Intelligent Chemistry, Hefei, Anhui, China(精准智能化学全国重点实验室,中国安徽合肥)

AI总结 提出ASAP框架,通过解剖感知知识注入、语义自适应对齐与融合,从胸部CT扫描和放射学报告中学习可迁移且可解释的体素表示,在15个数据集和22个下游任务上取得最先进性能。

Comments MICCAI 2025 extension

详情
AI中文摘要

从医学体素扫描中学习可迁移和可解释的表示仍然具有挑战性,因为解剖结构复杂,而放射学报告提供的监督信号较弱且异质。在本文中,我们提出了解剖感知语义自适应预训练(ASAP),一个用于从大规模胸部CT扫描及其对应放射学报告中进行细粒度医学体素表示学习的原理性视觉-语言预训练框架。ASAP集成了三个关键组件:(1)解剖感知知识注入模块,通过现成的分割工具融入器官级结构先验,以促进解剖上一致的表示;(2)语义自适应选择性对齐机制,动态地将句子级别的发现与局部体素区域关联;(3)语义自适应融合模块,在双模态掩码建模范式下,实现融入解剖信息的视觉特征与有依据的文本线索之间的有效交互。除了方法论贡献外,我们还为胸部CT上的医学体素视觉-语言预训练建立了一个全面的基准,涵盖15个数据集和22个下游任务,包括异常分类、分割、疾病预后预测、报告生成、词汇分类、跨模态检索和视觉问答。该基准提供了标准化的评估协议,以系统评估在不同临床场景和数据规模下的表示质量。大量实验表明,ASAP在各项任务和数据集上一致地实现了最先进的性能,在有限监督和分布偏移下提升尤其显著,验证了其在学习可迁移和临床有意义的体素表示方面的有效性。

英文摘要

Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and the weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval, and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.

2606.00596 2026-06-02 cs.CL

Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

面向负责任且基于认识论的多语言大语言模型在计算社会科学与人文学科中的应用

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 本文将多语言推理大语言模型重新概念化为解释学工具,并提出一个有理论依据的框架,用于评估其在社会科学与人文学科研究中的文化对齐、跨语言稳定性和推理忠实性。

Journal ref Proceedings of LLMs4SSH Workshop at LREC 2026, Palma de Mallorca, Spain, 2026

详情
AI中文摘要

大语言模型在多语言能力和推理能力方面迅速发展,使其能够整合到社会科学与人文学科研究流程中。然而,现有的评估范式仍然基于任务型NLP基准,未能解决解释有效性、文化情境性和认识论中介问题。本文重新概念化多语言推理大语言模型为解释学工具,这些工具在不同语言和文化背景下积极构建意义生产。借鉴解释学、技术哲学、科学技术研究、多语言NLP研究和计算社会科学方法论,我们为评估社会科学与人文学科研究中的多语言推理开发了一个理论基础的框架。我们阐述了一个严格的实验协议,包含文化对齐、跨语言稳定性和推理忠实性的可操作化指标,以及针对解释性研究任务定制的透明度要求。我们通过一个涉及多语言政治话语分析的具体应用场景来说明该框架。本文为将多语言推理大语言模型负责任地整合到计算社会科学基础设施中提供了概念和方法论基础。

英文摘要

Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in task-based NLP benchmarks and fail to address interpretive validity, cultural situatedness, and epistemic mediation. This paper reconceptualizes multilingual reasoning LLMs as hermeneutic instruments that actively structure meaning production across linguistic and cultural contexts. Drawing on hermeneutics, philosophy of technology, science and technology studies, multilingual NLP research, and computational social science methodology, we develop a theoretically grounded framework for evaluating multilingual reasoning in Social Sciences and Humanities (SSH) research. We articulate a rigorous experimental protocol with operationalized metrics for cultural alignment, cross-lingual stability, and reasoning faithfulness, along with transparency requirements tailored to interpretive research tasks. We illustrate the framework through a concrete application scenario involving multilingual political discourse analysis. The paper contributes a conceptual and methodological foundation for responsible integration of multilingual reasoning LLMs into computational social science infrastructures.

2606.00593 2026-06-02 cs.CL cs.AI

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER: 面向多答案问答的逐步同伴优势与多样性感知探索奖励

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) School of Software and Microelectronics, Peking University(北京大学软件与微电子学院) School of Software Technology, Zhejiang University(浙江大学软件技术学院)

AI总结 提出SPADER强化学习框架,通过逐步同伴优势(SPA)机制和多样性感知探索奖励,解决多答案问答中长程工具使用的细粒度信用分配与持续探索问题,实验表明在多个数据集上提升了召回率和F1分数。

详情
AI中文摘要

大型语言模型越来越多地被部署为工具增强型智能体,以获取参数知识之外的信息。虽然最近的工作改进了长程工具使用推理,但大多数方法专注于具有单一正确答案的任务。相比之下,许多现实世界中的查询需要发现一组全面的有效答案,这种设置被称为多答案问答。这种设置带来了两个挑战:长搜索轨迹上的细粒度信用分配,以及超越简单高频实体的持续探索的奖励对齐。我们提出了SPADER,一个用于多答案问答中长程工具使用的强化学习框架。SPADER包括逐步同伴优势(SPA),一种无评论家的逐步信用分配机制,它通过决策步骤对齐并行轨迹,并根据同伴回报估计优势。它还包括一个多样性感知探索奖励,通过加权稀有发现和降低冗余发现的权重来促进长尾实体发现。在QAMPARI、Mintaka、WebQSP和QUEST上的实验表明,SPADER通常比基于提示的智能体、结果监督的强化学习方法和最近的逐步监督方法提高了召回率和整体F1分数。我们的代码和模型权重可在https://github.com/KhanCold/spader获取。

英文摘要

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.
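The abstract above describes Step-wise Peer Advantage as aligning parallel trajectories by decision step and estimating advantages from peer returns, without a critic. One way to picture that idea is the sketch below. It rests on assumptions not stated in the source: per-step rewards are available, the "return" at a step is the undiscounted return-to-go, the baseline is a leave-one-out mean over the other trajectories that reach the same step, and a step with no peers gets zero advantage. The names `returns_to_go` and `stepwise_peer_advantages` are hypothetical.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the trajectory."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def stepwise_peer_advantages(group_rewards):
    """Per-step advantage = own return-to-go minus the mean over peers at that step."""
    rtg = [returns_to_go(r) for r in group_rewards]
    advantages = []
    for i, traj in enumerate(rtg):
        adv = []
        for t, value in enumerate(traj):
            # Peers are the other trajectories that are long enough to reach step t.
            peers = [other[t] for j, other in enumerate(rtg)
                     if j != i and t < len(other)]
            baseline = sum(peers) / len(peers) if peers else value
            adv.append(value - baseline)
        advantages.append(adv)
    return advantages


group = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0, 1.0]]
adv = stepwise_peer_advantages(group)
print(adv)
```

Because the baseline comes from sibling rollouts of the same query, no learned value function is needed, which matches the "critic-free" description; the exact alignment and weighting used by SPADER may differ from this sketch.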

2606.00592 2026-06-02 cs.CV

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

透过PRISM:原则感知、可解释和多尺度的视觉设计评估

Mona Gandhi, KJ Joseph, Srinivasan Parthasarathy, Sayan Nag

发表机构 * Ohio State University(俄亥俄州立大学) Adobe Research(Adobe研究院)

AI总结 提出PRISM基准和一种多尺度评估框架,通过原则扰动和分层分析实现可解释的设计质量评估。

Journal ref CVPR 2026 Findings

详情
AI中文摘要

有效的视觉传达源于多个设计原则的和谐,如可读性、对比度、对齐、重叠和连贯性,这些原则共同支配着传达者的清晰度和意图。虽然人类设计师会整体性地考虑这些原则,但机器智能体通常将它们压缩成一个单一的启发式分数,提供有限的可解释性和诊断精度。为了解决这一差距,我们引入了PRISM(原则感知、可解释和结构引导的设计修改),这是一个基准,它沿着可测量的设计原则系统地扰动Crello数据集中的专业布局。该基准包含10万个扰动训练样本和1万个扰动验证设计,每个样本隔离特定的原则违规,以进行关于设计质量的多模态推理的受控分析。我们表明,像Qwen-2.5-VL和GPT-4o-mini这样的模型对有针对性的原则退化在很大程度上不敏感,而GPT-4o表现出全局意识但缺乏细粒度的解耦。基于这些见解,我们提出了一个多尺度评估框架,该框架集成了用于定量评估的轻量级评分器、用于局部反馈的指令调优视觉语言模型以及用于全局推理的基于提示的方法。我们的框架提供了设计失败的可解释解释。利用这些局部见解,我们展示了改善布局质量的有针对性的改进。PRISM和我们的框架共同为可解释的、具有设计素养的多模态推理系统奠定了基础。

英文摘要

Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern clarity and intent of the communicator. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision. To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement. Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM

Improving Visual Representation Alignment Generation with GRPO

利用GRPO改进视觉表示对齐生成

Shentong Mo, Sukmin Yun

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Hanyang University(汉阳大学)

AI总结 提出VRPO方法,通过强化学习将静态对齐损失替换为生成式表示策略优化目标,动态平衡表示一致性与生成质量,在扩散Transformer中实现更快的收敛和更高的图像保真度。

详情
AI中文摘要

最近的扩散Transformer展示了强大的图像合成能力,但由于生成表示与判别表示之间的弱对齐,训练效率仍然较低。虽然表示对齐框架(如REPA)通过将噪声去噪特征与预训练视觉编码器对齐来改善收敛,但其外部监督的对齐损失是静态的,在训练和推理过程中缺乏自适应性。现有方法依赖于固定的余弦对齐或对比目标,无法动态平衡表示一致性和生成质量,导致判别收益有限,且无法以任务自适应方式优化对齐。为了解决这个问题,我们提出了VRPO,一种基于强化学习的优化策略,用生成式表示策略优化目标取代REPA的静态对齐损失。VRPO不强制执行固定的相似性约束,而是将表示对齐视为一个奖励引导的过程:模型根据生成保真度、感知质量以及扩散特征与预训练视觉嵌入之间的语义一致性获得自适应奖励。这种公式使生成器能够不断优化其内部表示,朝向有语义意义的方向,同时提高图像质量。我们的VRPO驱动训练无缝集成到扩散Transformer中,引入可忽略的计算成本,并保持与SiT和DiT架构的完全兼容性。在ImageNet-256x256上的大量实验表明,我们的VRPO-Alignment显著提高了收敛速度和保真度,在相同计算预算下,与REPA相比,FID提升高达1.8,训练速度加快2.3倍。

英文摘要

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.

2606.00582 2026-06-02 cs.AI

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

PropLLM:面向网络故障诊断的传播感知场景重建

Zongzong Wu, Ming Zhao, Fengxiao Tang, Nei Kato

发表机构 * National Natural Science Foundation of China(国家自然科学基金委员会) High Performance Computing Center of Central South University(中南大学高性能计算中心)

AI总结 提出PropLLM,首次将逐跳场景重建范式与LLM生成推理能力结合,通过双知识图谱和时序因果传播注意力机制,从端点告警回溯定位根因并确定故障类型,在真实Wi-Fi多模态故障数据集上诊断准确率提升3.9%,根因定位准确率提升4.7%,幻觉率降低50.8%。

详情
AI中文摘要

网络故障沿着拓扑和协议依赖关系逐层传播,然而运维系统通常只观察到传播链末端的症状告警,此时不同的根因故障可能产生高度相似的端点症状。现有方法(无论是基于规则、机器学习还是大语言模型)本质上都是将告警集一次性映射到诊断结果,在结构上无法解决这种端点歧义性。本文提出PropLLM,首次将逐跳场景重建范式与LLM的生成推理能力相结合。从端点告警出发,PropLLM沿着传播路径逐跳回溯,在每一跳从双层知识图谱中检索可验证的事实证据,同时提出的时序因果传播注意力机制将已知的拓扑因果先验直接编码到注意力计算中,引导模型沿着正确的因果方向前进,最终通过完全基于证据的因果链定位根因并确定故障类型。在真实Wi-Fi多模态故障数据集上,PropLLM的故障类型诊断准确率比最强基线提升3.9%,根因定位准确率提升4.7%,幻觉率降低50.8%。在TeleLogs 5G数据集上的补充实验进一步证明了所提方法在不同网络场景下的有效性。

英文摘要

Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic alerts at the tail end of propagation chains, where distinct root-cause faults may produce highly similar end-point symptoms. Existing approaches, whether rule-based, machine learning (ML)-based, or large language model (LLM)-based, fundamentally map the alert set to a diagnosis in a single pass and are structurally incapable of resolving this end-point ambiguity. This paper proposes PropLLM, which is the first to integrate the hop-by-hop scene reconstruction paradigm with the generative reasoning capabilities of LLMs. Starting from end-point alerts, PropLLM traces back hop-by-hop along the propagation path, retrieving verifiable factual evidence from a dual-layer knowledge graph (KG) at each hop, while the proposed Temporal Causal Propagation Attention (TCPA) mechanism encodes known topological causal priors directly into the attention computation to guide the model along the correct causal direction, ultimately localizing the root cause and determining the fault type through a fully evidenced causal chain. On a real-world Wi-Fi multimodal fault dataset, PropLLM improves fault type diagnosis accuracy by 3.9% and root cause localization accuracy by 4.7% over the strongest baseline, while reducing the hallucination rate by 50.8%. Supplementary experiments on the TeleLogs 5G dataset further demonstrate the effectiveness of the proposed method across different network scenarios.

2606.00579 2026-06-02 cs.CL cs.CV

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

沙盒化编码智能体是竞争性的全模态任务求解器

Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou

发表机构 * University of Maryland(马里兰大学) MBZUAI

AI总结 本文提出沙盒化编码智能体,仅通过文本+图像访问和工具使用,即可在全模态任务中匹配甚至超越原生全模态模型,并通过技能注入和训练配方Code-X进一步提升性能。

Comments Paper under review

详情
AI中文摘要

随着多模态大语言模型越来越多地针对视频和音频,人们通常认为这类任务需要原生全模态模型。我们表明情况并非总是如此:仅具有文本+图像访问权限和沙盒化工具使用接口的编码智能体,可以在多个音频-视频基准测试中匹配,并在若干设置中超越最先进的原生全模态模型和预定义的多模态智能体框架。我们的轨迹分析表明,它们的优势来自于编写代码和编排工具,以从转录、帧和其他模态信号中提取相关证据,从而将全模态任务转化为检索和信息处理问题,而不是摄取整个媒体流。我们进一步通过失败分类和过程级轨迹分析来刻画它们的局限性,并表明简单的技能注入(包括人工编写和自蒸馏的技能)能显著提高性能。为了探索如何在开源模型上激发这一能力,我们引入了Code-X,一种包含OmniCoding轨迹数据集和可验证奖励的训练方案,并在Qwen-3.5-9B和Qwen-3.6-27B上提供了基线。最后,我们认为下一个前沿是更多模态(many-modality)的处理,并引入了TerminalBench-O,一个用于现实世界全模态处理任务的过程级基准。代码将在https://github.com/Dongping-Chen/OmniCoding提供。

英文摘要

As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.

2606.00576 2026-06-02 cs.RO

Dynamic Resilient Spatio-Semantic Memory with Hybrid Localization for Mobile Manipulation

面向移动操作的动态弹性空间语义记忆与混合定位

Zhijie Yan, Shufei Li, Ze Zhang, Xin Liu, Yuhang Zheng, Zuoxu Wang

发表机构 * School of Mechanical Engineering and Automation, Beihang University(北京航空航天大学机械工程及自动化学院) Department of Systems Engineering, City University of Hong Kong(香港城市大学系统工程系) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

AI总结 提出DREAM框架,通过在线构建空间语义体素记忆、冗余感知记忆剪枝和混合定位,实现无预建地图的动态室内移动操作,将长时任务成功率提升至55%-70%。

Comments Code, CAD model, and real-robot demonstrations are available at https://bjhyzj.github.io/dream-web

详情
AI中文摘要

动态室内环境中的可靠移动操作需要一种场景表示,该表示在环境变化时保持几何一致性、可语义查询且计算量有界。现有系统通常依赖预建地图、静态场景假设或高精度相机位姿,当目标物体被重新放置或位姿估计被修正时,可能导致场景信息过时或错位。本文提出DREAM,一个真实机器人移动操作框架,它集成感知、记忆、定位、导航和操作,在无预建地图的未知室内环境中运行。DREAM利用由LiDAR-惯性-视觉SLAM后端配准的RGB-D观测构建在线空间语义体素记忆。它进一步引入位姿图感知的冗余感知记忆剪枝(RMP),在位姿修正后更新历史观测,同时保持长时观测历史有界。对于目标定位和重新获取,DREAM结合语言条件3D检索、开放词汇图像检测和基于多模态大语言模型的语义验证。在四个动态室内实验室场景中的真实机器人实验表明,DREAM将长时任务成功率从DynaMem的40%-60%提升至55%-70%,同时在各场景中保持0.37-0.63 GB的内存占用和0.43-0.53秒的在线记忆更新时间。

英文摘要

Reliable mobile manipulation in dynamic indoor environments requires a scene representation that remains geometrically consistent, semantically queryable, and computationally bounded as the environment changes. Existing systems often rely on pre-built maps, static-scene assumptions, or highly accurate camera poses, which can lead to stale or misaligned scene information when target objects are relocated or pose estimates are corrected. This paper presents DREAM, a real-robot mobile manipulation framework that integrates perception, memory, localization, navigation, and manipulation in previously unseen indoor environments without a pre-built map. DREAM constructs an online spatio-semantic voxel memory from RGB-D observations registered by a LiDAR-inertial-visual SLAM backend. It further introduces pose-graph-aware Redundancy-Aware Memory Pruning (RMP) to update historical observations after pose corrections while keeping long-horizon observation history bounded. For target localization and reacquisition, DREAM combines language-conditioned 3D retrieval, open-vocabulary image detection, and multimodal large language model based semantic verification. Real-robot experiments in four dynamic indoor laboratory scenes show that DREAM improves long-horizon task success rates from 40%-60% with DynaMem to 55%-70%, while maintaining a memory footprint of 0.37-0.63 GB and an online memory-update time of 0.43-0.53 s across scenes.

2606.00573 2026-06-02 cs.LG

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

LASER: 面向高效低精度视觉-语言模型的损失感知奇异值分解与秩分配

Haiyu Wang, Yutong Wang, Leshu Li, Yihui Ren, Sai Qian Zhang

发表机构 * Tandon School of Engineering, New York University(纽约大学工程学院) Courant Institute of Mathematical Sciences, New York University(纽约大学数学科学学院) Brookhaven National Laboratory(布鲁克海文国家实验室)

AI总结 提出LASER框架,通过损失感知的奇异值分解和跨层秩分配,结合混合量化方案,实现视觉-语言模型在低精度推理下的高效压缩与加速。

详情
AI中文摘要

视觉-语言模型(VLM)具有强大的多模态推理能力,但其巨大的计算开销和高参数数量使得在资源受限设备上部署面临挑战。低秩分解已成为一种有前景的压缩技术,然而现有方法通常优化局部矩阵重建误差,依赖均匀或启发式的秩分配,并且主要关注注意力投影,而前馈网络尚未得到充分探索。在本文中,我们提出LASER(Loss-Aware Singular-value dEcomposition and Rank allocation),一种面向高效低精度VLM推理的低秩压缩框架。LASER从模型损失的二阶近似推导出曲率加权的SVD目标,并使用Kronecker分解的Fisher信息来引导分解朝向下游性能而非单纯的重建。我们进一步引入基于校准梯度的损失感知跨层秩分配策略,使得跨层的参数预算分配更加有效。最后,我们通过一种结合SVD与量化的混合方案,将低秩压缩扩展到FFN层。评估结果表明,LASER在低精度推理下相比先前工作实现了超过2.3倍的解码加速,同时保持了强大的准确性。

英文摘要

Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has emerged as a promising compression technique, yet existing methods often optimize local matrix reconstruction error, rely on uniform or heuristic rank allocation, and focus mainly on attention projections while leaving feed-forward networks underexplored. In this paper, we propose LASER (Loss-Aware Singular-value dEcomposition and Rank allocation), a low-rank compression framework for efficient low-precision VLM inference. LASER derives a curvature-weighted SVD objective from a second-order approximation of the model loss and uses Kronecker-factored Fisher information to guide decomposition toward downstream performance rather than reconstruction alone. We further introduce a loss-aware cross-layer rank allocation strategy based on calibration gradients, enabling more effective parameter budgeting across layers. Finally, we extend low-rank compression to FFN layers through a hybrid scheme that combines SVD with quantization. The evaluation results show that LASER achieves more than 2.3× decoding speedup over previous work while preserving strong accuracy under low-precision inference.
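
The abstract does not spell out the curvature-weighted objective, so the following is only a generic sketch of one common way to weight an SVD by Kronecker-factored curvature, not the authors' implementation. It assumes a linear layer y = Wx whose Fisher information is approximated as A ⊗ G, with A the input second-moment matrix and G the output-gradient second-moment matrix; under that assumption the second-order loss change for a perturbation ΔW is proportional to ||G^{1/2} ΔW A^{1/2}||_F^2. The function names, the damping term `eps`, and the random calibration statistics are all hypothetical.

```python
import numpy as np

def psd_sqrt_and_inv(M, eps=1e-6):
    """Return (M^{1/2}, M^{-1/2}) of a symmetric PSD matrix, damped by eps."""
    w, V = np.linalg.eigh(M)
    w = np.maximum(w, 0.0) + eps          # clip round-off negatives, add damping
    sqrt = (V * np.sqrt(w)) @ V.T
    inv_sqrt = (V / np.sqrt(w)) @ V.T
    return sqrt, inv_sqrt

def weighted_error(W, W_hat, G, A):
    """Curvature-weighted error ||G^{1/2} (W - W_hat) A^{1/2}||_F."""
    Gs, _ = psd_sqrt_and_inv(G)
    As, _ = psd_sqrt_and_inv(A)
    return np.linalg.norm(Gs @ (W - W_hat) @ As)

def weighted_low_rank(W, G, A, rank):
    """Rank-`rank` factors (B, C) minimising the weighted error above.

    W: (out, in) weight, G: (out, out) and A: (in, in) curvature factors
    (assumed to come from calibration data; hypothetical here).
    """
    Gs, Gis = psd_sqrt_and_inv(G)
    As, Ais = psd_sqrt_and_inv(A)
    # Whiten, truncate with a plain SVD (Eckart-Young), then un-whiten.
    U, s, Vt = np.linalg.svd(Gs @ W @ As, full_matrices=False)
    B = Gis @ (U[:, :rank] * s[:rank])    # (out, rank)
    C = Vt[:rank] @ Ais                   # (rank, in)
    return B, C
```

Because the truncation happens in the whitened space, the product `B @ C` is never worse than a plain truncated SVD of `W` when both are scored with `weighted_error`, although it can be worse in unweighted reconstruction error.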

2606.00572 2026-06-02 cs.LG

Spatiotemporal Multi-Task Graph Transformer for Trip-Level Transit Prediction

时空多任务图Transformer用于行程级公交预测

Oluwaleke Yusuf, Adil Rasheed, Frank Lindseth

发表机构 * Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU)(挪威科技大学(NTNU)工程控制论系) Department of Computer Science, Norwegian University of Science and Technology (NTNU)(挪威科技大学(NTNU)计算机科学系)

AI总结 提出SMT-GraphFormer,一种将行程级公交预测建模为序列到序列问题的时空多任务图Transformer,通过图嵌入、上下文编码器和多门控混合专家模块,在挪威特隆赫姆公交数据上优于站点级基线方法。

Comments 25 pages, 7 figures, 11 tables, including appendix. Code available at https://github.com/Outsiders17711/SMTGraphFormer

详情
AI中文摘要

来自公共交通系统的乘客计数数据揭示了城市出行模式,对于规划、运营和优化至关重要。然而,站点和线路之间的非线性时空相互依赖性使得建模和预测具有挑战性。现有方法通常依赖于固定的时间、空间或站点级公式,限制了它们捕捉行程内演变和网络上下文的能力。本研究提出了SMT-GraphFormer,一种时空多任务图Transformer,将行程级公交预测构建为序列到序列建模。给定一条线路的站点序列和行程级上下文,模型预测连续的上下车人数,并将延误和停靠时间作为编码器侧的辅助任务。关键组件包括用于多关系站点相似性的图嵌入、用于天气和时间信息的上下文编码器,以及一个多门控混合专家模块,该模块为上下车预测生成任务特定的解码器表示。对挪威特隆赫姆的公共公交数据进行评估表明,SMT-GraphFormer优于站点级表格基线,消融研究考察了每个组件的贡献。序列化公式在下车预测上取得了显著提升(R²提高0.24),并在上车、延误和停靠时间上持续改进,证实了显式行程级序列偏置和目标间依赖性的价值。这些发现展示了基于Transformer的序列建模在捕捉公共交通复杂时空动态方面的潜力,并强调了针对公交数据定制的架构相对于现成表格模型的价值。所提出的框架为数字孪生环境中的场景分析提供了与预测范围无关的基础,支持规划者和公交运营商的知情决策。

英文摘要

Passenger count data from public transit systems reveals urban mobility patterns and is essential for planning, operation, and optimisation. However, non-linear spatiotemporal interdependencies across stops and lines make modelling and prediction challenging. Existing approaches often rely on fixed temporal, spatial, or stop-level formulations, limiting their ability to capture within-trip evolution and network context. This study proposes SMT-GraphFormer, a spatiotemporal multi-task graph transformer that frames trip-level transit prediction as sequence-to-sequence modelling. Given a line's stop sequence and trip-level context, the model predicts successive boarding and alighting counts, with delay and dwell time treated as encoder-side surrogate tasks. Key components include graph embeddings for multi-relational stop similarity, a context encoder for weather and temporal information, and a multi-gate mixture-of-experts module that produces task-specific decoder representations for boarding and alighting predictions. Evaluation on public bus transit data from Trondheim, Norway, shows that SMT-GraphFormer outperforms stop-level tabular benchmarks, with ablation studies examining each component's contribution. The sequential formulation yields substantial gains on alighting prediction (+0.24 in R²) and consistent improvements on boarding, delay, and dwell, confirming the value of explicit trip-level sequential bias and inter-target dependencies. These findings demonstrate the potential of transformer-based sequence modelling for capturing complex spatiotemporal dynamics in public transit and underscore the value of architectures tailored to transit data rather than off-the-shelf tabular models. The proposed framework provides a horizon-agnostic basis for scenario analysis in digital twin environments, supporting informed decision-making by planners and transit operators.
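
For readers unfamiliar with the multi-gate mixture-of-experts pattern mentioned in the abstract, the sketch below shows the general idea in NumPy: a pool of experts is shared across tasks, and each task has its own softmax gate that mixes the expert outputs into a task-specific representation. This is a generic illustration only, not the SMT-GraphFormer architecture; the layer sizes, the tanh experts, the task names used as dictionary keys, and the function names are all assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mmoe_forward(x, expert_weights, gate_weights):
    """Generic multi-gate mixture-of-experts forward pass.

    x:              (batch, d_in) shared input features
    expert_weights: (n_experts, d_in, d_hid), one linear map per expert
    gate_weights:   dict task name -> (d_in, n_experts) gate matrix
    Returns (outputs, gates), both dicts keyed by task name.
    """
    # Every expert sees the same input; experts are shared by all tasks.
    experts = np.tanh(np.einsum("bi,eih->beh", x, expert_weights))
    outputs, gates = {}, {}
    for task, Wg in gate_weights.items():
        g = softmax(x @ Wg)                              # (batch, n_experts)
        gates[task] = g
        # Task-specific representation: gate-weighted sum of expert outputs.
        outputs[task] = np.einsum("be,beh->bh", g, experts)
    return outputs, gates
```

Calling `mmoe_forward` with two gate matrices, for example under the hypothetical keys `"boarding"` and `"alighting"`, returns one representation per task from the same expert pool, so the tasks share parameters but can weight the experts differently.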