arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.29531 2026-05-29 cs.SD cs.CV cs.LG

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

S. Sutharya, Remya K. Sasi

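Affiliations: Department of Computer Science

AI summary: Proposes CAFNet, a model that jointly detects partially forged audio through ternary classification and boundary regression, reaching 92.71% accuracy and a 0.075 s localisation error on the MLADDC dataset.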

Comments 13 pages, 5 figures, 11 tables

Abstract

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

2605.29525 2026-05-29 cs.LG

Learning to Perturb Hidden Representations for Generalizable Deep Learning

Hua Li

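Affiliations: Henan University

AI summary: Proposes Learning to Perturb Activations (LPA), which adaptively perturbs hidden-layer activations using class-level perturbations learned via PGD to improve generalization, outperforming existing methods on balanced classification, long-tail classification, and domain generalization.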

Abstract

Deep neural networks process data through a cascade of representations: input features, hidden activations, logits, and loss. While perturbations at the input, logit, and label levels have been systematically studied, the intermediate hidden activations, which constitute the bulk of the network's computation, have received no unified perturbation analysis. In this paper, we establish a unified framework for hidden activation perturbation, revealing that Dropout, Manifold Mixup, adversarial feature perturbation, and related methods all impose specific forms of activation perturbation but with class-agnostic or random strategies. We conjecture that expansive perturbation (increasing activation norm) acts as positive augmentation, while contractive perturbation (decreasing activation norm) acts as negative augmentation, and that the perturbation layer determines whether the effect resembles input-level augmentation (shallow layers) or logit-level manipulation (deep layers). We propose Learning to Perturb Activations (LPA), which adaptively perturbs activations at a selected hidden layer with class-level perturbations learned via PGD. We further provide theoretical analysis connecting activation perturbation to flat minima and perturbation amplification through layers. Experiments on balanced classification, long-tail classification, and domain generalization demonstrate that LPA consistently outperforms existing methods and provides complementary benefits to logit perturbation methods such as LPL.

2605.29523 2026-05-29 cs.LG

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

Eunbyeol Cho, Yunseung Lee, Mirae Kim, Jeewon Yang, Youngjun Kwak, Edward Choi

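Affiliations: KAIST AI; Financial Tech Lab; KakaoBank Corp

AI summary: Proposes the K-FinHallu benchmark, which uses constructed multi-turn dialogues and a hierarchical hallucination taxonomy to evaluate LLM hallucination detection in Korean financial RAG, and finds that even the strongest models perform poorly on fine-grained financial diagnostics and justified abstention.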

Abstract

Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.

2605.29522 2026-05-29 cs.AI

DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation

Ziyue Yang, Da Ma, Hanqi Li, Zijian Wang, Tiancheng Huang, Zijian Hu, Chenrun Wang, Yunzhe Zhang, Xiaobao Wu, Kai Yu, Lu Chen

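Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China; Suzhou Laboratory, Suzhou, China

AI summary: Proposes DeepSurvey, an agentic system that deepens analysis through structured full-text notes, cross-paper relationship modeling, and code-repository analysis, and improves citation reliability through citation-graph expansion with hybrid filtering, evidence-constrained citation assignment, and multi-granularity agentic refinement, surpassing existing methods in content quality and citation accuracy.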

Abstract

As scientific literature grows rapidly, automated survey generation has become a key capability for AI scientists and human researchers. However, existing systems suffer from limited analytical depth due to reliance on abstracts and isolated paper processing, and from unreliable citations due to imprecise retrieval and post-hoc grounding, producing superficial surveys that may mislead researchers. We present DeepSurvey, an agentic system that addresses both. To enhance depth, DeepSurvey extracts structured keynotes from full-text papers, models cross-paper relationships through clustering and comparative analysis, and integrates code-repository analysis to recover implementation-level details. To fortify reliability, it combines citation-graph expansion with hybrid filtering for topic-focussed retrieval, enforces evidence-constrained citation assignment, and deploys multi-granularity agentic refinement to validate citation-claim alignment. Experiments show that DeepSurvey achieves the highest content score (8.644/10) and citation quality (12.3% and 9.3% recall and precision gains over the strongest baseline), generalizes more robustly across domains (0.14 vs 0.22 to 0.69 CS-to-non-CS drop), and is preferred over human-written surveys by domain experts (83.3% overall quality, 100% content depth).

2605.29512 2026-05-29 cs.AI

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

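Affiliations: NeurIPS 2025 Competition

AI summary: Proposes MINDGAMES, a multi-game arena that evaluates the social reasoning and strategic ability of LLM agents across four game environments, revealing a rule-adherence bottleneck and differences in leaderboard validity.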

Abstract

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.

2605.29507 2026-05-29 cs.AI cs.IR

Xetrieval: Mechanistically Explaining Dense Retrieval

Zhixin Cai, Jun Bai, Yang Liu, Jiaqi Li, Yichi Zhang, Taichuan Li, Zhuofan Chen, Zixia Jia, Zilong Zheng, Wenge Rong

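Affiliations: School of Computer Science and Engineering, Beihang University; State Key Laboratory of General Artificial Intelligence, BIGAI

AI summary: Proposes Xetrieval, a framework that mechanistically explains why dense retrieval models assign high relevance scores, using an embedding-level reasoning internalizer and a decomposition into sparse interpretable features.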

Comments Code: https://github.com/Hihiczx/Xetrieval ; Project page: https://hihiczx.github.io/Xetrieval

Abstract

Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, Xetrieval provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval .

2605.29505 2026-05-29 cs.CV

ESAM++: Efficient Online 3D Perception on the Edge

Qin Liu, Lavisha Aggarwal, Saptarashmi Bandyopadhyay, Vikas Bahirwani, Marc Niethammer, Ehsan Adeli, Andrea Colaco

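Affiliations: Stanford University; Google; UC San Diego

AI summary: Proposes ESAM++, a lightweight and scalable method for online 3D scene perception that uses a 3D Sparse Feature Pyramid Network (SFPN) to achieve efficient and accurate 3D instance segmentation on edge devices.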

Abstract

Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3 times faster inference and a 2 times smaller model size than ESAM, enabling practical deployment on edge devices.

2605.29502 2026-05-29 cs.CL cs.AI

Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation

Zeli Su, Ziyin Zhang, Zewei Pan, Zhou Liu, Dingcheng Huang, Dehan Li, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

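Affiliations: Minzu University of China; Ant Group; Shanghai Jiao Tong University; Peking University; Harbin Institute of Technology; South China University of Technology

AI summary: Proposes Source-Grounded Semantic Reinforcement Learning (SG-SRL), which exploits source-language monolingual data through a cross-lingual semantic reward model and adds a lightweight recovery stage to address reward hacking, improving semantic grounding and factual coverage in low-resource target-language generation.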

Abstract

Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.

2605.29500 2026-05-29 cs.LG cs.AI

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

Ziwen Xie, Shaowen Xiang, Hongyu He, Dianbo Liu

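Affiliations: Shanghai Jiao Tong University; National University of Singapore

AI summary: Proposes a quotient-DAG view that merges equivalent histories using forward-flow ratios, enabling exact computation of unordered slate propensities, reducing variance, and improving computational efficiency.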

Comments 31 pages, 3 figures, 7 tables

Abstract

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressive slate loggers.
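
The subset dynamic program described above can be illustrated with a small sketch. It assumes the set-sufficient interface named in the abstract, so the next-item probability depends only on the set of items already chosen. The function names, the toy weight-proportional policy, and the bitmask layout are illustrative assumptions, not the authors' implementation.

```python
from itertools import permutations


def unordered_slate_propensity(slate, next_item_prob):
    """Probability that a sequential sampler emits `slate` in any order.

    next_item_prob(chosen, item) is a hypothetical interface: it returns the
    probability of picking `item` next given the frozenset `chosen`.
    """
    items = list(slate)
    k = len(items)
    # flow[mask] = total probability of reaching the subset encoded by mask
    flow = [0.0] * (1 << k)
    flow[0] = 1.0
    for mask in range(1 << k):  # every predecessor of a mask is numerically smaller
        if flow[mask] == 0.0:
            continue
        chosen = frozenset(items[i] for i in range(k) if (mask >> i) & 1)
        for j in range(k):
            if not (mask >> j) & 1:
                flow[mask | (1 << j)] += flow[mask] * next_item_prob(chosen, items[j])
    return flow[(1 << k) - 1]


def brute_force_propensity(slate, next_item_prob):
    """Reference answer: sum the sequence probability over all k! orders."""
    total = 0.0
    for order in permutations(slate):
        prob, chosen = 1.0, frozenset()
        for item in order:
            prob *= next_item_prob(chosen, item)
            chosen = chosen | {item}
        total += prob
    return total


# Toy policy (assumption): sample without replacement, proportional to weight.
WEIGHTS = {"a": 4.0, "b": 2.0, "c": 1.0, "d": 1.0}


def toy_policy(chosen, item):
    remaining = sum(w for i, w in WEIGHTS.items() if i not in chosen)
    return WEIGHTS[item] / remaining
```

The dynamic program visits 2^k subsets with at most k transitions each, instead of enumerating k! orders. An importance weight for a logged slate would then be the ratio of this quantity under the target policy to the same quantity under the behavior policy.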

2605.29498 2026-05-29 cs.CL cs.CV

Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting

Runze Xu, Arpit Garg, Hemanth Saratchandran, Simon Lucey

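Affiliations: Australian Institute for Machine Learning

AI summary: To address catastrophic forgetting in LoRA fine-tuning when the target distribution differs substantially from the original training distribution, proposes a replay-free output-space regularizer that masks the target token and applies KL regularization only over the non-target vocabulary, improving the trade-off between new learning and forgetting with no added inference overhead.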

Comments In Submission

Abstract

Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the model's original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA-based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output-space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base model's relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base model's original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.
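
The regularizer described above can be sketched for a single token position. The KL direction (base to adapted), the function names, and the per-position formulation are assumptions made for illustration; the abstract does not specify them. Dropping a token and renormalizing the remaining probabilities is equivalent to a softmax over the remaining logits, which is what the sketch uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def drop_and_renormalize(logits, target):
    """Distribution over the non-target vocabulary, renormalized to sum to 1."""
    rest = [x for i, x in enumerate(logits) if i != target]
    return softmax(rest)


def masked_kl(base_logits, adapted_logits, target):
    """KL(base || adapted) over the vocabulary with the target token removed.

    The direction of the divergence is an assumption of this sketch.
    """
    p = drop_and_renormalize(base_logits, target)
    q = drop_and_renormalize(adapted_logits, target)
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0.0)


# Raising only the target logit leaves the penalty at zero, so the term does
# not push against the cross-entropy signal on the ground-truth token.
base = [2.0, 1.0, 0.5, -1.0]
only_target_moved = [2.0, 1.0, 0.5, 4.0]
others_moved = [0.0, 3.0, 0.5, -1.0]
penalty_free = masked_kl(base, only_target_moved, target=3)
penalty_paid = masked_kl(base, others_moved, target=3)
```

In training, a term like this would be added to the usual cross-entropy with a weighting coefficient (a hypothetical hyperparameter here), with the base-model logits computed without gradient.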

2605.29497 2026-05-29 cs.LG

Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption

Santanu Das, Sagnik Chatterjee, Jatin Batra

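Affiliations: School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India

AI summary: For single-index models with heavy-tailed noise and adversarial corruption, proposes the first robust recovery algorithm based on convex-basin structure, achieving near-linear sample and time complexity.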

Comments Accepted at ICML 2026

Abstract

We study the problem of robustly learning Gaussian Single Index Models (SIMs) in the presence of heavy-tailed noise and a constant fraction of adversarially corrupted covariates and responses. Prior work on robust recovery has considered settings such as linear regression (Pensia et al., JASA 2024), strictly monotonic link functions (Awasthi et al., NeurIPS 2022), and phase retrieval (Buna and Rebeschini, AISTATS 2025). However, these techniques do not extend to generic asymmetric non-monotonic link functions such as GeLU and Swish, which arise naturally as scalar primitives in modern gated neural architectures. We close this gap by giving the first robust recovery algorithm with near-linear sample and time complexity for generic non-monotonic link functions, thereby establishing the first robust recovery guarantees for a broad family of nonlinear SIMs for which no guarantees were previously known. Our central contribution is a new structural understanding of the Gaussian squared-loss landscape under adversarial contamination. Crucially, we prove that for a broad class of nonlinear non-monotonic SIMs, a dimension-independent, constant-radius convex basin exists around the ground truth and is efficiently reachable via robust spectral initialization even under adversarial contamination. Prior works fail to establish both guarantees simultaneously, thereby either breaking down under adversarial contamination or failing to handle generic non-monotonic link functions. Together, these structural insights yield a principled warm start for robust gradient descent that provably converges to a final estimation error of $O(\sigma\sqrt{\varepsilon})$ in $\tilde{O}(nd)$ time with $\tilde{O}(d)$ samples, where $\varepsilon$ is the contamination fraction.

2605.29496 2026-05-29 cs.CL cs.CV

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng

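Affiliations: University of California, Los Angeles

AI summary: Diagnosis on synthetic tasks shows that post-training improves reasoning far more than perception. Under SFT this stems from perception occupying few tokens and thus receiving a weak training signal; under RL it stems from reward coupling. Dynamic loss reweighting and a perception-aware reward alleviate the imbalance and improve end-to-end performance.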

Comments Project: https://asymmetric-vlm-post-training.github.io/

Abstract

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
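
The token-imbalance argument for SFT can be made concrete with a small sketch. It shows a simple segment-balanced average, in which perception and reasoning tokens each receive the same total weight regardless of how many tokens they occupy. This is a static stand-in chosen for illustration; the abstract does not describe the authors' dynamic reweighting scheme, and the segment labels and function names are assumptions.

```python
def plain_token_loss(token_losses):
    """Standard SFT objective: uniform mean over all supervised tokens."""
    return sum(token_losses) / len(token_losses)


def balanced_token_loss(token_losses, segments):
    """Mean within each segment type, then mean across segment types.

    `segments` holds one hypothetical label per token, for example
    "perception" or "reasoning".
    """
    groups = {}
    for loss, seg in zip(token_losses, segments):
        groups.setdefault(seg, []).append(loss)
    return sum(sum(v) / len(v) for v in groups.values()) / len(groups)


# Two poorly fit perception tokens followed by eighteen well-fit reasoning tokens.
losses = [2.0, 2.0] + [0.5] * 18
labels = ["perception"] * 2 + ["reasoning"] * 18
uniform = plain_token_loss(losses)
balanced = balanced_token_loss(losses, labels)
```

Under the uniform mean, the two perception tokens carry 10% of the weight; under the balanced mean they carry half of it, so the perception error contributes far more to the objective.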

2605.29495 2026-05-29 cs.LG

On-Policy Replay for Continual Supervised Fine-Tuning

Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang

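Affiliations: Tsinghua University; Alibaba Group

AI summary: Proposes On-Policy Replay (OPR), which mitigates catastrophic forgetting in continual supervised fine-tuning by replaying high-quality responses generated by the model itself, significantly reducing forgetting across several large language models.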

Abstract

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals (training on the model's own outputs) reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7-8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget, a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42-46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.
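
The three steps of OPR (roll out, filter, replay) can be sketched as a data-construction routine. The callables `generate` and `reward`, the threshold rule, and the toy arithmetic task are all hypothetical stand-ins; the abstract says only that generations are filtered by a task reward, and the released repository may do this differently.

```python
def build_on_policy_replay(prompts, generate, reward, budget, threshold):
    """Collect (prompt, response) pairs from the current model that pass a reward filter.

    generate(prompt) -> response sampled from the most recent checkpoint (assumed).
    reward(prompt, response) -> float task reward (assumed).
    """
    replay = []
    for prompt in prompts[:budget]:  # only a small budget of historical prompts
        response = generate(prompt)
        if reward(prompt, response) >= threshold:
            replay.append((prompt, response))
    return replay


def toy_generate(prompt):
    # Stand-in for sampling from the latest checkpoint.
    canned = {"2+2": "4", "3+5": "9", "1+6": "7"}
    return canned[prompt]


def toy_reward(prompt, response):
    # Stand-in task reward: 1.0 when the arithmetic answer is correct.
    a, b = prompt.split("+")
    return 1.0 if int(a) + int(b) == int(response) else 0.0


history = ["2+2", "3+5", "1+6"]
replay = build_on_policy_replay(history, toy_generate, toy_reward, budget=3, threshold=1.0)
# replay == [("2+2", "4"), ("1+6", "7")]; the wrong answer to "3+5" is filtered out

new_task_examples = [("4+4", "8")]
sft_data = new_task_examples + replay  # replayed pairs are ordinary SFT examples
```

Because the replayed pairs enter training as plain examples, nothing in the loss or the model changes, which matches the abstract's claim of no teacher and no auxiliary objective.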

2605.29494 2026-05-29 cs.LG

Gradient Perturbation: Learning to Perturb Gradients for Adaptive Training

梯度扰动:学习扰动梯度以实现自适应训练

Hua Li

发表机构 * Henan University(河南大学)

AI总结 本文提出学习扰动梯度(LPG)方法,通过自适应地扰动类别级别的梯度实现类别感知训练,并建立统一框架揭示SAM、梯度裁剪等方法的梯度扰动本质,实验表明LPG在平衡/长尾分类和噪声标签学习中优于现有方法。

详情
AI中文摘要

深度神经网络训练涉及前向传播(从特征经logits到损失)和反向传播(从损失经梯度到参数更新)。尽管沿前向链的扰动(包括特征扰动、logit扰动和标签扰动)已被广泛研究,但反向链的梯度扰动却鲜有系统性的研究。在本文中,我们建立了一个统一的梯度扰动框架,揭示现有方法如锐度感知最小化(SAM)、梯度裁剪和梯度噪声注入都可以解释为施加特定形式的梯度扰动。类似于最近提出的Logit扰动学习(LPL),我们推测放大某一类别的梯度范数起到正增强作用(增强学习),而抑制它则起到负增强作用(抑制过拟合)。基于这些观察,我们提出学习扰动梯度(LPG),该方法自适应地在类别级别扰动logit梯度以实现类别感知训练。我们还通过PAC-Bayesian分析建立了梯度扰动边界与泛化保证之间的理论联系。在平衡分类、长尾分类和噪声标签学习上的实验表明,LPG一致优于现有方法,并且可以作为插件模块与它们结合使用。

英文摘要

Deep neural network training involves both forward propagation (from features through logits to loss) and backward propagation (from loss through gradients to parameter updates). While perturbations along the forward chain, including feature perturbation, logit perturbation, and label perturbation, have been extensively studied, the backward chain's gradient perturbation has received little systematic investigation. In this paper, we establish a unified framework for gradient perturbation, revealing that existing methods such as Sharpness-Aware Minimization (SAM), gradient clipping, and gradient noise injection can all be interpreted as imposing specific forms of gradient perturbation. Analogous to the recently proposed Logit Perturbation Learning (LPL), we conjecture that amplifying the gradient norm for a class acts as positive augmentation (enhancing learning), while dampening it acts as negative augmentation (suppressing overfitting). Based on these observations, we propose Learning to Perturb Gradients (LPG), which adaptively perturbs logit-level gradients at the class level to achieve category-aware training. We also establish theoretical connections between gradient perturbation bounds and generalization guarantees via PAC-Bayesian analysis. Experiments on balanced classification, long-tail classification, and noisy label learning demonstrate that LPG consistently outperforms existing methods and can be combined with them as a plug-in module.
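To make the idea of class-level logit-gradient perturbation concrete, the sketch below scales the standard cross-entropy gradient with respect to the logits (softmax minus one-hot) by a per-class factor. It is only an illustration under assumptions: the abstract does not give LPG's update rule, and in the paper the factors are learned adaptively, whereas `class_scale` here is a fixed, hypothetical vector.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def perturbed_logit_grads(logits, labels, class_scale):
    """Scale dL/dlogits of cross-entropy by a factor tied to each sample's class.

    class_scale[c] > 1 amplifies the gradient for class c (assumed "positive
    augmentation"); class_scale[c] < 1 dampens it (assumed "negative augmentation").
    """
    grads = []
    for row, y in zip(logits, labels):
        probs = softmax(row)
        grad = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
        grads.append([class_scale[y] * g for g in grad])
    return grads


# Two samples, three classes, uniform logits; hypothetical per-class factors.
demo_grads = perturbed_logit_grads(
    logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    labels=[0, 2],
    class_scale=[2.0, 1.0, 0.5],
)
```

With all factors equal to 1.0 the function reduces to the unperturbed gradient, which is one way to see it as a plug-in on top of an existing training loop.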

2605.29491 2026-05-29 cs.AI

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

帮助的诅咒:通过 DistractionIF 在干扰指令鲁棒性中的逆缩放定律

Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

发表机构 * Minzu University of China, Beijing, China(中央民族大学,北京,中国) Renmin University of China, Beijing, China(中国人民大学,北京,中国) Peking University, Beijing, China(北京大学,北京,中国)

AI总结 提出 DistractionIF 基准,发现大语言模型在参考文本中干扰指令的鲁棒性存在逆缩放现象,并通过 GRPO 强化学习提升鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地部署在智能体和检索增强生成(RAG)系统中,在这些系统中,它们必须对外部提供的参考文本执行用户指定的任务。实际上,这种上下文通常是非结构化的,并且包含良性的但类似指令的语义噪声,例如编辑评论和系统痕迹,这些应严格视为数据。我们引入了 DistractionIF,这是一个旨在评估对参考文本中此类干扰指令鲁棒性的基准。在广泛模型范围内,我们观察到一致的逆缩放现象:较大的模型通常鲁棒性较差,随着规模增加,性能下降多达 30 个百分点。从机制上讲,我们的困惑度分析表明,缩放侵蚀了鲁棒和受干扰行为之间的概率边界,使模型越来越倾向于将噪声过度解释为指令。为了解决这个问题,我们证明了强化学习,特别是群体相对策略优化(GRPO),可以恢复这一边界,在不损害通用指令遵循能力的情况下,将鲁棒性提高多达 15.5%。我们的发现突显了参考接地任务中关键的指令遵循鲁棒性差距,并确立了强化学习作为在大规模下强制严格数据-指令分离的有前途的途径。

英文摘要

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.

2605.29489 2026-05-29 cs.LG cs.SY eess.SY

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

访问集至关重要:为可扩展的权重空间模型合并预算专家读取

Yuanyi Wang, Yanggan Gu, Su Lu, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University, PolyU(香港理工大学,PolyU) Hong Kong Polytechnic University Daya Bay Technology and Innovation Research Institute(香港理工大学大亚湾技术创新研究院)

AI总结 针对大语言模型合并中专家权重读取的I/O瓶颈,提出MergePipe,一种预算感知的执行层,通过将合并问题转化为专家访问集问题,在显式I/O预算下选择要访问的专家增量块,实现高达11倍加速且参数偏差极小。

Comments ICML 2026 Workshop on Weight-Space Symmetries: from Foundations to Practical Applications

详情
AI中文摘要

权重空间模型合并通常被表述为检查点上的代数运算,然而在LLM规模下,限制性资源往往是必须读取的专家权重集。我们引入MergePipe,一种预算感知的执行层,将LLM合并转化为一个专家访问集问题:给定一个合并算子和一个共享权重坐标系中的检查点族,在显式I/O预算下选择要访问的专家增量块。MergePipe索引参数块,构建确定性访问计划,并通过可重放清单执行诱导的预算合并。该计划在构造上是预算合理的,并在全预算下恢复全读取合并;对于固定系数加法算子,省略更新的误差由省略增量的范数界定。在Qwen和Llama合并工作负载上,MergePipe将专家读取I/O减少多达一个数量级,并实现高达$11\times$的加速。代表性预算扫描显示,与全读取合并的参数偏差为$O(10^{-3})$,并且在下游基准测试上没有单调退化。

英文摘要

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an expert access-set problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
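The access-set view can be illustrated with a toy planner. The sketch below is an assumption-laden example, not MergePipe's algorithm: the `DeltaBlock` type, the greedy norm-per-byte ranking, and the tie-breaking rule are all hypothetical. It does show the two properties the abstract states for fixed-coefficient additive merges: the full-budget plan reproduces the full-read merge, and the deviation is bounded by the norms of the omitted deltas (by the triangle inequality).

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DeltaBlock:               # hypothetical record for one expert delta block
    expert: str
    block: int                  # index of the parameter block in the shared layout
    nbytes: int                 # read cost charged against the I/O budget
    delta: Tuple[float, ...]    # expert weights minus base weights


def l2(v) -> float:
    return math.sqrt(sum(x * x for x in v))


def plan_access(blocks: List[DeltaBlock], budget_bytes: int) -> List[DeltaBlock]:
    """Deterministic greedy plan: largest delta norm per byte first (assumed heuristic)."""
    ranked = sorted(blocks, key=lambda b: (-l2(b.delta) / b.nbytes, b.expert, b.block))
    chosen, used = [], 0
    for b in ranked:
        if used + b.nbytes <= budget_bytes:   # never exceed the budget
            chosen.append(b)
            used += b.nbytes
    return chosen


def merge(base: Dict[int, List[float]], blocks: List[DeltaBlock], coef: float):
    """Fixed-coefficient additive merge: base + coef * sum of accessed deltas."""
    out = {k: list(v) for k, v in base.items()}
    for b in blocks:
        for j, d in enumerate(b.delta):
            out[b.block][j] += coef * d
    return out


base = {0: [0.0, 0.0], 1: [0.0, 0.0]}
blocks = [
    DeltaBlock("A", 0, 8, (3.0, 4.0)), DeltaBlock("B", 0, 8, (0.3, 0.4)),
    DeltaBlock("A", 1, 8, (1.0, 0.0)), DeltaBlock("B", 1, 8, (0.0, 0.1)),
]
coef = 0.5
full = merge(base, blocks, coef)
plan = plan_access(blocks, budget_bytes=16)
budgeted = merge(base, plan, coef)
omitted = [b for b in blocks if b not in plan]
deviation = l2([full[k][j] - budgeted[k][j] for k in full for j in range(2)])
bound = coef * sum(l2(b.delta) for b in omitted)
```

In this toy case half of the reads are skipped while `deviation` stays below `bound`; a real planner would also have to account for block layout on disk and operator-specific coefficients.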

2605.29486 2026-05-29 cs.CL cs.AI cs.LG

PhoneWorld: Scaling Phone-Use Agent Environments

PhoneWorld: 扩展手机使用代理环境

Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu

发表机构 * Tencent Hunyuan(腾讯混元) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 提出PhoneWorld,一个可复用的管道,将真实GUI轨迹和截图转化为可控的手机使用环境、可执行任务、自动验证器和训练轨迹(rollout),从而规模化构建手机代理环境。

Comments work in progress

详情
AI中文摘要

手机使用代理的一个核心瓶颈是,覆盖真实移动行为的可控、可复现环境难以大规模构建。现有的移动代理基准在评估方面取得了重要进展,但它们本身并未提供一种可扩展的方式来构建许多新的手机使用环境。我们提出了PhoneWorld,一个可复用的管道,将真实的GUI轨迹和截图转化为可控的手机使用环境、可执行任务、自动验证器和训练轨迹(rollout)。PhoneWorld不是一次手动构建一个移动基准,而是利用真实轨迹来恢复哪些屏幕重要、屏幕如何连接、哪些交互必须改变环境状态、以及哪些用户目标可以自动验证。从这些信号中,它构建了由只读应用内容和可变状态支持的可运行模拟Android应用,然后从相同环境中派生出可执行任务、基于规则的验证器和训练轨迹(rollout)。在当前实例中,PhoneWorld覆盖了16个领域的34个应用,涵盖了常见的消费者移动行为,如搜索、浏览、购物、预订、媒体和社交互动。在固定的训练预算下,在基于AndroidWorld的基线中,将来自辅助AndroidWorld语料库的10K步替换为广泛的PhoneWorld监督,同时提升了所有四个评估基准,使HYMobileBench提高了17.7分,AndroidControl提高了6.0分,AndroidWorld提高了14.7分,PhoneWorld提高了52.5分。然后我们研究了两个额外的扩展问题:增加PhoneWorld监督量显著提高了PhoneWorld性能,并且在固定的PhoneWorld预算下,扩大应用覆盖范围带来了更大的收益。总体而言,PhoneWorld将焦点从一次构建一个移动基准转向了规模化供应手机使用环境本身。

英文摘要

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.

2605.29476 2026-05-29 cs.CL

Comparative Evaluation of Machine Translation Systems on Images with Text

含文本图像的机器翻译系统比较评估

Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta

发表机构 * ValgrAI - Valencian Graduate School and Research Network for Artificial Intelligence(ValgrAI - 瓦伦西亚人工智能研究生学院和研究网络)

AI总结 本研究比较评估了三种机器翻译范式(模块化流水线、多模态大语言模型和端到端模型Translatotron-V)在含文本图像翻译任务上的性能,发现多模态大语言模型表现最佳。

详情
AI中文摘要

本文对应用于包含文本信息的图像的机器翻译系统进行了比较评估,该任务位于计算机视觉和自然语言处理的交叉领域。研究比较了三种主要范式:分离文本检测、识别和翻译的模块化流水线;能够联合处理图像和文本的多模态大语言模型(MLLM);以及直接生成翻译图像的端到端模型Translatotron-V。模块化系统采用最先进的OCR(docTR)结合多语言LLM(如Llama和EuroLLM),而评估的MLLM包括Gemini 2.5的不同配置。实验在覆盖多种语言对的并行多语言数据集上进行,基于BLEU、chrF和TER指标进行评估。结果表明,模块化流水线优于端到端方法,而MLLM实现了最佳整体性能,展现出卓越的灵活性和上下文理解能力。这些发现强调了多模态推理在图像到文本翻译中的有效性,并为未来在多语言环境中整合视觉理解和语言生成的研究提供了坚实基础。

英文摘要

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing both image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.

2605.29471 2026-05-29 cs.CV

V2XCrafter: Learning to Generate Driving Scene Across Agents

V2XCrafter:学习生成跨智能体的驾驶场景

Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang

发表机构 * Hong Kong JC STEM Lab of Smart City(香港JC智能城市STEM实验室) City University of Hong Kong(香港城市大学) Lingnan University(岭南大学)

AI总结 提出V2XCrafter框架,通过渐进式多智能体扩散模型和跨智能体注意力模块,生成跨智能体相机视角的一致可控协作驾驶场景,以增强数据并提升下游协作3D目标检测性能。

详情
AI中文摘要

协作驾驶系统利用车联网(V2X)通信进行多智能体协作感知,以提升驾驶安全性,但仍受限于标注的真实世界V2X驾驶数据集稀缺以及在多样化驾驶条件下的泛化能力有限。虽然图像生成技术为数据增强提供了可行的解决方案,但现有针对单车辆多视角场景的方法在多智能体驾驶设置中面临两个基本挑战:(1)学习目标的扩展降低了生成质量;(2)跨智能体的高度动态变化阻碍了对联合观测对象物理属性(如颜色、类别)一致性的建模。为弥补这一差距,我们提出V2XCrafter,这是首个用于跨智能体相机视角生成可控且逼真的协作驾驶场景的框架。为了实现有效学习,我们基于单智能体骨干网络开发了一种渐进式多智能体扩散模型,利用相邻智能体的潜在状态作为参考信号,逐步引导从单智能体到多智能体的扩散过程。为解决跨车辆不一致性问题,我们提出了一个跨智能体注意力模块,该模块利用协作视图图和可学习的联合观测对象表示来建模动态的跨智能体相机视角关系。实验表明,V2XCrafter能够生成高保真且可控的街道视图,并保持跨智能体的一致性,从而有效提升下游协作3D目标检测任务的效果。

英文摘要

Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scenes across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representations to model the dynamic cross-agent camera view relationships. Experiments show that V2XCrafter generates high-fidelity and controllable street views that are consistent across agents, thereby improving downstream collaborative 3D object detection.

2605.29467 2026-05-29 cs.LG cs.AI

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

非共轭因子图的闭式变分推断组合

Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Bert de Vries

发表机构 * Eindhoven University of Technology, the Netherlands(埃因霍温理工大学,荷兰) Lviv Polytechnic National University, Lviv, Ukraine(利沃夫国立理工大学,利沃夫,乌克兰) Lazy Dynamics, Utrecht, the Netherlands(Lazy Dynamics,乌得勒支,荷兰)

AI总结 提出五种因子图原语,证明任意组合均支持闭式变分消息传递,并通过堆叠路由层实现通用函数逼近,应用于时间序列预测。

详情
AI中文摘要

将概率构建块堆叠成更深层次的架构通常会破坏闭式推断。我们证明闭式推断是可以保持的。我们识别了五种因子图原语:双线性因子、指数链接、Gamma先验、高斯似然和等式节点,并证明任何由它们组成的模型都允许闭式变分消息传递。这种构造之所以有效,是因为每个原语都保留了一小部分消息族:在平均场分解下,高斯变量上的消息保持高斯分布,精度变量上的消息保持Gamma分布,而唯一的非共轭接口——指数链接——通过高斯矩生成函数和Gamma族的充分统计量保持可处理性。我们展示了从静态集成到输入依赖门控再到分裂分支路由的递增深度组合,并表明堆叠路由层编码任意决策树,建立了具有闭式推断的通用函数逼近。应用于集成时间序列预测时,该框架产生了一个贝叶斯专家混合模型,其中门控函数是推断而非学习得到的,在五个基准数据集上提供了对专家选择的校准不确定性。

英文摘要

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.
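The tractability of the exponential link rests on a standard identity: for x ~ N(m, v), the moment-generating function gives E[exp(s·x)] = exp(s·m + s²·v/2) in closed form. The short check below compares that expression with a Monte Carlo estimate. It illustrates only this identity; how the paper wires it into its message-passing rules is not stated in the abstract, and the values of `mean`, `var`, and the sample size are arbitrary choices for the demonstration.

```python
import math
import random


def gaussian_mgf(mean: float, var: float, s: float = 1.0) -> float:
    """Closed-form E[exp(s * x)] for x ~ Normal(mean, var)."""
    return math.exp(s * mean + 0.5 * s * s * var)


def monte_carlo_mgf(mean: float, var: float, s: float = 1.0,
                    n: int = 200_000, seed: int = 0) -> float:
    """Sample-based estimate of the same expectation, for comparison only."""
    rng = random.Random(seed)
    std = math.sqrt(var)
    return sum(math.exp(s * rng.gauss(mean, std)) for _ in range(n)) / n


mean, var = 0.3, 0.25
closed_form = gaussian_mgf(mean, var)     # exp(0.3 + 0.125)
estimate = monte_carlo_mgf(mean, var)
```

Because the expectation is available analytically, a message that needs it requires no sampling or numerical quadrature, which is the sense in which such an interface stays closed-form.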

2605.29462 2026-05-29 cs.CV cs.AI

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

大型视觉语言模型在CFMME上的基准测试:一个全面的中文金融多模态评估数据集

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang

发表机构 * Qwen DianJin Team, Alibaba Cloud Computing(通义点金团队,阿里云计算)

AI总结 提出CFMME,一个包含6052个实例的中文金融多模态评估基准,涵盖八种主要金融图像模态和四项核心多模态任务,用于评估LVLMs在金融业务全流程中的感知、理解、推理和认知能力。

详情
AI中文摘要

大型视觉语言模型(LVLMs)的出现显著扩展了模型的能力,超越了仅文本理解,实现了跨视觉和文本模态的统一推理,并支持更广泛的实际应用。为了全面评估LVLMs在中文环境下整个金融业务流程中的感知、理解、推理和认知能力,我们引入了CFMME,一个新颖的中文金融多模态评估基准。CFMME包含6052个实例,涵盖从基础学术知识到复杂实际应用,涉及八种主要金融图像模态和四项核心多模态任务。在CFMME上,我们对代表性LVLMs进行了全面评估。结果表明,最先进的模型在问答任务上达到了66.11%的总体准确率,在检测、识别和信息提取任务上平均得分为77.18,表明当前LVLMs仍有很大的改进空间。此外,我们对错误原因、跨模态能力和多方向设置进行了详细分析,为未来研究提供了有价值的见解。我们希望CFMME能推动LVLMs的进一步进展,特别是在金融领域多个多模态任务上的性能提升。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances spanning from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.

2605.29461 2026-05-29 cs.CV

FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation

FlowSeg: 面向大语言模型条件分割的动态语义引导

Zekang Zhang, Guangyu Gao, Youyun Tang, ChengJing Wu, Xiaochao Qu, Chi Harold Liu, Jianbo Jiao, Yunchao Wei, Luoqi Liu, Ting Liu

发表机构 * School of Computer Science, Beijing Institute of Technology(北京理工大学计算机科学学院) School of Computer Science, University of Birmingham(伯明翰大学计算机科学学院) WEI Lab, Institute of Information Science, Beijing Jiaotong University(北京交通大学信息科学学院WEI实验室) Beijing Key Laboratory of Advanced Information Science(北京高级信息科学重点实验室)

AI总结 针对大语言模型条件分割中语义错位问题,提出FlowSeg方法,通过双向语义流动态引导掩码生成,实现语义对齐并达到最优性能。

Comments 18 pages, accepted by ICML 2026

详情
AI中文摘要

大语言模型条件分割最近通过将大语言模型与迭代掩码生成框架相结合而迅速发展。然而,我们在当前的“提议-选择”流程中发现了一个持续的失败模式。尽管通常能生成高质量的掩码候选,但最终预测可能无法匹配给定的语言条件。这种失败源于语言语义通常被用作静态提示或事后匹配信号,而不是参与迭代掩码生成过程。通过系统分析,我们表明许多错误源于语义错位而非掩码质量差。为解决此问题,我们提出FlowSeg,它通过在整个生成过程中引入中间解码状态与大语言模型导出的条件嵌入之间的双向语义流,实现动态语义引导。语言条件在每个阶段主动引导掩码细化,而条件嵌入则通过出现的视觉证据逐步更新。这种设计产生了语义基础的掩码表示和视觉对齐的语言条件,从而实现更可靠的匹配。我们进一步引入轻量级边界感知细化,以选择性增强不确定区域而不扰动置信内部。在指代表达分割和推理分割任务上的大量实验表明,FlowSeg持续改善语言-掩码对齐,并达到最先进的性能。项目页面:https://zkzhang98.github.io/FlowSeg_page

英文摘要

LLM-conditioned segmentation has recently advanced rapidly by coupling large language models with iterative mask generation frameworks. However, we identify a persistent failure mode in current propose-then-select pipelines. Although high-quality mask candidates are often generated, the final prediction may fail to match the given linguistic condition. This failure arises because language semantics are typically used as static prompts or post-hoc matching signals, rather than participating in the iterative mask generation process. Through systematic analysis, we show that many errors stem from semantic misalignment rather than poor mask quality. To address this issue, we propose FlowSeg, which introduces dynamic semantic guidance via a bidirectional semantic flow between intermediate decoding states and LLM-derived condition embeddings throughout the generation process. Language conditions actively guide mask refinement at each stage, while condition embeddings are progressively updated by emerging visual evidence. This design yields semantically grounded mask representations and visually aligned language conditions, enabling more reliable matching. We further incorporate a lightweight boundary-aware refinement to selectively enhance uncertain regions without perturbing confident interiors. Extensive experiments on referring expression segmentation and reasoning segmentation tasks demonstrate that FlowSeg consistently improves language-mask alignment and achieves state-of-the-art performance. Project page: https://zkzhang98.github.io/FlowSeg_page

2605.29460 2026-05-29 cs.CV

FedSmoothLoRA: Toward Smoother and Faster Convergence in Federated Low-Rank Adaptation

FedSmoothLoRA:面向更平滑和更快速收敛的联邦低秩适配

Zehao Wang, Guanglei Yang, Yihan Zeng, Hang Xu, Hongzhi Zhang, Wangmeng Zuo, Chun-Mei Feng

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Noah’s Ark Lab(华为诺亚方舟实验室) University College Dublin(都柏林大学学院)

AI总结 针对联邦低秩适配中更新空间有限、轮间状态不匹配和客户端无关起始状态的问题,提出FedSmoothLoRA框架,通过轮匹配矩阵和梯度对齐矩阵实现更平滑和更快速的收敛。

Comments 26 pages, 4 figures

详情
AI中文摘要

使用低秩适配(LoRA)对基础模型进行联邦微调提供了一种高效的解决方案,可在降低通信和计算成本的同时保持数据本地性。然而,FedAvg与LoRA的直接组合存在三个关键问题:有限的更新空间限制了模型的有效学习能力;轮间状态不匹配破坏了跨轮局部优化的连续性;以及客户端无关的起始状态减慢了客户端上的局部收敛。尽管最近的方法通过跨通信轮将LoRA更新合并到主干中缓解了有限更新空间问题,但轮间状态不匹配和客户端无关的起始状态仍未得到充分解决。为了解决这些问题,我们提出了FedSmoothLoRA,一个联邦LoRA微调框架,它保留了扩大的更新空间,改善了跨轮局部优化的连续性,并为局部训练提供了客户端感知的起始状态。在每个通信轮,FedSmoothLoRA使用两个矩阵构建局部LoRA初始化:一个轮匹配矩阵,用于保持跨轮局部状态连续性;以及一个梯度对齐矩阵,用于从局部数据估计的梯度信号提供客户端特定的优化指导。这些设计共同实现了更平滑和更快速的收敛。在图像分类和自然语言生成任务上的大量实验表明,FedSmoothLoRA始终优于现有的联邦LoRA微调方法。代码:https://github.com/wangzehao0704/FedSmoothLoRA

英文摘要

Federated fine-tuning of foundation models with Low-Rank Adaptation (LoRA) provides an efficient solution for reducing communication and computation costs while preserving data locality. However, the direct combination of FedAvg and LoRA suffers from three key issues: limited update space, which restricts the model's effective learning capacity; inter-round state mismatch, which disrupts cross-round local optimization continuity; and a client-agnostic starting state, which slows local convergence on clients. Although recent methods mitigate the limited update space issue by merging LoRA updates into the backbone across communication rounds, inter-round state mismatch and the client-agnostic starting state remain insufficiently addressed. To address these issues, we propose FedSmoothLoRA, a federated LoRA tuning framework that preserves the enlarged update space, improves cross-round local optimization continuity, and provides a client-aware starting state for local training. At each communication round, FedSmoothLoRA constructs the local LoRA initialization using two matrices: a Round-Matching matrix that preserves cross-round local state continuity, and a Gradient-Aligned matrix that provides client-specific optimization guidance from gradient signals estimated on local data. Together, these designs enable smoother and faster convergence. Extensive experiments on image classification and natural language generation tasks demonstrate that FedSmoothLoRA consistently outperforms existing federated LoRA tuning methods. Code: https://github.com/wangzehao0704/FedSmoothLoRA

2605.29459 2026-05-29 cs.CL cs.LG

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

Kronecker嵌入:用于参数高效语言模型的字节级结构化词元表示

Rohan Shravan

发表机构 * The School of AI(人工智能学院)

AI总结 提出Kronecker嵌入,通过字节级字符-位置确定性分解替代标准嵌入表,消除91-94%输入侧可训练参数,在多个实验中实现更低验证损失、更强拼写鲁棒性和运行时效率。

Comments 28 pages, 16 tables. Reference implementation: https://github.com/theschoolofai/kronecker-embeddings

详情
AI中文摘要

大型语言模型通过一个形状为|V| x d_model的可学习嵌入表路由每个输入,在前沿规模下消耗数亿到数十亿的可训练参数。我们引入Kronecker嵌入,一种确定性的字节级字符-位置分解,用固定编码器和单个可学习投影替换该表,与标准BPE分词器兼容,在前沿规模下消除91-94%的输入侧可训练参数。我们提供五项贡献。第一,跨六个LM(135M-671B参数)的模型探针显示,训练后的输入嵌入将探针词的印刷变体聚类程度远高于形态学相关词;Kronecker在嵌入层避免了这种聚类。第二,在FineWeb-Edu上对nanoGPT GPT-2 124M进行2.5B词元的三种子受控比较显示,Kronecker达到比BPE绑定基线低2.5±0.2%的验证损失(差距0.083±0.007 nats,约9%更低的困惑度),达到BPE收敛损失所需的步数减少约1.43倍。第三,在110个干净/拼写错误对上的拼写鲁棒性探针显示,Kronecker在55.5%的对上保持top-1预测,而BPE为47.3%(+8.2个百分点),并将KL降低7.6%,在11个类别中赢得或平局10个;生成探针显示Kronecker在生成中回显字节新颖字符串和拼写错误,而BPE则遗忘它们。第四,BPE嵌入范数在训练期间漂移,而Kronecker投影范数保持在1.0附近,与稳定的表示目标一致。第五,一种即时运行时变体从4.5 MB的字节缓冲区重建嵌入,而不是从词汇量为131,072的2.15 GB表中重建,步长时间开销为0.01-0.24%。字节级局部性存在权衡:字节相似但语义距离远的对(compute/commute, nation/notion)聚类在一起,将消歧转移到早期注意力层。

英文摘要

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte-level character-position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91-94% of input-side trainable parameters at frontier scale. We provide five contributions. First, a cross-model probe across six LMs (135M-671B parameters) shows trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three-seed comparison on nanoGPT GPT-2 124M over 2.5B tokens of FineWeb-Edu shows Kronecker reaching 2.5 ± 0.2% lower validation loss than the BPE-tied baseline (gap 0.083 ± 0.007 nats, ~9% lower perplexity), needing ~1.43x fewer steps to reach BPE's converged loss. Third, a spelling-robustness probe over 110 clean/typo pairs shows Kronecker preserves the top-1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte-novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on-the-fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01-0.24% step-time overhead. Byte-level locality has a tradeoff: byte-similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.
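Two of the figures in this abstract can be sanity-checked with simple arithmetic. The sketch below is a worked example under assumptions: the abstract gives the vocabulary size (131,072) but not the embedding width or dtype, so `d_model = 4096` and 4-byte (fp32) parameters are guesses chosen because they reproduce the quoted 2.15 GB. The perplexity line uses the standard relation perplexity = exp(loss in nats).

```python
import math


def embedding_table_bytes(vocab_size: int, d_model: int, bytes_per_param: int) -> int:
    """Memory footprint of a dense |V| x d_model embedding table."""
    return vocab_size * d_model * bytes_per_param


# Assumed width and dtype (not stated in the abstract).
table_bytes = embedding_table_bytes(vocab_size=131_072, d_model=4096, bytes_per_param=4)
table_gb = table_bytes / 1e9                 # decimal gigabytes

# A loss gap of 0.083 nats corresponds to a perplexity ratio of exp(0.083).
loss_gap_nats = 0.083
ppl_ratio = math.exp(loss_gap_nats)          # baseline perplexity / Kronecker perplexity
```

Under these assumptions the table occupies 2,147,483,648 bytes, which rounds to the quoted 2.15 GB, and the perplexity ratio is roughly 1.087, in line with the abstract's rounded "~9%" figure.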

2605.29458 2026-05-29 cs.CL cs.AI

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

面向LLM人格模拟的自适应访谈:基于证据的推理提升决策对齐

Ruoxi Su, Yuhan Liu, Jingyu Hu

发表机构 * University of Cambridge(剑桥大学) Independent Researcher(独立研究员)

AI总结 提出自适应访谈框架,通过结构化三阶段对话收集人格相关信息,并基于访谈记录评估LLM在道德困境场景中模拟个体决策的能力,发现基于后续追问的证据推理能显著提升预测准确性。

Comments 20 pages, 2 figures, 12 tables

详情
AI中文摘要

准确模拟特定个体的决策对大型语言模型(LLM)仍然具有挑战性,部分原因在于人格信息通常以静态描述形式提供,缺乏个体层面决策模拟所需的价值观、经历和情境线索。我们提出一种自适应访谈框架,通过结构化的三阶段对话收集人格相关信息:核心问题、动态追问和综合人格总结。利用生成的访谈记录,我们评估LLM能否模拟参与者在道德困境场景中的决策。我们比较了三种对话情境——核心10个问题回答、完整访谈对话以及总结性人格表征。结果发现,自适应访谈并非作为统一的准确性增强器,而更像是一种选择性接地机制:约40%的完整访谈轨迹中融入了基于追问的证据,且这些基于追问的预测比仅基于核心问题的预测更准确(45.5% vs. 39.3%)。这些发现强调,仅靠更丰富的人格背景是不够的:只有当模型真正将其决策基于用户特定证据时,改进才会出现。

英文摘要

Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts: Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.

2605.29455 2026-05-29 cs.CV eess.SP

Uni-RCM: Unified Reference-guided Cross-modal Mapping for Multi-Class Anomaly Detection

Uni-RCM:面向多类异常检测的统一参考引导跨模态映射

Yangchen Wu, Huiqiang Xie

发表机构 * School of Information Science and Technology, Jinan University(信息科学与技术学院,暨南大学)

AI总结 提出Uni-RCM框架,通过参考引导块和离线残差量化器,实现多类工业异常检测的统一建模,在MVTec-3D AD数据集上达到最优性能。

Comments This work has been submitted to the IEEE for potential publication

详情
AI中文摘要

多模态工业异常检测通常依赖于每个产品类别的单独模型,从根本上限制了实际可扩展性。当转向同时处理多种类别的统一范式时,由于类间干扰和特征流形混淆,检测精度往往会下降。为了克服这些挑战,我们提出了一个统一的参考引导跨模态映射框架,命名为Uni-RCM。其核心是,我们提出了一个参考引导块,通过引入可学习的参考特征来动态过滤特定类别的噪声,该参考特征捕捉了不同模态之间的共性。此外,我们提出了一个离线残差量化器,通过多个级联码本来表征正态分布。在MVTec-3D AD数据集上的大量评估表明,在具有挑战性的多类设置以及图像级检测和像素级定位方面,该方法达到了最先进的性能。

英文摘要

Multi-modal industrial anomaly detection typically relies on separate models for each product category, fundamentally limiting practical scalability. When shifting to a unified paradigm that handles diverse classes simultaneously, detection accuracy often degrades due to inter-class interference and feature manifold confusion. To overcome these challenges, we propose a Unified Reference-guided Cross-modal Mapping framework, named Uni-RCM. At its core, we propose a reference guide block that dynamically filters out category-specific noise by introducing a learnable reference feature, which captures the commonalities across different modalities. In addition, an offline residual quantizer is proposed to characterize the normal distribution with multiple cascaded codebooks. Extensive evaluations on the MVTec-3D AD dataset demonstrate state-of-the-art performance in the challenging multi-class setting, for both image-level detection and pixel-level localization.
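Residual quantization with cascaded codebooks, in its generic form, can be sketched in a few lines: each stage picks the nearest code for whatever the previous stages left unexplained. This is a minimal illustration of the general technique, not Uni-RCM's quantizer. The codebooks below are hand-written toy values, and using the final residual norm as an anomaly score is an assumption made for the example.

```python
import math


def nearest(codebook, vec):
    """Return the code with the smallest squared Euclidean distance to vec."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, vec)))


def residual_quantize(vec, codebooks):
    """Quantize vec stage by stage; return the chosen codes and the leftover residual."""
    residual = list(vec)
    picks = []
    for codebook in codebooks:
        code = nearest(codebook, residual)
        picks.append(code)
        residual = [r - c for r, c in zip(residual, code)]
    return picks, residual


def residual_norm(vec, codebooks):
    """Assumed anomaly score: how much of vec the 'normal' codebooks fail to explain."""
    _, residual = residual_quantize(vec, codebooks)
    return math.sqrt(sum(r * r for r in residual))


# Toy codebooks standing in for ones fitted offline on normal features.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],      # coarse stage
    [(0.0, 0.0), (0.1, -0.1)],     # fine stage refines the coarse residual
]
normal_score = residual_norm((1.1, 0.9), codebooks)
anomalous_score = residual_norm((3.0, -1.0), codebooks)
```

A feature close to the normal codes is almost fully explained by the cascade, while an out-of-distribution feature leaves a large residual.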

2605.29454 2026-05-29 cs.LG

A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning

用于评估机器学习中成员推断攻击的全流程框架

Ding Chen, Xinwen Cheng, Xuyang Zhong, Xinping Chen, Xiaolin Huang, Chen Liu

发表机构 * City University of Hong Kong(香港城市大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出一个涵盖数据、架构、算法和后训练模块的全流程评估框架,系统分析不同上下文对成员推断攻击效果的影响,并通过标准化威胁模型和互补指标提供实用指南。

详情
AI中文摘要

虽然成员推断攻击(MIAs)是识别训练数据的主流方法,但其应用已扩展到隐私审计和机器遗忘。然而,该领域缺乏一个系统性的框架来评估不同上下文如何影响MIA的效果。没有这样的特征描述,实践者可能会部署在基准测试中表现良好但在面对特定真实世界数据集的细微差别时变得统计上无关的算法。为了弥合这一差距并提供可操作的见解,我们引入了一个全面的评估框架,该框架系统地描述了整个机器学习流程(包括数据、架构、算法和后训练模块)中的隐私风险。我们的框架旨在固有地捕捉多样化的操作上下文,严格评估了在广泛训练配置下的最先进MIA。为了考虑真实世界部署中不同的误分类成本,我们采用了三个互补指标:对称成本下的平衡准确率,以及低FPR下的TPR(或低FNR下的TNR)用于严格惩罚误报或漏检的非对称场景。此外,认识到现有MIA假设不同的对手能力,我们形式化了两种标准化的威胁模型,并将这些攻击调整为相应的变体,以确保公平的基准测试。大量的实证评估表明,特定MIA方法的效果高度依赖于假设的威胁模型和选择的评估指标。最终,我们将这些发现提炼为可操作的指南,并提供一个即用的审计工具包,使实践者能够进行更好的隐私评估。

英文摘要

While Membership Inference Attacks (MIAs) are the prevailing method for identifying training data, their application has expanded into privacy auditing and machine unlearning. Nevertheless, the field lacks a systematic framework for evaluating how different contexts affect MIA efficacy. Without such a characterization, practitioners risk deploying algorithms that perform well on benchmarks but become statistically irrelevant when faced with the nuances of specific, real-world datasets. To bridge this gap and provide actionable insights, we introduce a comprehensive evaluation framework that systematically characterizes privacy risks across the entire machine learning pipeline, spanning data, architectures, algorithms, and post-training modules. Designed to inherently capture diverse operational contexts, our framework rigorously evaluates state-of-the-art MIAs across a broad spectrum of training configurations. To account for varying misclassification costs in real-world deployments, we employ three complementary metrics: Balanced Accuracy for symmetric costs, alongside TPR at low FPR (or TNR at low FNR) for asymmetric scenarios where false alarms or missed detections are strictly penalized. Furthermore, recognizing that existing MIAs assume divergent adversary capabilities, we formalize two standardized threat models and adapt these attacks into corresponding variants to ensure an equitable benchmark. Extensive empirical evaluations demonstrate that the efficacy of specific MIA methodologies is highly sensitive to the assumed threat models and chosen evaluation metrics. Ultimately, we distill these findings into actionable guidelines and provide a ready-to-use auditing toolkit, empowering practitioners to conduct better privacy assessments.
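The metrics named in this abstract can be computed from attack scores and ground-truth membership labels as follows. This is a generic sketch rather than the paper's toolkit: the convention that a higher score means "more likely a member", the function names, and the toy data are all assumptions.

```python
def rates(scores, is_member, threshold):
    """Return (TPR, FPR) when predicting 'member' for scores >= threshold."""
    tp = sum(1 for s, m in zip(scores, is_member) if m and s >= threshold)
    fp = sum(1 for s, m in zip(scores, is_member) if not m and s >= threshold)
    pos = sum(1 for m in is_member if m)
    neg = len(is_member) - pos
    return tp / pos, fp / neg


def balanced_accuracy(scores, is_member, threshold):
    """Mean of TPR and TNR: suited to symmetric misclassification costs."""
    tpr, fpr = rates(scores, is_member, threshold)
    return 0.5 * (tpr + (1.0 - fpr))


def tpr_at_fpr(scores, is_member, max_fpr):
    """Best TPR over thresholds whose FPR stays within max_fpr (false alarms are costly)."""
    best = 0.0
    for t in sorted(set(scores)):
        tpr, fpr = rates(scores, is_member, t)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best


# Toy attack scores: four training members followed by four non-members.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
is_member = [True, True, True, True, False, False, False, False]
bal_acc = balanced_accuracy(scores, is_member, threshold=0.5)
strict_tpr = tpr_at_fpr(scores, is_member, max_fpr=0.0)
```

TNR at low FNR, the third metric, follows by swapping the roles of members and non-members in `tpr_at_fpr`.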

2605.29453 2026-05-29 cs.LG cs.AI

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

Qian Chang, Ciprian Doru Giurcaneanu, Runsong Jia, Xia Li, Guoping Hu, Xiufeng Cheng, Jinqing Yang, Mengjia Wu, Yi Zhang

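Affiliations University of Auckland, Auckland, New Zealand; University of Technology Sydney, Sydney, Australia; Central China Normal University, Wuhan, China

AI summary Proposes the Dual-Scale Retentive Dynamics (DSRD) framework, which achieves stronger generalization in dynamic graph representation learning through a unified temporal-structural adaptation mechanism and learnable decay kernels.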

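Details
AI-generated abstract (translated from Chinese)

Representation learning on dynamic graphs must capture complex dependencies that co-evolve over time and structure. Existing methods usually rely on fixed temporal decay schemes or predefined structural propagation depths, which limits their generalization to graphs with different interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation in a single recurrent formulation; and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention according to the underlying interaction patterns. We provide a theoretical analysis that establishes the equivalence between event-wise parallel aggregation and efficient recurrent state updates, together with stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks show that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification, and generalizes strongly in transductive and inductive settings.

English abstract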

Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing approaches typically adopt fixed temporal decay schemes or predetermined structural propagation depths, limiting their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention based on the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, as well as stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks demonstrate that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification tasks, with strong generalization across transductive and inductive settings.
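
The claimed equivalence between event-wise parallel aggregation and a recurrent state update can be illustrated with a plain exponential decay kernel. This is only a sketch of the general idea, not DSRD itself: the fixed scalar `decay` stands in for the paper's learnable time-sensitivity parameters, structural propagation is omitted, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Event = Tuple[float, List[float]]  # (timestamp, feature vector), sorted by time


def parallel_state(events: Sequence[Event], t_query: float, decay: float) -> List[float]:
    """Event-wise form: sum_i exp(-decay * (t_query - t_i)) * x_i."""
    dim = len(events[0][1])
    state = [0.0] * dim
    for t_i, x_i in events:
        weight = math.exp(-decay * (t_query - t_i))
        for k in range(dim):
            state[k] += weight * x_i[k]
    return state


def recurrent_state(events: Sequence[Event], t_query: float, decay: float) -> List[float]:
    """Recurrent form: shrink the running state by the elapsed time, then add the event."""
    dim = len(events[0][1])
    state = [0.0] * dim
    t_prev = events[0][0]
    for t_i, x_i in events:
        shrink = math.exp(-decay * (t_i - t_prev))
        state = [shrink * s + x for s, x in zip(state, x_i)]
        t_prev = t_i
    # Decay from the last event up to the query time.
    shrink = math.exp(-decay * (t_query - t_prev))
    return [shrink * s for s in state]


# Hypothetical interaction events for one node.
events = [(0.0, [1.0, 0.0]), (1.5, [0.0, 2.0]), (4.0, [1.0, 1.0])]
print(parallel_state(events, t_query=5.0, decay=0.3))
print(recurrent_state(events, t_query=5.0, decay=0.3))
```

The two functions agree up to floating-point rounding because exponential weights factorise over consecutive time gaps. The recurrent form touches each event once and keeps only a fixed-size state, which is what makes streaming updates cheap.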

2605.29452 2026-05-29 cs.CV

Comparative evaluation of photogrammetric reconstruction methods and 3D Gaussian Splatting for road surface roughness analysis

Marouane Elmegdar, Teng Xiao

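Affiliations School of International Education, Hubei University of Technology; School of Computer Science, Hubei University of Technology

AI summary This study compares four reconstruction methods (COLMAP, Meshroom, Metashape, and 3D Gaussian Splatting) on their ability to estimate road surface roughness from smartphone images. The results show that COLMAP is the most sensitive to micro-texture and that open-source methods are suitable for low-cost pavement monitoring.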

Comments accepted by RSMIP 2026

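Details
AI-generated abstract (translated from Chinese)

Image-based 3D reconstruction offers a low-cost alternative to traditional sensor-based pavement assessment techniques. This study compares four reconstruction pipelines, COLMAP, Meshroom, Metashape, and 3D Gaussian Splatting (3DGS), to assess their ability to estimate road surface roughness from smartphone images. All point clouds were processed in CloudCompare with a consistent workflow comprising orientation alignment, segmentation, normal estimation, and roughness computation at neighborhood radii of 0.2, 0.4, and 0.6 model units. The results show that COLMAP is the most sensitive to micro-texture, while Meshroom produces balanced reconstructions with moderate roughness variation. Metashape generates the smoothest geometry because of its internal filtering, and 3DGS captures visible irregularities but shows higher noise and lower density. The comparison indicates that open-source pipelines can be used for relative roughness assessment, providing a practical approach to low-cost pavement monitoring.

English abstract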

Image-based 3D reconstruction offers a low-cost alternative to traditional sensor-based techniques for road surface assessment. This study compares four reconstruction pipelines--COLMAP, Meshroom, Metashape, and 3D Gaussian Splatting (3DGS)--to evaluate their ability to estimate road surface roughness from smartphone imagery. All point clouds were processed in CloudCompare using a consistent workflow involving orientation alignment, segmentation, normal estimation, and roughness computation at neighborhood radii of 0.2, 0.4, and 0.6 model units. The results show that COLMAP provides the highest sensitivity to micro-texture, while Meshroom yields balanced reconstructions with moderate roughness variation. Metashape produces the smoothest geometry due to its internal filtering, and 3DGS captures visible irregularities but exhibits higher noise and lower density. The comparison demonstrates that open-source pipelines are viable for relative roughness evaluation, offering a practical approach for low-cost pavement monitoring.
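
A per-point roughness value of the kind computed in this workflow can be sketched as the distance from a point to a plane fitted through its neighbours within a given radius. The code below is a simplified stand-in, not CloudCompare's implementation: it assumes the surface is roughly horizontal after orientation alignment (so a least-squares fit of z = a*x + b*y + c is adequate), includes the query point among its own neighbours, and uses a brute-force neighbour search. All names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points: Sequence[Point]) -> Optional[Tuple[float, float, float]]:
    """Least-squares fit of z = a*x + b*y + c; None if the fit is degenerate."""
    n = len(points)
    if n < 3:
        return None
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)
    normal_matrix = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(n)]]
    rhs = [sxz, syz, sz]
    det = _det3(normal_matrix)
    if abs(det) < 1e-12:  # e.g. all neighbours on one line
        return None
    coeffs = []
    for col in range(3):  # Cramer's rule, one unknown per column
        m = [row[:] for row in normal_matrix]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(_det3(m) / det)
    return coeffs[0], coeffs[1], coeffs[2]


def roughness(cloud: Sequence[Point], index: int, radius: float) -> Optional[float]:
    """Distance from cloud[index] to the plane fitted on its radius neighbourhood."""
    p = cloud[index]
    neighbours = [q for q in cloud if math.dist(p, q) <= radius]
    plane = fit_plane(neighbours)
    if plane is None:
        return None
    a, b, c = plane
    return abs(a * p[0] + b * p[1] + c - p[2]) / math.sqrt(a * a + b * b + 1.0)


# Hypothetical 5x5 flat patch with one raised point in the middle.
cloud: List[Point] = [(i * 0.1, j * 0.1, 0.0) for i in range(5) for j in range(5)]
cloud[12] = (0.2, 0.2, 0.05)
print(roughness(cloud, 0, radius=0.18), roughness(cloud, 12, radius=0.18))
```

Because the radius sets how many neighbours enter the plane fit, a larger radius averages over more of the surface, which is one reason to report roughness at several radii as the study does.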

2605.29448 2026-05-29 cs.LG cs.AI cs.CV cs.IT math.IT

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

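Affiliations Department of Electrical & Computer Engineering; Paul G. Allen School of Computer Science & Engineering; University of Washington

AI summary This paper unifies neural scaling laws and the Vendi Score through submodularity theory, proposes matrix spectral functions as a generalized data appraisal framework, and develops a fast optimization algorithm based on secular equations that achieves a speedup of about 35,000x at ImageNet-1K scale. Experiments show that the facility location function is the best at predicting subset value.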

Comments 75 pages

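Details
AI-generated abstract (translated from Chinese)

Neural scaling laws appraise data by dataset size, whereas the Vendi Score measures dataset value using quantum entropy. We prove that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives, called matrix spectral functions, which also includes determinantal point process (DPP) objectives and many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a family of practical data appraisal objectives. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This gives an average empirical speedup of about 35,000x and makes direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. We then compare how well several objectives, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants, predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed-training-budget conditions. Across multiple datasets, facility location performs best. Direct optimization also reveals that, although the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor proxy for downstream performance. We also find that fixed-size subsets selected uniformly at random, whether class-balanced or not, are markedly concentrated in both appraisal score and held-out performance. Finally, we show that size, class balance, and training budget alone do not determine data value: even when these factors are controlled, performance varies smoothly from good to poor.

English abstract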

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions, a class that also includes determinantal point process (DPP) objectives as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets drawn uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
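
Facility location, the objective the abstract reports as performing best, together with the greedy optimization it mentions, can be sketched as follows. This is the textbook form f(S) = sum over items i of the maximum similarity between i and any selected j, not the authors' code: the similarity matrix is a hypothetical toy example with nonnegative entries, and the naive greedy loop makes no attempt at the paper's speedups.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: Sequence[int]) -> float:
    """f(S) = sum_i max_{j in S} sim[i][j], with f(empty set) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick k items, each time adding the one with the largest marginal gain."""
    n = len(sim)
    chosen: List[int] = []
    cover = [0.0] * n  # best similarity of each item to the chosen set so far
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in chosen:
                continue
            gain = sum(max(sim[i][j] - cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        chosen.append(best_j)
        cover = [max(cover[i], sim[i][best_j]) for i in range(n)]
    return chosen


# Hypothetical similarities: items 0-2 form a cluster around item 1, item 3 is an outlier.
sim = [
    [1.0, 0.8, 0.5, 0.1],
    [0.8, 1.0, 0.8, 0.1],
    [0.5, 0.8, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(greedy_select(sim, 2))  # [1, 3]
```

Greedy first takes the cluster centre and then the outlier, since a second item from the same cluster adds little. That shrinking marginal gain is the diminishing-returns property that submodularity formalises.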