arXivDaily: daily arXiv digest, updated Monday through Friday
2606.05858 2026-06-05 cs.CL

ReverseEOL: Improving Training-free Text Embeddings via Text Reversal in Decoder-only LLMs

Ailiang Lin, Zhuoyun Li, Yusong Wang, Keyu Mao, Kotaro Funakoshi, Manabu Okumura

Affiliations: Institute of Science Tokyo; Tencent

AI summary: Proposes ReverseEOL, which derives a complementary embedding from the reversed input text and combines it with the forward embedding to strengthen text representations from frozen decoder-only LLMs, significantly improving training-free baselines on the STS and MTEB benchmarks.

Abstract

Recent advances in Large Language Models (LLMs) have opened new avenues for generating training-free text embeddings. However, the causal attention in decoder-only LLMs prevents earlier tokens from attending to future context, leading to biased contextualized representations. In this work, we propose Reverse prompting with Explicit One-word Limitation (ReverseEOL), a simple yet effective method for enhancing the representational capability of frozen LLMs. ReverseEOL augments the standard forward embedding with an additional reversed embedding derived from the reversed input text. Since reversing the input exposes each token to context inaccessible in the original order, the resulting reversed embedding effectively provides complementary information to the original one. As a result, combining the forward and reversed embeddings yields a richer final representation. Comprehensive experiments on STS and MTEB benchmarks demonstrate that ReverseEOL significantly improves the performance of existing training-free baselines across a broad range of LLMs with diverse architectures and scales. Extensive ablations and analyses further confirm the necessity of our reversal mechanism.
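
A minimal sketch of the forward-plus-reversed combination described in the abstract. The abstract does not state the reversal granularity or the combination rule, so word-level reversal and simple averaging are assumptions here, and `toy_embed` is a hypothetical, order-sensitive stand-in for the embedding a frozen LLM would produce.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Hypothetical stand-in for a frozen LLM embedding (order-sensitive)."""
    vec = [0.0] * dim
    for pos, word in enumerate(text.split()):
        # Mixing the position into the hash makes the vector depend on word order.
        digest = hashlib.sha256(f"{pos}:{word}".encode()).digest()
        for i in range(dim):
            vec[i] += digest[i] / 255.0 - 0.5
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def reverse_words(text):
    """Assumed reversal granularity: whitespace-separated words."""
    return " ".join(reversed(text.split()))


def combined_embedding(text):
    """Average the forward embedding with the embedding of the reversed text."""
    forward = toy_embed(text)
    backward = toy_embed(reverse_words(text))
    return [(f + b) / 2.0 for f, b in zip(forward, backward)]


embedding = combined_embedding("the cat sat on the mat")
```

With a real model, `toy_embed` would be replaced by a prompt-based forward pass; the rest of the pipeline would stay the same under these assumptions.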

2606.05857 2026-06-05 cs.CL

Forgive or forget: Understanding the context of hate in audio retrieval systems

Arghya Pal, Sailaja Rajanala, Raphael C.-W. Phan, Shekhar Nayak

Affiliations: University of California, Berkeley

AI summary: Proposes a post hoc causal debiasing framework with a sentiment-controlled mediator that suppresses harmful speech while preserving semantic relevance; experiments show consistent toxicity reduction with minimal loss in retrieval accuracy.

Abstract

Handling toxic retrieval in text-to-audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing framework with a sentiment-controlled mediator to preserve semantic relevance while suppressing harmful speech. Our approach is model-agnostic and integrates seamlessly with existing retrieval pipelines. We introduce two variants: Forgive, which re-ranks and filters toxic audio via logit adjustment, and Forget, which generates counterfactual toxic prompts to mitigate harmful retrievals. Experiments show consistent toxicity reduction with minimal loss in retrieval accuracy, improving both safety and reliability.
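
The Forgive variant is described as re-ranking and filtering via logit adjustment. A hedged sketch of that general idea follows; the scoring rule `relevance - lam * toxicity`, the filter threshold, and all names are illustrative assumptions, not the authors' formulation.

```python
def rerank(candidates, lam=2.0, max_toxicity=0.8):
    """Filter and re-rank retrieved clips.

    candidates: list of (clip_id, relevance_logit, toxicity_prob) tuples.
    lam and max_toxicity are hypothetical hyperparameters.
    """
    # Filter: drop clips whose toxicity estimate exceeds the threshold.
    kept = [c for c in candidates if c[2] <= max_toxicity]
    # Re-rank: penalize the relevance logit in proportion to toxicity.
    return sorted(kept, key=lambda c: c[1] - lam * c[2], reverse=True)


clips = [("a", 3.0, 0.9), ("b", 2.5, 0.1), ("c", 2.8, 0.5)]
ranking = [clip_id for clip_id, _, _ in rerank(clips)]
```

Here clip `a` is filtered out despite having the highest raw relevance, and `b` outranks `c` once the toxicity penalty is applied.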

2606.05852 2026-06-05 cs.SD cs.AI eess.AS

UniVoice: A Unified Model for Speech and Singing Voice Generation

Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

Affiliations: Giant Network; Shanghai Conservatory of Music

AI summary: Proposes UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. It factorizes the condition into content, melody, and timbre and introduces a null melody token, so a single model generates both natural speech and controllable singing.

Comments: 9 pages, 2 figures

Abstract

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).

2606.05848 2026-06-05 cs.RO

Visuotactile and Explicitly Force-Controlled Robotic Ultrasound for Abdominal Volumetric Reconstruction

Adrian Piedra, R Brooke Jeffrey, Oussama Khatib

Affiliations: Stanford Robotics Laboratory, Computer Science Department, Stanford University; Department of Radiology, School of Medicine, Stanford University

AI summary: Proposes a robotic ultrasound acquisition system that combines stereo vision, tactile feedback, and expert strategies; a force-controlled manipulator performs adaptive abdominal scans and enables three-dimensional volumetric reconstruction to enhance diagnostic capability.

Abstract

In this paper, we present a robotic ultrasound acquisition system that integrates stereo vision, touch-based feedback, and expert-informed strategies to perform autonomous and adaptive abdominal scans. The system records freehand motion and force data from expert radiologists, creating a framework to capture transducer motion, applied forces, and anatomical scanning strategies. This expert data is replayed to replicate characteristic scans with the robot, forming a foundation for further autonomous capabilities. Using stereo vision, the system generates three-dimensional topography maps of the patient's abdomen, which are refined through stiffness measurements at key points to delineate the rib cage boundary. These combined techniques enable the robot to execute two distinct scanning paths: an upward-angled sweep beneath the rib cage to visualize structures near the upper abdomen and a perpendicular sweep across soft tissue regions. A compliant, torque-controlled seven degree-of-freedom robotic manipulator is controlled to maintain consistent probe contact through closed-loop force control over the varied anatomical surfaces. Physical experiments demonstrate that the system achieves high-quality imaging comparable to expert scans while dynamically adapting to patient-specific topographies. Furthermore, the robotic system surpasses expert capabilities by enabling three-dimensional volume acquisition, which enhances diagnostic potential and provides volumetric data for advanced analyses. This work highlights the integration of expert knowledge into autonomous robotic systems and underscores the potential of combining perception-based autonomy with physical reasoning for enhanced diagnostic performance.

2606.05847 2026-06-05 cs.AI

Agentic Molecular Recovery via Molecule-Aware Exploration

Suwan Yoon, Changhee Lee

Affiliations: Department of Artificial Intelligence, Korea University

AI summary: Targets invalid SMILES in text-guided molecular generation. Proposes AMREC, which uses molecule-aware mismatch tracking, expanded candidate exploration, and trajectory-level selection to restore chemical validity while preserving target-relevant structural cues and molecular identity.

Comments: Preprint

Abstract

Text-guided molecular generation with LLMs often yields invalid SMILES. We argue that invalid drafts should be addressed through a shift from validity-oriented repair to identity-preserving molecular recovery: the objective is not only to restore chemical validity, but also to preserve target-relevant structural cues and recover the molecular identity implied by the description. This perspective reveals the limitations of existing correction strategies. Post-hoc repair can recover validity while distorting key structures, LLM-only correction can introduce unintended global drift, and generic agentic correction remains constrained by greedy single-candidate trajectories even when equipped with executable RDKit edit tools. To address these limitations, we propose AMREC, which couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection. On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile across structural, exact-match, and string-level metrics.

2606.05846 2026-06-05 cs.CL eess.AS

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Gio Paik, Hyunseo Shin, Soungmin Lee

Affiliations: University of Tokyo

AI summary: Uses model merging and domain generalization to study whether code-switching ability learned from a limited set of language pairs generalizes to unseen pairs; experiments show that bilingual CS-ASR models generalize to unseen pairs only to a limited degree.

Comments: ICML 2026 Workshop on Machine Learning for Audio

Abstract

Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.
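
The abstract mentions model merging without naming a specific method. As an illustration only, the sketch below uses plain linear weight averaging, one common merging technique; parameter names and the list-of-floats representation are hypothetical.

```python
def merge_models(state_dicts, weights=None):
    """Linearly average parameters that share a name across models.

    state_dicts: list of {param_name: list of floats} with identical shapes.
    weights: optional mixing coefficients; defaults to a uniform average.
    """
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return merged


# Two toy "bilingual" checkpoints, e.g. fine-tuned on different language pairs.
pair_a = {"encoder.w": [1.0, 2.0]}
pair_b = {"encoder.w": [3.0, 4.0]}
merged = merge_models([pair_a, pair_b])
```

The merged checkpoint would then be evaluated on a language pair that neither source model saw during fine-tuning.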

2606.05843 2026-06-05 cs.CL cs.AI

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang

Affiliations: Soochow University; Peking University

AI summary: Identifies and analyzes CoRe heads to reveal functional sparsity in cross-modal retrieval as a structural property of multimodal LLMs, and verifies both the necessity of these heads and their potential for accelerating inference.

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.
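
The abstract does not define Retrieval Attention Mass precisely. One plausible reading, sketched here as an assumption, is the share of a head's attention that lands on the query-relevant visual tokens; heads are then ranked by that score and the top fraction is treated as CoRe heads.

```python
def retrieval_attention_mass(attn_row, target_positions):
    """Assumed RAM: attention mass a head places on the relevant tokens."""
    return sum(attn_row[i] for i in target_positions)


def top_heads(head_attn, target_positions, fraction=0.05):
    """Rank heads by RAM and keep the top fraction (at least one head)."""
    scores = {
        head: retrieval_attention_mass(row, target_positions)
        for head, row in head_attn.items()
    }
    k = max(1, int(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Keys are hypothetical (layer, head) indices; values are attention rows.
heads = {
    (0, 0): [0.25, 0.25, 0.25, 0.25],  # diffuse attention
    (0, 1): [0.05, 0.80, 0.10, 0.05],  # concentrated on the relevant tokens
    (1, 0): [0.40, 0.10, 0.10, 0.40],
}
core = top_heads(heads, target_positions=[1, 2])
```

An ablation study in this spirit would zero out the heads in `core` and compare the drop against ablating lower-ranked heads.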

2606.05836 2026-06-05 cs.CL

ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL

Zhaorui Yang, Huawei Zheng, Sen Yang, Yuhui Zhang, Haoxuan Li, Zhizhen Yu, Xuan Yi, Chen Hou, Defeng Xie, Chao Hu, Minfeng Zhu, Dazhen Deng, Haozhe Feng, Danqing Huang, Yingcai Wu, Peng Chen, Wei Chen

Affiliations: State Key Lab of CAD&CG; School of Software Technology; Tencent TEG; School of Mathematical Sciences, Peking University; Zhejiang University

AI summary: Proposes ProSPy, a four-stage framework (automatic profiling, schema pruning, intermediate view fetching, and Python analysis) that combines the efficiency of SQL with the flexibility of Python to handle heterogeneous schemas, incomplete metadata, and complex analytical questions in enterprise-scale Text-to-SQL.

Comments: 24 pages, 12 figures

Abstract

Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL-Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus, without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.
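
The four stages can be walked through on a toy database. This is a simplified sketch using Python's built-in `sqlite3`, not the ProSPy implementation; the table, the pruning rule (drop columns with no data), and the median analysis are all illustrative assumptions.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL, legacy_flag TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("east", 10.0, None), ("east", 30.0, None), ("west", 20.0, None)],
)

# Stage 1: profile the data itself instead of trusting schema metadata.
columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
profile = {
    col: conn.execute(f"SELECT COUNT({col}) FROM orders").fetchone()[0]
    for col in columns  # column names come from the schema, not user input
}

# Stage 2: prune the schema to columns that actually carry data.
relevant = [col for col, non_null in profile.items() if non_null > 0]

# Stage 3: fetch an intermediate view using plain, portable SQL.
rows = conn.execute(f"SELECT {', '.join(relevant)} FROM orders").fetchall()

# Stage 4: finish the analysis in Python (a median is awkward in many dialects).
by_region = {}
for region, amount in rows:
    by_region.setdefault(region, []).append(amount)
median_by_region = {r: statistics.median(v) for r, v in by_region.items()}
```

Keeping stage 3 to simple `SELECT` statements is what makes the SQL side easy to port across dialects, while dialect-sensitive logic moves into Python.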

2606.05833 2026-06-05 cs.CV cs.AI

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

Affiliations: University of California, Davis

AI summary: Proposes GeoVR, a framework that learns from 2D video sequences by distilling 3D geometric knowledge (camera poses, depth maps, a scale factor, and multi-scale 3D features) to reshape the internal representations of multimodal LLMs and give them spatial intelligence, reaching state-of-the-art performance on spatial reasoning benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.05829 2026-06-05 cs.CV

Gender Artifacts from Art History to Text-to-Image Generation

Piera Riccio, Miriam Doh, Benedikt Höltgen, Noa Garcia, Nanne van Noord

Affiliations: University of Amsterdam; Université Libre de Bruxelles; Hasso Plattner Institut, University of Potsdam; The University of Osaka

AI summary: Proposes two gender artifact metrics (PixelSGA and MaskSGA) to study how gender representation relates to visual features across artistic styles, and finds that text-to-image models amplify the gender artifacts present in historical sources.

Abstract

Artistic styles are rooted in specific socio-historical contexts that encode social hierarchies, including distinct constructions of gender. Yet in AI research, style has long been treated as a surface-level visual property: a filter of color, brushstroke, and texture applied to otherwise content-neutral scenes. We introduce the first dataset to investigate the interplay between gender representation and style in both historical and generated images. StyleGender comprises 74k images spanning 19 artistic styles, comprising art historical images with style and gender annotations, T2I-generated images under controlled style and gender prompts, and a semantically aligned set enabling direct art history-to-generation comparison. By proposing two Set Gender Artifact (SGA) metrics (PixelSGA and MaskSGA), capturing gender signals at the pixel level and in compositional structure, we show that (1) gender representation shapes visual features across artistic styles, (2) style keywords carry these patterns into T2I generation, and (3) generative models tend to amplify gender artifacts beyond what is observed in historical sources.

2606.05828 2026-06-05 cs.AI cs.CL

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

Zeyu Gan, Huayi Tang, Yong Liu

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: Addresses implicit user preference learning in locally deployed personal agents with a lightweight architecture that decouples statistical preference learning from semantic intent parsing; local statistics influence the remote LLM's selection decisions, significantly reducing cumulative regret and improving test accuracy.

Abstract

As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.
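
One way to picture the decoupling is a local counter that turns past accept/reject feedback into a prior, which then adjusts the scores the remote LLM assigns to candidate skills. The sketch below is an assumption-laden illustration: the Laplace-smoothed acceptance rate, the log-prior adjustment, and every name are hypothetical rather than the paper's method.

```python
import math
from collections import defaultdict


class LocalPreferencePrior:
    """Keeps per-skill feedback counts on the device; no text is stored."""

    def __init__(self):
        self.accepted = defaultdict(int)
        self.shown = defaultdict(int)

    def update(self, skill, was_accepted):
        self.shown[skill] += 1
        self.accepted[skill] += int(was_accepted)

    def prior(self, skill):
        # Laplace-smoothed acceptance rate, so unseen skills get 0.5.
        return (self.accepted[skill] + 1) / (self.shown[skill] + 2)


def select_skill(llm_scores, prior):
    """Combine the remote LLM's semantic score with the local log-prior."""
    return max(llm_scores, key=lambda s: llm_scores[s] + math.log(prior.prior(s)))


prior = LocalPreferencePrior()
for _ in range(3):
    prior.update("pdf_tool_a", was_accepted=False)
    prior.update("pdf_tool_b", was_accepted=True)

# The LLM slightly prefers tool A on semantics alone; the local prior overrides it.
choice = select_skill({"pdf_tool_a": 0.2, "pdf_tool_b": 0.0}, prior)
```

The semantic scoring stays with the remote model, while the preference statistics never leave the local side.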

2606.05817 2026-06-05 cs.LG cs.AI

Consistency Training Along the Transformer Stack

Sukrati Gautam, Neil Shah, Arav Dhoot, Bryan Maruyama, Caroline Wei, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa

Affiliations: Purdue University; Independent; Columbia University; University of California, San Diego; University of California, Los Angeles; Dartmouth College; University of Michigan, Ann Arbor

AI summary: Introduces consistency targets on MLP states and attention distributions, extends consistency training to several additional safety threats, and finds cross-threat generalization and a shared mechanism, supporting consistency training as a flexible alignment framework.

Comments: Submitted to EMNLP 2026

Abstract

Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.
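
AttCT is described as matching per-head attention distributions. The abstract does not name the matching loss, so the sketch below assumes a KL divergence between the attention a head produces on a clean prompt and on a perturbed one, averaged over heads.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_consistency_loss(clean_heads, perturbed_heads):
    """Mean per-head KL between clean and perturbed attention rows."""
    losses = [kl_divergence(p, q) for p, q in zip(clean_heads, perturbed_heads)]
    return sum(losses) / len(losses)


clean = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
shifted = [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]]  # first head changed under attack

loss_same = attention_consistency_loss(clean, clean)
loss_shifted = attention_consistency_loss(clean, shifted)
```

In training, the clean-prompt attention would typically be treated as a fixed target and the loss minimized with respect to the perturbed-prompt forward pass.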

2606.05814 2026-06-05 cs.LG

Robust and sparse support vector machine via hybrid truncated loss for supervised classification

Yuliang Yang, Chen Chen, Yuxiang Liu, Huiru Wang

Affiliations: School of Science, Beijing Forestry University; Translational Cancer Research Center, Peking University First Hospital

AI summary: Proposes a sparse and bounded hybrid truncated loss $L_{\mathrm{ht}}$ and builds the $L_{\mathrm{ht}}$-SVM model for single-view classification, extended to the multi-view Mv$L_{\mathrm{ht}}$-SVM. Optimization relies on P-stationary points and an alternating direction method of multipliers; experiments show better accuracy, sparsity, and robustness than the compared methods.

Abstract

The support vector machine (SVM) is a widely used classifier, but choosing an appropriate loss function remains difficult. Convex losses such as the hinge loss and least-squares loss are sensitive to outliers, while bounded non-convex losses often lead to high computational cost. To address this, we propose a hybrid truncated loss function ($L_{\mathrm{ht}}$) that is both sparse and bounded, and build the $L_{\mathrm{ht}}$-SVM model for single-view classification. We introduce the P-stationary point and use it to establish the first-order necessary and sufficient optimality conditions. Based on these conditions, we design an alternating direction method of multipliers with a working-set strategy that reduces computational cost and achieves global convergence. We further extend $L_{\mathrm{ht}}$-SVM to multi-view learning by adding structural information and view weights, resulting in Mv$L_{\mathrm{ht}}$-SVM, which follows both the consensus and complementarity principles. Experiments on synthetic, real-world, and image datasets show that $L_{\mathrm{ht}}$-SVM achieves higher accuracy with fewer support vectors and better noise robustness than five single-view methods, while Mv$L_{\mathrm{ht}}$-SVM outperforms six multi-view methods in accuracy, precision, recall, and F1-score.
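
The abstract does not give the formula for $L_{\mathrm{ht}}$. To illustrate only the two properties it names, sparsity and boundedness, the sketch below uses a generic truncated hinge (ramp-style) loss, which is a different and well-known loss; the cap value is arbitrary.

```python
def hinge(margin):
    """Standard hinge loss on the margin y * f(x); unbounded for outliers."""
    return max(0.0, 1.0 - margin)


def truncated_hinge(margin, cap=2.0):
    """Hinge loss clipped at `cap`, so no single sample dominates."""
    return min(hinge(margin), cap)


outlier_margin = -10.0   # a badly mislabeled point
easy_margin = 1.5        # a correctly classified point beyond the margin

losses = {
    "hinge_outlier": hinge(outlier_margin),
    "truncated_outlier": truncated_hinge(outlier_margin),
    "truncated_easy": truncated_hinge(easy_margin),
}
```

The outlier contributes 11.0 under the plain hinge but only the cap under truncation (robustness), and the easy point contributes exactly zero (sparsity, since such points do not become support vectors). The price of truncation is non-convexity, which is why the optimization method matters.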

2606.05806 2026-06-05 cs.AI

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

Affiliations: Shanghai AI Laboratory; East China Normal University; Soochow University; Shandong University; Baidu Inc.

AI summary: Proposes the ToolMaze benchmark, which uses DAG-based topological complexity and a taxonomy of tool perturbations to evaluate dynamic replanning and error recovery in LLM agents when tools fail. Recovery under implicit semantic failures drops by about 37%, and agentic fault tolerance improves with model scale far more slowly than basic task execution.

Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.

2606.05804 2026-06-05 cs.CL

Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting

Michiro Asai, Ailiang Lin, Yu Kishimoto, Takao Obi, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura

Affiliations: Institute of Science Tokyo

AI summary: Proposes two recall-based prompting strategies (Self-Recall and Question-Recall) to improve LLM behavior under a knowledge cutoff constraint, with particularly strong gains on counterfactual questions, and builds the Multi-cutoff Historical Event Benchmark (MHEB) to evaluate robustness.

Abstract

Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is only causally related to the question. To address this limitation, we propose two recall-based prompting strategies: Self-Recall (SR), which asks the model to restate its cutoff constraint, and Question-Recall (QR), which requires the model to recall question-relevant information valid under the cutoff. Across three existing benchmarks, our methods outperform both direct-answer prompting and conventional step-by-step reasoning baselines, with particularly strong improvements on counterfactual questions. To investigate robustness across different cutoff settings, we further construct the Multi-cutoff Historical Event Benchmark (MHEB), which evaluates the same question under multiple cutoff years. Results show that knowledge cutoff performance varies with cutoff distance, while combining SR and QR consistently yields the best performance.
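
A prompt that combines the two strategies might be assembled as follows. The wording of every instruction is a hypothetical illustration; the abstract describes what SR and QR ask for but does not give the actual templates.

```python
def build_recall_prompt(question, cutoff_year):
    """Compose a cutoff prompt with Self-Recall and Question-Recall steps."""
    steps = [
        f"Act as if you have no knowledge of anything after {cutoff_year}.",
        # Self-Recall (SR): the model restates its own constraint.
        "Step 1: Restate your knowledge cutoff in your own words.",
        # Question-Recall (QR): the model recalls cutoff-valid facts first.
        f"Step 2: List facts relevant to the question that were already "
        f"known in {cutoff_year}.",
        "Step 3: Answer using only the facts from Step 2.",
        f"Question: {question}",
    ]
    return "\n".join(steps)


prompt = build_recall_prompt("Who is the current UN Secretary-General?", 2015)
```

Evaluating the same question under several `cutoff_year` values mirrors the multi-cutoff setup the abstract describes for MHEB.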

2606.05800 2026-06-05 cs.LG

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

Powei Chang, Jinpeng Zhang, Chaoqun Sun, MiniWell Tsao, Lianrui Li, Jianxiang Xiang, Chenyu Wang, Yukang Gao, Dongying Kong

Affiliations: Bilibili Inc.; Fudan University; Zhejiang University

AI summary: Addresses the gradient cancellation that arises when the number of rollouts grows under GRPO-style group normalization. Proposes SALT, a component that applies subspace-adaptive reweighting to the group-relative update coefficients, improving update geometry and performance.

Abstract

Reinforcement learning with verifiable rewards (RLVR) often adopts GRPO-style group-relative updates, sampling multiple rollouts per prompt to construct normalized learning signals. However, merely increasing the number of rollouts does not reliably strengthen learning: under GRPO-style group normalization, per-rollout policy-gradient features can concentrate into a low-rank, signed geometry, causing substantial cancellation during aggregation and weakening the effective update. We address this failure mode with SALT, a Subspace-Adaptive geometry pLug-in componenT that uses sample-wise gradient geometry to reweight the coefficients of group-relative updates. SALT estimates a dominant shared subspace from the mini-batch Gram geometry, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe. Across diverse reasoning-oriented RLVR benchmarks and model scales, SALT improves effective update geometry and performance without modifying the reward model or the rollout sampling procedure.
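
The cancellation effect can be reproduced in a toy setting. The sketch below is not SALT itself: it only shows that group-normalized advantages sum to zero, so when every rollout shares the same gradient feature (a rank-1 geometry) the aggregated update vanishes. The two-dimensional gradient features are made up for illustration.

```python
import statistics


def group_advantages(rewards):
    """GRPO-style normalization: subtract the group mean, divide by the std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]


def aggregate(advantages, grads):
    """Advantage-weighted mean of per-rollout gradient features."""
    dim = len(grads[0])
    return [
        sum(a * g[i] for a, g in zip(advantages, grads)) / len(grads)
        for i in range(dim)
    ]


rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_advantages(rewards)  # advantages sum to zero by construction

# Rank-1 case: every rollout has the same gradient feature, so signs cancel.
shared = [[1.0, 2.0]] * 4
update_shared = aggregate(adv, shared)

# Diverse case: features differ between good and bad rollouts, so signal survives.
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
update_diverse = aggregate(adv, diverse)
```

Adding more rollouts to the rank-1 case changes nothing, which matches the abstract's point that rollout count alone does not strengthen the update.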

2606.05799 2026-06-05 cs.LG cs.CL

CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction

CaliDist: 通过抗干扰行为鲁棒性校准大型语言模型

Mohammad Anas Jawad, Cornelia Caragea

发表机构 * Cornelia Caragea(卡伦·卡雷亚) Mohammad Anas Jawad(穆罕默德·安斯·贾瓦德)

AI总结 提出CaliDist方法,通过测量和惩罚模型对语义干扰的敏感性来校准LLM,在7个NLU基准上平均将ECE从23%降至7%。

AI中文摘要

现有的大型语言模型(LLM)校准方法常常忽略可信度的一个关键维度:模型对无关或误导信息的行为鲁棒性。在本文中,我们认为模型的真实置信度应反映其在认知压力下的稳定性。我们引入CaliDist,一种新颖的事后校准方法,直接测量并惩罚模型对干扰的敏感性。CaliDist量化了当输入提示被语义干扰项扰动时,LLM的预测和不确定性如何变化。然后利用这种稳定性(或不稳定性)信号来自适应地缩放模型的初始置信度分数。我们在六个不同LLM的七个自然语言理解分类基准上进行的广泛实验表明,与强基线相比,CaliDist一致地实现了更低的期望校准误差(ECE)和Brier分数。值得注意的是,我们的方法平均将ECE从23%降至7%(相对改进70%),表明行为稳定性是校准的有力信号。我们在github.com/m-anas-j/CaliDist提供代码和数据集。

英文摘要

Existing calibration methods for Large Language Models (LLMs) often overlook a critical dimension of trustworthiness: a model's behavioral robustness to irrelevant or misleading information. In this paper, we argue that a model's true confidence should reflect its stability under cognitive pressure. We introduce CaliDist, a novel post-hoc calibration approach that directly measures and penalizes a model's susceptibility to distraction. CaliDist quantifies how an LLM's predictions and uncertainty change when its input prompt is perturbed with semantic distractors. This stability (or lack thereof) signal is then used to adaptively scale the model's initial confidence score. Our extensive experiments on seven Natural Language Understanding classification benchmarks using six distinct LLMs show that CaliDist consistently achieves lower Expected Calibration Error (ECE) and Brier Score compared with strong baselines. Remarkably, our method reduces the ECE from 23% to 7% on average (a relative improvement of 70%), demonstrating that behavioral stability is a powerful signal for calibration. We make our code and datasets available at github.com/m-anas-j/CaliDist.
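
For readers unfamiliar with the metric, the following is a minimal sketch of the standard equal-width-bin Expected Calibration Error, plus a check of the relative-improvement arithmetic quoted in the abstract above. It does not reproduce CaliDist; the function name, the bin count, and the toy confidences are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Toy data (hypothetical): an overconfident model, 90% confident but 50% correct.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
toy_correct = [1, 1, 0, 0]
toy_ece = expected_calibration_error(toy_confidences, toy_correct)

# Arithmetic check of the reported figures: ECE going from 23% to 7%.
relative_improvement = (0.23 - 0.07) / 0.23  # about 0.696, i.e. roughly 70%

print(toy_ece, relative_improvement)
```

The relative reduction works out to about 69.6%, which matches the rounded 70% in the abstract.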

2606.05792 2026-06-05 cs.AI cs.LG cs.LO cs.SE

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

LLM 能写出正确的 TLA+ 规范吗?自然语言到 TLA+ 生成的评估

Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

发表机构 * Department of Computer Science, Loyola University Chicago(芝加哥洛约拉大学计算机科学系)

AI总结 本文首次系统评估基于 LLM 从自然语言合成 TLA+ 规范的能力,发现模型在语义正确性上仅达 8.6%,且成功依赖于渐进式提示,揭示了模型大小与质量无关、代码专用模型表现不佳等关键发现。

Comments
12 pages, 11 tables. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026); Recommended as Best Paper Award Candidate
AI中文摘要

TLA+ 已支持亚马逊和微软等公司的工业验证,但从自然语言编写正确的 TLA+ 规范仍需时间和专业知识,这限制了其采用。LLM 显示出潜力,但尚无先前研究衡量它们是否能从自然语言生成语义正确的 TLA+ 规范。本文首次系统评估基于 LLM 的 TLA+ 规范合成。我们的研究在精心策划的 205 个 TLA+ 规范数据集上评估了来自八个系列的 30 个 LLM:四种提示策略下的 25 个开放权重模型(2600 次运行)和少样本提示下的 5 个专有模型(130 次运行),所有结果均由 SANY 解析器和 TLC 模型检查器验证。LLM 达到高达 26.6% 的语法正确性,但仅 8.6% 的语义正确性,成功仅出现在渐进式提示中。结果表明模型大小不能预测质量,例如 DeepSeek r1:8b 在所有策略上优于其 70B 变体,这表明推理对齐对形式语言的重要性。由于主流语言训练的负迁移,代码专用模型始终表现不佳。我们识别出五类重复出现的幻觉,所有幻觉均可追溯到特定的训练数据偏差。这些结果表明,当前 LLM 在没有专家监督的情况下无法生成可靠的 TLA+ 规范。我们发布了评估框架、代码和数据集,以支持可重复性和未来研究。

英文摘要

TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.

2606.05785 2026-06-05 cs.CV cs.AI cs.LG

Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation

下一代LPDR并行解码器:架构优化与类别平衡的GAN增强

Shawaiz Obaid, Nida Chandio, Neha Jamil, Muhammad Khuram Shahzad

发表机构 * arXiv.org

AI总结 针对车牌检测与识别中的空间字符不匹配和数据不平衡问题,提出交叉空间混合注意力和类别平衡合成增强方法,将少数省份车牌识别率从78.2%提升至91.5%,同时保持152 FPS的实时处理性能。

Comments
8 pages, 7 figures
AI中文摘要

实时车牌检测与识别(LPDR)是现代智慧城市的基石。尽管YOLOV5-PDLPR模型通过并行解码器方法显著提高了系统效率,但其性能仍受训练集中空间字符不匹配和数据不平衡的影响。本文通过引入交叉空间混合注意力(CSHA)和类别平衡合成增强(CBSA)来解决这些局限性。进行了涉及75,000个合成样本的广泛研究,并在四个基准数据集(CCPD、CLPD、PKU和一个应用特定数据集)上进行了评估。实验结果表明,少数省份车牌识别率从78.2%大幅提升至91.5%,同时保持152 FPS的实时处理性能。结果表明,结合空间感知并行解码与类别平衡增强为高速车牌识别系统提供了有效解决方案。

英文摘要

Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.

2606.05784 2026-06-05 cs.AI

TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

TAPO: 通过信用转移实现工具感知策略优化用于多模态搜索代理

Chengqi Dong, Chuhuai Yue, Hang He, Yandong Liu, Fenghe Tang, S. Kevin Zhou, Xiaohan Wang, Jiajun Chai, Guojun Yin

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan(美团)

AI总结 针对GRPO在多模态搜索代理中信用误分配问题,提出TAPO方法,利用工具参数确定性构建反事实证人进行保守优势校正,无需额外标注或采样,在多个基准上持续提升性能。

AI中文摘要

我们识别并正式刻画了信用误分配作为GRPO在工具增强多模态搜索代理中的系统性失效模式:其对轨迹级优势的统一广播导致失败轨迹中有价值的工具使用步骤与无价值的步骤受到相同的惩罚。我们进一步通过实验量化了该现象的规模。超过一半的失败轨迹和失败的工具使用动作表现出可纠正的信用误分配,表明浪费的训练信号既显著又在结构上可被利用。基于这一见解,我们提出了工具感知策略优化(TAPO),它利用了信息获取工具的参数确定性特性:相似的调用参数定义等价的信息获取动作,因此应共享可比较的动作信用。TAPO在当前训练批次内构建反事实证人,并通过置信门控保守优势校正补偿误分配的负信用。它不需要额外的标注、模型或采样,并且引入可忽略的计算开销。在多个多模态搜索基准上,TAPO在三种主流RL算法(GRPO、GSPO和SAPO)上相对于强基线提供了一致的、即插即用的改进。我们的代码和模型将在接收后公开发布。

英文摘要

We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.

2606.05778 2026-06-05 cs.CV

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

超越绝对分数:基于编辑诱导差异的通用图像美学评估

Qifei Jia, Xintong Yao, Minghao Li, Yajie Chai, Qiming Lu, Baoyue Shen, Yasen Zhang, Runyu Shi, Ying Huang, Yue Zhang

发表机构 * Xiaomi Corporation, Beijing, China(小米公司,北京,中国)

AI总结 提出RED-Aes框架,利用可控图像编辑模型模拟人类审美推理,通过相对编辑诱导差异学习通用美学原则,实现跨场景泛化。

AI中文摘要

传统的图像美学评估(IAA)方法主要依赖于回归绝对平均意见分数(MOS)。然而,这种范式忽视了人类审美感知固有的动态性质,这种感知依赖于对隐含视觉参考的无意识比较。因此,缺乏对美学差异的因果推理使得模型无法学习通用的美学原则,从而限制了它们在多样化场景中的泛化能力。在这项工作中,我们重新思考IAA任务,并提出相对编辑诱导差异美学学习(RED-Aes),一种新颖的框架,利用可控图像编辑模型模拟人类审美推理过程。RED-Aes不拟合绝对分数分布,而是显式学习驱动美学变化的视觉因素。为了支持这一范式,我们构建了RED-20k数据集,包含基于编辑的图像对、定量美学差异和思维链(CoT)推理。此外,我们引入了一种由相对排序一致性奖励引导的三阶段训练策略,仅通过相对监督优化模型。大量实验表明,RED-Aes在多个公共基准上取得了最先进的性能,展现出优越的泛化能力。

英文摘要

Traditional Image Aesthetic Assessment (IAA) methods mainly rely on regressing absolute Mean Opinion Scores (MOS). However, such a paradigm overlooks the inherently dynamic nature of human aesthetic perception, which relies on subconscious comparison against implicit visual references. Consequently, the lack of causal reasoning regarding aesthetic differences prevents models from learning generalizable aesthetic principles, thus limiting their generalization across diverse scenarios. In this work, we rethink the IAA task and propose Relative Edit-induced Difference Aesthetic learning (RED-Aes), a novel framework that leverages controllable image editing models to simulate the human aesthetic reasoning process. Instead of fitting absolute score distributions, RED-Aes explicitly learns the visual factors that drive aesthetic changes. To support this paradigm, we construct the RED-20k dataset, which comprises editing-based image pairs, quantitative aesthetic differences, and Chain-of-Thought (CoT) reasoning. Furthermore, we introduce a three-stage training strategy guided by a relative ranking consistency reward, optimizing the model solely via relative supervision. Extensive experiments demonstrate that RED-Aes achieves state-of-the-art performance on multiple public benchmarks, exhibiting superior generalization capabilities.

2606.05774 2026-06-05 cs.CV

LiAuto-GeoX: Efficient Grounded Driving Transformer

LiAuto-GeoX: 高效接地驾驶Transformer

Jiawei Lian, Haoyi Sun, Yang Wu, Lifu Mu, Siyuan Wang, Le Hui, Ning Mao, Tao Wei, Pan Zhou, Kun Zhan, Jian Yang

发表机构 * Nanjing University of Science and Technology(南京理工大学) Li Auto Inc.(理想汽车) Northwestern Polytechnical University(西北工业大学) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学院)

AI总结 提出LiAuto-GeoX,通过稀疏激光雷达先验和几何保持蒸馏框架,实现高效、实时的自车中心密集3D重建,并显著提升下游自动驾驶任务性能。

AI中文摘要

密集3D重建在空间理解方面展现出巨大潜力,但其作为自动驾驶实时车载表示的可行性仍是一个开放挑战。现有大规模视觉几何模型通常需要大量计算资源,且缺乏动态驾驶环境所需的远距离几何保真度、环视一致性和实时效率。为弥补这一差距,我们提出LiAuto-GeoX,一种为可部署的自车中心3D场景理解设计的高效接地驾驶Transformer。我们的方法首先从大规模环视数据中学习高容量驾驶几何模型,利用稀疏激光雷达先验在远处、模糊或结构稀疏区域提供稳健的几何接地。然后,通过一种新颖的几何保持蒸馏框架,将这一能力实例化为高度紧凑的1.55亿参数车载模型。该框架采用掩码引导的深度感知蒸馏,通过强调几何信息丰富的区域来保留细粒度度量结构,以及相对姿态关系蒸馏,通过姿态诱导的几何关系强制跨视图空间一致性。大量评估表明,LiAuto-GeoX在KITTI上以220 FPS运行,同时保持高保真密集重建,实现实时部署。学习到的几何结构无缝迁移到下游自主任务,在轨迹预测中达到90.6 PDMS,在占用预测中达到24.63 mIoU,在未来帧预测中达到47.67 IoU。这些结果表明,高效的密集3D重建可以超越其作为感知目标的传统角色,作为下一代自动驾驶的可扩展基础几何表示。

英文摘要

Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, utilizing sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then instantiate this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations reveal that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers seamlessly to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. These results demonstrate that efficient dense 3D reconstruction can transcend its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.

2606.05773 2026-06-05 cs.RO

PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

PiL-World: 用于VLA策略环内评估的块式世界模型

Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang

发表机构 * Tongji University(同济大学) AIRC, Midea Group(美的集团人工智能研究院)

AI总结 提出PiL-World,一种块式世界模型,通过交替VLA推理和世界模型预测实现闭环评估,无需真实机器人执行,显著降低成功率估计误差。

AI中文摘要

视觉-语言-动作(VLA)策略在真实机器人任务中闭环运行:机器人观察场景,执行一个动作块,并根据结果观察决定下一步。然而,大多数现有的用于机器人动作评估的世界模型仅限于沿预收集动作轨迹进行开环预测。这阻碍了它们支持闭环VLA评估,其中每个动作块必须基于先前执行产生的观察。为填补这一空白,我们提出PiL-World,一种专为策略环内VLA评估设计的块式世界模型。给定当前观察和VLA策略展开的动作轨迹,PiL-World生成与VLA展开一致的多视角未来观察,并匹配策略所需的图像输入。通过交替VLA推理和世界模型预测,PiL-World实现了无需每一步真实机器人执行的闭环评估。为提高展开保真度,PiL-World将视频生成条件化为从头部视角机器人运动导出的动作视觉控制和编码任务执行上下文的潜在历史,同时联合预测互补的多视角观察。除了成功的遥操作演示,它还从失败的执行轨迹中学习,帮助想象展开更好地匹配真实策略执行的分布。我们在三个真实双臂操作任务上评估PiL-World。PiL-World生成的想象展开与真实机器人执行高度一致。更重要的是,与基线相比,它将真实世界展开中测量的VLA成功率与通过闭环世界模型评估估计的VLA成功率之间的误差从63.2%降低到12.0%。

英文摘要

Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
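
The alternation between policy inference and world-model prediction described in the abstract above can be sketched as a simple loop. This is a toy under stated assumptions, not the PiL-World implementation: the integer "observation", the stub policy, the stub world model, and the success test are all hypothetical placeholders standing in for image observations, a VLA policy, and a learned video model.

```python
from typing import Callable, List

Observation = int          # placeholder for multi-view image observations
ActionChunk = List[int]    # placeholder for a chunk of robot actions

def closed_loop_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], Observation],
    is_success: Callable[[Observation], bool],
    first_obs: Observation,
    max_chunks: int,
) -> bool:
    """Alternate policy inference and world-model prediction, chunk by chunk."""
    obs = first_obs
    for _ in range(max_chunks):
        chunk = policy(obs)            # policy conditions on the latest observation
        obs = world_model(obs, chunk)  # imagined observation replaces robot execution
        if is_success(obs):
            return True
    return False

# Hypothetical stubs: the "state" is a counter and the task is to reach 5.
def toy_policy(obs: Observation) -> ActionChunk:
    return [1, 1]

def toy_world_model(obs: Observation, chunk: ActionChunk) -> Observation:
    return obs + sum(chunk)

def toy_success(obs: Observation) -> bool:
    return obs >= 5

reached = closed_loop_rollout(toy_policy, toy_world_model, toy_success, 0, 3)
not_reached = closed_loop_rollout(toy_policy, toy_world_model, toy_success, 0, 2)
print(reached, not_reached)
```

The point of the sketch is the data flow: each action chunk depends on an observation produced by the previous prediction, which is what distinguishes closed-loop evaluation from open-loop replay of a pre-collected trajectory.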

2606.05769 2026-06-05 cs.CV

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

在预测之前想象:用于视频事件预测的交错潜在视觉推理

Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai AI Laboratory(上海人工智能实验室) City University of Hong Kong(香港城市大学) Nanjing University(南京大学) Fudan University(复旦大学) Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出Future-L1框架,通过交错潜在视觉推理在自回归解码中交替语言token和连续潜在视觉跨度,结合LA-DAPO强化学习优化,在视频事件预测任务上取得最先进结果。

Comments
https://github.com/OpenGVLab/Future-L1
AI中文摘要

视频事件预测(VEP)要求模型从部分视频证据中推断未观察到的未来状态。现有的视频多模态大语言模型(MLLMs)通常在文本空间中将中间未来推理进行语言化:一旦视觉证据被语言化,细粒度的运动、几何和交互线索可能会丢失,导致看似合理但视觉上无根据的幻觉。我们引入了Future-L1,一种交错潜在视觉推理框架,允许MLLM在自回归解码过程中在语言token和连续潜在视觉跨度之间交替。为了训练这种能力,我们通过选择未来视觉提示有助于预测的示例,并将潜在状态与未来帧嵌入对齐,构建了Future-L1-50K数据集,然后使用LA-DAPO(一种具有结果对比和时间多样性奖励的潜在感知RL目标)进一步优化采样的潜在轨迹。Future-L1在两个基准测试上均取得了新的最先进结果:在FutureBench上,它将Qwen3-VL-8B从61.0提升至85.4,并超过之前最佳Video-CoE 10.4分;在TwiFF-Bench上,它将平均得分从2.44提升至3.04。这些结果表明,面向未来的视频推理受益于在潜在空间中保留中间视觉语义,而不是将每个推理步骤都转换为文本。

英文摘要

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and aligning latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.

2606.05760 2026-06-05 cs.CV

ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection

ExpSpeech-Net: 表情与语音的多模态融合用于深度伪造检测

Ruchika Sharma, Rudresh Dwivedi

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出轻量级ExpSpeech-Net模型,通过融合面部表情和语音模式,利用SqueezeNet和RNN骨干网络及智能特征选择,实现高效深度伪造检测,准确率达94.5%。

AI中文摘要

深度伪造视频日益挑战在线内容的可信度。许多现有检测方法依赖于复杂、资源密集型的模型,限制了其实用性。本研究引入了ExpSpeech-Net深度伪造检测(SqN-R-DFD)模型,该模型以SqueezeNet和RNN(循环神经网络)为骨干,提供了一个轻量级且高效的深度伪造检测框架,能够同时分析面部表情和语音模式。该方法采用了先进的特征提取,例如基于ISLBT的图像特征和用于信号的MPNCC,并结合使用SASMA(鹬辅助黏液霉菌算法)的智能特征选择策略,确保检测模型获得最优且平衡的输入。通过结合SqueezeNet和RNN,有效捕捉深度伪造视频中的细微不一致性。该框架实现了94.5%的准确率、99.3%的精确率和96.8%的F-measure,优于传统方法。这表明,将多种模态与智能预处理和特征选择相结合,能够实现适用于日常应用的实用、实时深度伪造检测。

英文摘要

Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methods rely on complex, resource-intensive models, which limits their practical use. This study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which uses SqueezeNet and an RNN (Recurrent Neural Network) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for images and MPNCC for signals, along with a smart feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, subtle inconsistencies in deepfake videos are captured effectively. The framework achieves 94.5% accuracy, 99.3% precision, and an F-measure of 96.8%, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.

2606.05756 2026-06-05 cs.LG cs.AI cs.IT math.IT

Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability

超越软掩码:用于鲁棒GNN可解释性的硬扰动混合解释器

Jialiang Yin, Zheng Zhao, Linsey Pang, Bo Dong, Bin Shi, Jiaxing Zhang

发表机构 * Xi’an Jiaotong University(西安交通大学) PayPal, Bellevue, USA(PayPal,美国贝尔维尤)

AI总结 提出基于广义图信息瓶颈的硬扰动混合解释框架HPME,通过图池化提取离散解释子图并采用结构级替换的混合策略,解决软掩码方法中标签无关信息泄漏和分布偏移问题,提升解释保真度。

AI中文摘要

图神经网络(GNN)在涉及图结构数据的各种应用中表现出卓越性能,尤其是在高风险领域。然而,其决策过程的不透明性限制了可信度和更广泛的采用。现有的事后解释方法通过识别影响GNN预测的子图来提高可解释性,并采用混合策略来缓解使用子图进行预测时引起的分布外(OOD)问题。然而,这些方法通常依赖软掩码,其本质上无法完全消除标签无关信息,允许冗余结构泄漏到混合过程中,阻碍OOD问题的解决,从而降低解释保真度。在本文中,我们提出HPME,一个基于广义图信息瓶颈的硬扰动混合解释框架,利用图池化提取离散解释子图,并产生信息容量界限以彻底压缩标签无关组件。此外,我们引入了一种基于结构级替换的新型混合策略,生成分布内解释以有效缓解分布偏移。在多种任务上的大量实验表明,HPME在合成和真实数据集上生成鲁棒且可解释的解释方面达到了最先进的性能。

英文摘要

Graph Neural Networks (GNNs) have demonstrated remarkable performance across a range of applications involving graph-structured data, particularly in high-stakes domains. However, the opaque nature of their decision-making processes limits their trustworthiness and broader adoption. Existing post-hoc explanation methods aim to improve explainability by identifying subgraphs that influence GNN predictions and adopt mixup strategies to alleviate the out-of-distribution (OOD) issue caused by using subgraphs for prediction. Yet, these approaches typically rely on soft masks, which are inherently unable to fully eliminate label-irrelevant information, allowing redundant structures to leak into the mixup process and hindering the resolution of the OOD problem, thereby degrading explanation fidelity. In this work, we propose HPME, a Hard-Perturbation Mixup Explanation framework grounded in a generalized Graph Information Bottleneck, which leverages graph pooling to extract discrete explanatory subgraphs and to yield an information-capacity bound to thoroughly compress label-irrelevant components. Furthermore, we introduce a novel mixup strategy built upon structure-level replacement, generating in-distribution explanations to effectively mitigate the distribution shift. Extensive experiments on diverse tasks demonstrate that HPME achieves state-of-the-art performance in generating robust and interpretable explanations across both synthetic and real-world datasets.

2606.05754 2026-06-05 cs.SD cs.AI eess.AS

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

Sagnac辅助增强型OTDR分布式声学传感:标准化基准与工程评估框架

Weiguang Wang, Fugen Wu, Hailing Wang, Xuechen Liang, Xiaobin Li, Ru Han, Tianchang Xie

发表机构 * East China Jiaotong University(华东交通大学) School of Materials and Energy, Guangdong University of Technology(广东工业大学材料与能源学院) Jiangxi Tonghui Technology Group Co., Ltd.(江西 Tonghui 技术集团有限公司) School of Artificial Intelligence and Big Data, Guangzhou Vocational University of Science and Technology(广州科学技术职业大学人工智能与大数据学院)

AI总结 提出一种Sagnac辅助增强型ϕ-OTDR传感架构和标准化基准框架,通过双分支融合模型在10公里光纤上实现89.79%准确率和5.00%虚警率,解决了偏振衰落和干扰问题。

AI中文摘要

相位敏感光时域反射计(ϕ-OTDR)因其在大距离上提供分布式时空监测能力,被广泛应用于大规模分布式声学传感(DAS)。然而,其现场性能仍可能因偏振诱导衰落(PIF)、局部信号退化和强环境干扰而恶化。本研究开发了一种Sagnac辅助增强型ϕ-OTDR传感架构和面向工程的DAS事件识别标准化基准框架。Sagnac干涉仪提供连续相位响应,补充了ϕ-OTDR通道中易衰落的观测值,并通过在FPGA平台上实现的互相关过程实现异构信号对齐。该基准协议在一致的数据划分、预处理和度量定义下,比较了传统特征工程方法、概率浅层分类器、单分支深度模型和双分支融合模型。在10公里传感光纤上进行的六类代表性声学事件实验表明,双分支融合模型在评估方法中提供了最有利的权衡,在平衡测试集上达到89.79%的准确率、89.83%的宏F1值和5.00%的虚警率。结果还表明,通道分组对双分支评估影响显著,表明面向部署的结论应基于准确率、宏F1、虚警率、漏报率和延迟,而非仅凭准确率。这项工作为基于ϕ-OTDR的DAS提供了一种物理驱动的增强策略,并为未来面向融合的传感研究提供了可复现的基准协议。用于复现DAS事件识别实验的实现和脚本可在https://github.com/wawa-abc/das公开获取。

英文摘要

Phase-sensitive optical time-domain reflectometry (ϕ-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the ϕ-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for ϕ-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.

2606.05753 2026-06-05 cs.CV

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

余弦误导:辅助损失重塑视觉语言模型,而非其潜变量

XiuYu Zhang, Junfeng Fang, Zhenkai Liang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文通过实验发现,在视觉语言模型的潜视觉推理中,余弦相似度等对齐损失与准确性负相关,并引入PRISM诊断工具揭示潜变量被绕过,辅助损失主要通过共享参数重塑语言模型。

AI中文摘要

潜视觉推理(LVR)在视觉语言模型(VLM)的感知和答案生成之间插入有监督的潜变量。该领域使用这些潜变量与其视觉目标之间的对齐(即余弦相似度或均方误差)作为训练损失和质量指标,假设更好的对齐会产生更好的答案。我们通过设计包含五种LVR变体的矩阵进行测试,发现该假设被颠覆:余弦对齐与所有五种变体的准确性呈负相关(r=-0.94)。为了解释这一点,我们引入了PRISM,一对推理时诊断工具:一个线性探针,询问答案在何处可解码;一个破坏性测试,询问潜变量是否承担负载。有监督的潜变量在很大程度上被绕过。破坏它们最多使准确性变化四个百分点。答案在潜变量下游可解码,但在潜变量处不可解码,并且这种可解码性差距的大小预测了每个变体在扰动下对其潜变量的依赖程度。与信息瓶颈对损失的解释一致,辅助目标通过共享参数而非其名义上优化的潜变量来重塑语言模型。

英文摘要

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.

2606.05749 2026-06-05 cs.CL cs.AI

MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

MARDoc:面向多模态长文档问答的记忆感知精炼智能体框架

Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu, Xiaochen Zhang, Qing Yang

发表机构 * Tianjin University(天津大学) Qifu Technology(奇富科技) Beihang University(北航) Jiangnan University(江南大学)

AI总结 提出MARDoc框架,通过解耦为探索、精炼和反思三个智能体,并利用结构化记忆替代完整交互历史,减少上下文噪声,提升多模态长文档问答性能。

AI中文摘要

迭代检索-推理智能体近期在多模态长文档问答中展现出潜力。然而,现有系统大多维护一个不断增长的单一上下文,混合了检索轨迹、观察和中间推理。随着交互积累,关键证据变得分散和稀释,使多跳推理变得嘈杂。我们提出MARDoc,一个记忆感知精炼智能体框架,将长文档问答解耦为三个专门智能体:探索者负责多粒度多模态检索,精炼者负责将交互轨迹蒸馏为结构化证据和推理记忆,反思者负责检查证据充分性并提供针对性反馈。在迭代过程中,智能体依赖动态更新的结构化记忆,而非完整的累积交互历史。这种设计减少了上下文噪声,同时保留了答案关键事实及其逻辑依赖。在MMLongBench-Doc和DocBench上的实验表明,MARDoc取得了强劲结果,优于同骨干基线,并证明了结构化记忆在智能体文档问答中的有效性。

英文摘要

Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.

2606.05744 2026-06-05 cs.CL

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

PlanBench-V: 面向视觉语言模型的空间规划地图基准

Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang

发表机构 * Behavioral and Spatial AI Lab(行为与空间人工智能实验室) Tongji University(同济大学) Peking University(北京大学) College of Architecture and Urban Planning(建筑与城市规划学院)

AI总结 为评估视觉语言模型在空间规划地图解读中的能力,构建了专家标注数据集SPMD,并提出基于感知、推理、关联、实施四阶段认知框架的基准PlanBench-V,实验表明当前模型在实施类任务上存在显著局限。

AI中文摘要

空间规划地图是领土治理的核心,将规划目标、法规和空间策略转化为视觉形式,用于决策、公共沟通和机构协调。然而,其解读需要细粒度的视觉感知、空间推理和基于政策的专业判断,给人类学习者和AI系统都带来了重大挑战。随着视觉语言模型(VLM)的快速发展,其在城市规划分析中的应用日益受到关注,但现有的多模态基准主要针对通用视觉理解,忽视了规划实践中的领域特定认知过程。为填补这一空白,我们引入了PlanBench-V,这是首个用于评估VLM在空间规划地图解读中的综合基准。我们首先构建了空间规划地图数据库(SPMD),这是一个由专业规划师整理的专家标注数据集,包含223张规划地图和1629个问答对,覆盖了不同的地理区域和制图风格。然后,我们提出了一个理论驱动的评估框架,评估四种渐进能力:感知、推理、关联和实施,对应于规划地图解读的认知流程。跨两代VLM的大量实验显示了明显的进步但持续存在局限。最佳的2026年代理性推理模型Qwen3.6-Plus比最佳的2025年模型GPT-4o高出27%。尽管如此,所有模型在需要评估判断、政策敏感性和约束感知决策的实施导向任务上仍然表现挣扎。这些发现揭示了当前VLM在专业规划背景下的根本局限,并强调了领域自适应多模态推理框架的必要性。代码和数据可在https://plangpt.github.io获取。

英文摘要

Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.