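arXivDaily is a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.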

2605.09945 2026-05-12 cs.LG

Selection of the Best Policy under Fairness Constraints for Subpopulations

Tingyu Zhu, Yuhang Wu, Zeyu Zheng

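AI summary: This paper studies the problem of selecting the best policy under fairness constraints for subpopulations, requiring the selected policy to perform at or above a given threshold on every pre-specified subpopulation. The authors propose an algorithm named T-a-S-CS that efficiently identifies the policy with the best average performance while guaranteeing fairness, and they derive a lower bound on the sample complexity of the problem. Experiments show that the method is substantially more efficient than existing policy allocation methods.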
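
The selection criterion described in this summary can be sketched as follows. This is only a minimal illustration of the objective (best average performance among policies that meet the threshold on every subpopulation), assuming the subgroup means are already known or estimated; it is not the T-a-S-CS sampling procedure, and the function name, weights, and numbers are hypothetical.

```python
from typing import Dict, Optional


def select_fair_best_policy(
    subgroup_means: Dict[str, Dict[str, float]],
    subgroup_weights: Dict[str, float],
    threshold: float,
) -> Optional[str]:
    """Return the feasible policy with the highest weighted-average mean, or None."""
    best_policy, best_value = None, float("-inf")
    for policy, means in subgroup_means.items():
        # Fairness constraint: every subpopulation must meet the threshold.
        if any(m < threshold for m in means.values()):
            continue
        average = sum(subgroup_weights[g] * m for g, m in means.items())
        if average > best_value:
            best_policy, best_value = policy, average
    return best_policy


# Hypothetical estimates: policy A has a high mean on g1 but fails g2.
estimates = {
    "A": {"g1": 0.90, "g2": 0.40},
    "B": {"g1": 0.70, "g2": 0.60},
    "C": {"g1": 0.55, "g2": 0.55},
}
weights = {"g1": 0.5, "g2": 0.5}
print(select_fair_best_policy(estimates, weights, threshold=0.5))
```

In a sampling setting the means would be replaced by running estimates with confidence bounds; that part is outside the scope of this sketch.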

2605.09944 2026-05-12 cs.RO

Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion

Jianguo Zhang, Wentai Xu, Shusheng Ye, Yuxiang He, Weimin Qi, Qinbo Sun, Ning Ding, Liguang Zhou

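AI summary: To make humanoid walking on complex stairs more robust, this paper proposes a control framework conditioned on explicit stair geometry. The method extracts interpretable geometric parameters, namely stair height, depth, and yaw angle, and feeds them directly to the policy network so that it can proactively adjust gait parameters. Experiments show strong generalization and stability in both simulation and the real world, and an outdoor test on 33 consecutive steps supports its practical value.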

Comments 8 pages, 7 figures, 4 tables

2605.09942 2026-05-12 cs.AI

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

Dongming Jiang, Yi Li, Guanpeng Li, Qiannan Li, Bingzhe Li

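AI summary: This paper proposes HAGE, a reinforcement-learning-based weighted multi-relational memory framework that addresses memory retrieval in agentic large language model systems. HAGE recasts memory retrieval as a query-conditioned sequential graph traversal, organizes memory as relation-specific graph views over shared memory nodes, and encodes multi-dimensional relational signals in trainable relation feature vectors. A routing network dynamically modulates the dimensions of the edge embeddings, and traversal scores combine semantic similarity with the query-conditioned edge representation so that high-utility relational paths are prioritized. Experiments show that HAGE achieves higher accuracy on long-horizon reasoning tasks and a better accuracy-efficiency trade-off.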

2605.09939 2026-05-12 cs.RO

Neural Distance-Guided Path Integral Control for Tractor-Trailer Navigation

Peng Wei, Chen Peng, Stavros Vougioukas

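AI summary: This paper studies safe autonomous navigation of tractor-trailer systems in complex agricultural environments. To handle their complex geometry and nonlinear dynamics, it proposes a real-time obstacle-avoidance method built on a geometric neural encoder. The network quickly and accurately estimates the distance between the tractor-trailer and the LiDAR-perceived environment, which enables dynamic geometric reasoning without a prior map. The learned distances are incorporated into a model predictive path integral (MPPI) controller to improve navigation safety and responsiveness in complex environments. Simulation results confirm that the method generates dynamically feasible and safe trajectories.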
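
For readers unfamiliar with MPPI, the weighting step at its core can be sketched as below. This is a generic, simplified illustration under stated assumptions: the distance values stand in for the output of a learned distance estimator, the penalty shape, margin, and gain are hypothetical, and candidate control sequences are blended directly rather than as perturbations of a nominal sequence.

```python
import math
from typing import List, Sequence


def mppi_weights(costs: Sequence[float], temperature: float) -> List[float]:
    """Softmin weights over rollout costs (lower cost gives higher weight)."""
    best = min(costs)  # subtract the minimum for numerical stability
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


def clearance_penalty(distance: float, margin: float, gain: float) -> float:
    """Penalize a state whose estimated clearance falls below the safety margin."""
    return gain * max(0.0, margin - distance)


def mppi_update(candidates: Sequence[Sequence[float]],
                costs: Sequence[float],
                temperature: float) -> List[float]:
    """Blend candidate control sequences using their MPPI weights."""
    weights = mppi_weights(costs, temperature)
    horizon = len(candidates[0])
    return [sum(w * seq[t] for w, seq in zip(weights, candidates))
            for t in range(horizon)]


# Two hypothetical steering sequences with equal tracking cost; the second
# passes closer to an obstacle, so it receives an extra clearance penalty.
candidates = [[0.0, 0.1], [0.3, 0.4]]
tracking_costs = [1.0, 1.0]
estimated_distances = [2.0, 0.2]  # stand-in for a learned distance model
costs = [c + clearance_penalty(d, margin=0.5, gain=10.0)
         for c, d in zip(tracking_costs, estimated_distances)]
print(mppi_update(candidates, costs, temperature=1.0))
```

The blended sequence stays close to the first candidate because the penalized rollout receives a much smaller weight.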

2605.09936 2026-05-12 cs.CV cs.IR cs.LG

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini

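AI summary: This paper presents Urban-ImageNet, a large-scale multi-modal dataset and evaluation framework for perceiving urban space from social media imagery. The dataset contains over 2 million public images with paired text from Weibo, covers 61 urban sites in 24 Chinese cities, and supports training and evaluation at scales from 1K to 2M. Built on a hierarchical taxonomy grounded in urban theory, Urban-ImageNet supports three tasks (urban scene semantic classification, cross-modal image-text retrieval, and instance segmentation) and is designed to assess how well AI models understand the social, functional, and spatial characteristics of urban space.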

Abstract

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities from 2019 to 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.

2605.09934 2026-05-12 cs.CL

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

Bihui Yu, Caijun Jia, Jing Chi, Xiaohan Liu, Yining Wang, He Bai, Yuchen Liu, Jingxuan Wei, Junnan Zhu

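AI summary: TRACER is a verifiable generative provenance framework for multimodal tool-using agents. It targets the "provenance gap" in current tool use, where generated claims lack explicit dependencies on the evidence that supports them. TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool call, evidence unit, and semantic relation, verifies these records through several checks, and then uses them as traceability constraints and local credit assignment in reinforcement learning. On the TRACE-Bench benchmark, TRACER outperforms the strongest closed-source tool-augmented baseline by 23.80 percentage points, showing that reliable multimodal tool reasoning depends on provenance-aware use of observations rather than on more tool calls alone.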

Abstract

Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
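
The provenance record described in the abstract can be pictured as a small data structure with mechanical checks. The sketch below is an assumption-laden illustration, not TRACER's actual schema: the field names and the string-containment test are hypothetical, and only simplified versions of schema checking, tool-turn alignment, and source authenticity are modeled (relation rationality is left out).

```python
from dataclasses import dataclass
from typing import List

# Relation space named in the abstract.
RELATIONS = {"Quotation", "Compression", "Inference"}


@dataclass
class ProvenanceRecord:
    sentence: str       # generated answer sentence
    tool_turn: int      # index of the supporting tool call in the trajectory
    evidence_unit: str  # part of the tool observation the sentence relies on
    relation: str       # one of RELATIONS


def check_record(record: ProvenanceRecord, observations: List[str]) -> List[str]:
    """Return a list of violations; an empty list means the record passes."""
    problems = []
    if record.relation not in RELATIONS:
        problems.append("unknown relation")
    if not 0 <= record.tool_turn < len(observations):
        problems.append("tool turn out of range")
    elif record.evidence_unit not in observations[record.tool_turn]:
        problems.append("evidence not found in cited observation")
    return problems


# Hypothetical two-step tool trajectory.
observations = ["OCR: Total due 42 USD", "calc: 42 * 2 = 84"]
good = ProvenanceRecord("The invoice total is 42 USD.", 0, "Total due 42 USD", "Quotation")
bad = ProvenanceRecord("Twice the total is 84.", 1, "42 * 3", "Inference")
print(check_record(good, observations))
print(check_record(bad, observations))
```

A record that passes such checks could then be counted toward a traceability reward, while a failing one would receive no credit; how TRACER weights these signals is not specified here.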

2605.09932 2026-05-12 cs.CL

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

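AI summary: Large language models still struggle to make effective use of information in long contexts. This paper proposes FocuSFT, a fine-tuning method based on bilevel optimization that optimizes attention allocation during training to counter the dilution of attention on content-relevant tokens caused by positional bias and attention sinks. The inner loop introduces lightweight fast parameters that form a parametric memory and steer the model toward semantically relevant content; the outer loop then performs supervised fine-tuning on top of this inner optimization, improving performance on long-context tasks. Experiments show significant gains on multiple benchmarks.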

2605.09931 2026-05-12 cs.CL cs.AI

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

Luan Zhang, Dandan Song, Zhijing Wu, Zhengyu Chen, Chen Zhang, Yuhang Tian, Huipeng Ma, Chenhao Li, Changzhi Zhou, Xudong Li, Shuhao Zhang

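AI summary: PruneTIR is an inference-time method for improving both the effectiveness and the efficiency of tool-integrated reasoning (TIR) in large language models that already know how to use tools. It prunes erroneous tool-call trajectories, resamples tool calls, and suspends tool use when necessary, which limits the damage that failed calls do to the reasoning process and keeps the model from getting stuck in loops of repeated failures. Experiments show that PruneTIR significantly improves reasoning accuracy and efficiency while shortening the context length required for inference.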

2605.09929 2026-05-12 cs.LG cs.SE

TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications

Pranshav Gajjar, Emmanuel Ojo, Vijay K Shah

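AI summary: This paper introduces TeleResilienceBench, which evaluates how well large language models in the telecommunications domain recover from incomplete or erroneous reasoning, a property the authors call reasoning resilience. The benchmark collects failure cases from weak generator models, truncates the flawed reasoning, and asks the target model to continue and correct it, thereby quantifying recovery. Even the strongest model achieves a recovery rate of only 29.1%, and larger models are not always more resilient; Nemotron-3-nano 4b offers the best resilience-to-cost ratio. The study also finds that difficulty labels in current telecom benchmarks reflect knowledge coverage more than reasoning depth.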

2605.09925 2026-05-12 cs.CV

Frequency Adapter with SAM for Generalized Medical Image Segmentation

Phuoc-Nguyen Bui, Van-Nguyen Pham, Duc-Tai Le, Junghyun Bum, Hyunseung Choo

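AI summary: Medical image segmentation is important for computer-aided diagnosis and treatment planning, but deep learning models often fail to generalize across datasets because of differences in imaging protocols, scanners, and patient populations. This paper proposes FSAM, a generalizable medical image segmentation method based on frequency-domain adaptation. It combines low-rank adaptation (LoRA) with a frequency adapter module to extract domain-invariant high-frequency features and improve generalization when training on a single source domain. Experiments show that it outperforms both conventional domain generalization methods and SAM-based ones on retinal and prostate datasets.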

Comments Under review, 10 pages, 1 figure, 2 tables

2605.09924 2026-05-12 cs.CL

Evolving Knowledge Distillation for Lightweight Neural Machine Translation

Xuewen Zhang, Haixiao Zhang, Xinlong Huang

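AI summary: This paper proposes Evolving Knowledge Distillation (EKD), a progressive knowledge distillation framework that addresses the difficulty of deploying large neural machine translation models on resource-constrained devices. The student model learns step by step from a sequence of teachers of increasing capacity, which narrows the performance gap between student and teacher. Experiments show significant improvements on multiple benchmark datasets, with the final student performing very close to the large teacher model.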

2605.09922 2026-05-12 cs.CL cs.AI

Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

Wu Li, Yigeng Zhou, Zesheng Shi, Yequan Wang, Min Zhang, Jing Li

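AI summary: This paper proposes TPAW, a team-based self-play algorithm for improving large language model alignment in a fully self-supervised setting. The current policy model cooperates and competes with historical checkpoints, which improves training stability and efficiency, and two adaptive weighting mechanisms adjust the importance of target responses and the contribution of each team member during training. Experiments show that TPAW outperforms existing methods across a range of base models and LLM benchmarks.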

Comments Accepted by ACL 2026 Main

2605.09920 2026-05-12 cs.LG cs.AI

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

Xuexiang Wen, Hang Yu, Linchao Zhu, Gaoang Wang

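AI summary: This paper proposes VIGOR, a verifier-free reinforcement learning method for post-training large language models. It uses the gradient norm that the policy model produces on its own generated text as an intrinsic reward signal, steering the model toward outputs that better fit the current policy. By correcting the length bias of the gradient norm and adopting a group-wise ranking scheme, VIGOR makes the reward signal more stable and effective, and it outperforms existing methods on mathematical reasoning and code generation tasks.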

Comments Accepted to Findings of ACL 2026

2605.09918 2026-05-12 cs.LG cs.AI cs.CY

NaiAD: Initiate Data-Driven Research for LLM Advertising

Yihang Zhang, Zimeng Huang, Ren Zhai, Yipeng Kang, Tonghan Wang

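AI summary: This paper presents NaiAD, the first comprehensive dataset designed for large language model (LLM) advertising, containing 58,999 carefully constructed ad-embedded responses with their corresponding user queries. The dataset is built on theoretically grounded evaluation metrics that capture user value and commercial value separately, and a decoupled generation pipeline mitigates the dimensional collinearity of aligned LLMs to produce structurally diverse samples. The study also introduces a scoring framework based on variance-calibrated prediction-powered inference that brings automatic scores into agreement with human annotation, and it finds that successful ad integration relies on four semantic strategies, laying groundwork for future LLM-native advertising systems.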

Comments 37 pages, 11 figures

2605.09915 2026-05-12 cs.CL cs.AI cs.CY

Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

Rong Shan, Te Gao, Hang Zheng, Yunjia Xi, Jiachen Zhu, Zeyu Zheng, Yong Yu, Weinan Zhang, Jianghao Lin

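AI summary: This position paper argues that top AI conferences, which aim to keep acceptance rates roughly stable, may face a new threat called denominator gaming, driven by fully automated scientific agents. Malicious actors could deploy AI agents to submit large numbers of superficially plausible but low-quality papers, diluting review resources and raising the acceptance probability of particular high-quality papers. The paper analyzes the feasibility and impact of this threat and argues that addressing it requires systemic policy and incentive reform rather than technical detection alone.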

Comments Accepted by ICML'26 Position Track
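
The arithmetic behind the denominator-gaming concern in this entry can be illustrated with a toy model. It is a deliberately idealized sketch, not the paper's analysis: it assumes the venue holds its acceptance rate fixed, that no junk submission is ever accepted, and that accepted slots are spread evenly over genuine papers. The function name and the numbers are hypothetical.

```python
def genuine_acceptance_probability(genuine: int, junk: int, acceptance_rate: float) -> float:
    """Chance that a genuine paper is accepted when junk inflates the denominator."""
    # A fixed acceptance rate means the number of accepted slots grows with
    # the total submission count, junk included.
    accepted_slots = acceptance_rate * (genuine + junk)
    # Under the stated assumptions, all slots go to genuine papers.
    return min(1.0, accepted_slots / genuine)


print(genuine_acceptance_probability(10_000, 0, 0.25))
print(genuine_acceptance_probability(10_000, 5_000, 0.25))
```

With a 25% target rate, 5,000 junk submissions on top of 10,000 genuine ones lift the effective rate for genuine papers from 25% to 37.5% in this toy model, before accounting for the extra reviewing load.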

2605.09908 2026-05-12 cs.LG cs.AI cs.SD

Voice Biomarkers for Depression and Anxiety

Oleksii Abramenko, Noah D. Stein, Colin Vaz

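AI summary: This paper studies the detection of depression and anxiety from voice and proposes a deep learning approach that models the raw speech signal directly, avoiding the reliance on hand-crafted features of traditional methods. The model is trained on a large dataset of about 65,000 utterances from 23,000 speakers representative of the US population. It extracts content-independent biomarker information, which can be combined with lexical features of speech to improve prediction in practice. On an independent test set of about 5,000 speakers, the model reaches 71% sensitivity and specificity, and it has been released as open source to support further research.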
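
For reference, sensitivity and specificity are computed from a confusion matrix as shown below. The counts are made up for illustration and are not the paper's data.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual positive cases that the screen detects."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of actual negative cases that the screen correctly clears."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical screening result: 40 people with the condition, 60 without.
print(sensitivity(true_positives=30, false_negatives=10))
print(specificity(true_negatives=45, false_positives=15))
```

Reporting both numbers at the same operating point, as the summary does, indicates the decision threshold was set where the two error types are balanced.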

2605.09906 2026-05-12 cs.AI cs.SD

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang

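AI summary: To address cross-modal interference during reasoning in audio-visual large language models, this work proposes a reasoning framework called Separate First, Fuse Later (SFFL). It enforces modality-specific reasoning, generating separate audio and visual reasoning traces and integrating them only at a later stage to produce the answer, which reduces interference between modalities. Experiments show significant gains in accuracy and robustness on multiple benchmarks.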

2605.09905 2026-05-12 cs.LG cs.AI

Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging

Guisong Liu, Xin Gao, Martin Dresler, Jiansong Zhang, Pengfei Wei

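AI summary: This paper revisits the role of randomly initialized Transformers in sleep staging and points out an overlooked property: sleep signals have strong local temporal continuity. It finds that an untrained random Transformer already improves sleep staging performance significantly and outperforms traditional smoothing methods. By introducing the Random Attention Prior Kernel (RAPK), the paper shows that random self-attention adaptively balances global averaging against content similarity while preserving stage transitions, which indicates that the gains come mainly from the inductive bias of the architecture rather than from learned parameters. The finding suggests a route to efficient sleep monitoring systems suited to edge devices.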

2605.09902 2026-05-12 cs.CV

Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment

Haobo Wang, Xiaorong Ma, Weiqi Luo, Xiaojun Jia, Jiwu Huang

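AI summary: Addressing the security of multimodal large language models (MLLMs), this work proposes PRAF-Attack, a new targeted transfer-based attack that uses adversarial examples to mislead a model's judgment of image content. The method introduces progressive resolution processing and an adaptive feature alignment strategy, uses intermediate-layer features to strengthen transferability and robustness, and selects transferable hierarchical features via gradient consistency. Experiments on a range of black-box MLLMs show that PRAF-Attack transfers better than existing methods.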

Abstract

Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.

2605.09900 2026-05-12 cs.AI cs.CL cs.CV

The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

Hao Liu, Jicheng Liu

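AI summary: This paper introduces KnotBench, a new benchmark for evaluating vision-language models on knot diagram tasks. Using a large collection of knot images with corresponding canonical signatures, the authors design 14 tasks spanning equivalence judgment, move prediction, recognition, and cross-modal alignment, which expose a gap between perception and manipulation in current models. Experiments show that even state-of-the-art models such as Claude Opus 4.7 and GPT-5 perform near chance without thinking mode; thinking mode helps, but models still struggle to simulate knot operations accurately.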

Comments 41 pages, 18 figures

2605.09899 2026-05-12 cs.CV cs.AI

Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

Kanglin Ning, Wenrui Li, Houde Quan, Qifan Li, Xingtao Wang, Xiaopeng Fan

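AI summary: This paper proposes HGC-Det, a cross-modal knowledge distillation method constrained by hyperbolic geometry, to improve multi-modal 3D object detection. Semantic features are extracted by separate image and point-cloud branches, and three core components are introduced: semantic-guided voxel refinement, hyperbolic-geometry-constrained cross-modal feature transfer, and geometric refinement of feature aggregation. Together they mitigate modality heterogeneity, spatial misalignment, and the representation crisis. Experiments show a good balance between detection accuracy and computational cost on indoor and outdoor datasets.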

Comments The current version has been submitted to IEEE Transactions on Multimedia; the manuscript is under review.

2605.09893 2026-05-12 cs.CL cs.AI

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

Sushrita Rakshit, Hanwen Zhang, Hua Shen

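AI summary: This study examines the value-action gap in large language models, that is, the inconsistency between the values a model professes and how it actually behaves. It identifies a new failure mode, pseudo-deliberation, in which the model produces seemingly sound reasoning while its actions remain misaligned with its values. The authors build the VALDI framework to systematically evaluate how well models adhere to values in dialogue generation and find significant value-action inconsistency in both proprietary and open-source models. They also propose VIVALDI, a multi-agent auditing system that intervenes during generation to improve alignment.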

Comments 9 pages

2605.09887 2026-05-12 cs.LG cs.AI math.DG

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Eslam Zaher, Maciej Trzaskowski, Quan Nguyen, Fred Roosta

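AI summary: This study investigates the geometric reasons why the reconstruction error of sparse autoencoders (SAEs) varies across network layers, arguing that differences in the curvature and intrinsic dimension of activation space produce variation that existing single-layer scaling laws cannot explain. By analyzing geometric features of many layers across two models, it finds that the SAE width-sparsity scaling depends on each layer's manifold structure and proposes a geometric scaling law that transfers across models. Experiments show that manifold geometry predicts the per-layer width exponent, and that higher curvature and intrinsic dimension correspond to a higher reconstruction error floor, so SAEs face a "geometric wall" set by manifold structure rather than a resource-limited ceiling.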

Abstract

Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE's width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model's per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.

2605.09886 2026-05-12 cs.RO

Network-Efficient World Model Token Streaming

Shatadal Mishra, Ahmadreza Moradipari, Nejib Ammar

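AI summary: This work studies how to efficiently transmit and synchronize the state representation of a discrete world model in distributed computing and vehicular networking settings. It proposes a network-efficient streaming method built on a VQ-U-Net encoder, together with a label-free, fully online algorithm that uses cosine distance to prioritize the parts of the state that have changed and adaptively triggers keyframes to cope with bandwidth limits and packet loss. Experiments show that, at the same bitrate, the method significantly reduces distortion of the state embeddings and improves downstream prediction performance, supporting its practical value in vehicular networks.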

Comments Accepted at IEEE VNC 2026
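
The idea of prioritizing changed state by cosine distance, with a keyframe fallback, can be sketched as follows. This is a minimal illustration under assumptions: the keyframe rule (mean change above a threshold), the per-frame budget, and all names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treated as maximal change if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def plan_update(previous: Sequence[Sequence[float]],
                current: Sequence[Sequence[float]],
                budget: int,
                keyframe_threshold: float) -> Tuple[str, List[int]]:
    """Choose which token positions to transmit for the current frame."""
    distances = [cosine_distance(p, c) for p, c in zip(previous, current)]
    mean_change = sum(distances) / len(distances)
    if mean_change > keyframe_threshold:
        # Too much has changed: resend the whole state as a keyframe.
        return "keyframe", list(range(len(current)))
    # Otherwise send only the most-changed positions that fit the budget.
    ranked = sorted(range(len(current)), key=lambda i: distances[i], reverse=True)
    return "delta", sorted(ranked[:budget])


previous = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
small_change = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.2]]
scene_change = [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]
print(plan_update(previous, small_change, budget=1, keyframe_threshold=0.5))
print(plan_update(previous, scene_change, budget=1, keyframe_threshold=0.5))
```

A real system would also need to resend keyframes after packet loss so that sender and receiver states do not drift apart; that logic is omitted here.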

2605.09879 2026-05-12 cs.AI

M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

Junjian Wang, Xin Zhou, Qiran Xu, Kun Zhan

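AI summary: This work proposes M2A, a method for combining mathematical reasoning and agentic reasoning in large language models, two capabilities that are hard to learn jointly through multi-task training. M2A merges models in parameter space and injects mathematical reasoning only along a subspace that does not affect agentic behavior, deepening reasoning without disturbing existing behavior. Experiments show clear gains on real-world coding-agent tasks; for example, on Qwen3-8B the SWE-Bench Verified resolve rate rises from 44.0% to 51.2%.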
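
One generic way to merge along a restricted subspace is to project the update away from a set of protected directions before adding it. The sketch below shows only that projection step on toy vectors; it is not M2A's procedure. How the protected directions are identified is not described in the summary, so they are simply given here, and they are assumed to be orthonormal. All names are hypothetical.

```python
from typing import List, Sequence


def project_out(vector: Sequence[float],
                protected_directions: Sequence[Sequence[float]]) -> List[float]:
    """Remove the components of `vector` along orthonormal protected directions."""
    result = list(vector)
    for direction in protected_directions:
        coefficient = sum(v * d for v, d in zip(vector, direction))
        result = [r - coefficient * d for r, d in zip(result, direction)]
    return result


def merge(agent_params: Sequence[float],
          math_delta: Sequence[float],
          protected_directions: Sequence[Sequence[float]],
          scale: float = 1.0) -> List[float]:
    """Add the math update to the agent parameters, outside the protected subspace."""
    safe_delta = project_out(math_delta, protected_directions)
    return [a + scale * s for a, s in zip(agent_params, safe_delta)]


agent_params = [1.0, 2.0, 3.0]
math_delta = [0.5, 0.5, 0.5]   # math-tuned weights minus base weights (toy values)
protected = [[1.0, 0.0, 0.0]]  # direction assumed to matter for agentic behavior
print(merge(agent_params, math_delta, protected))
```

The first coordinate is left untouched because the update's component along the protected direction is removed before merging.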

2605.09875 2026-05-12 cs.AI

Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

Su-Hyeon Kim, Yo-Sub Han

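AI summary: Large language models from different families use different hidden dimensions, tokenizers, and training procedures, which makes behavioral directions hard to compare or transfer across models. This paper proposes an anchor-projection framework that maps each model's hidden representations into a shared anchor coordinate space (ACS), where behavioral directions can be extracted and aligned across models. Experiments show good alignment across multiple model families and behavioral axes and stable transfer on downstream tasks, offering a new perspective for cross-family interpretability research.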
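
The intuition behind a shared anchor coordinate space can be shown with a toy example: describe a vector by its cosine similarity to the hidden states of a fixed set of anchor inputs, so the coordinates have one entry per anchor regardless of hidden size. This is a sketch of that general idea and an assumption about the construction, not the paper's exact projection; the two "models" below are hand-made vectors.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def to_anchor_coordinates(vector: Sequence[float],
                          anchor_states: Sequence[Sequence[float]]) -> List[float]:
    """Describe `vector` by its similarity to each anchor's hidden state."""
    return [cosine(vector, anchor) for anchor in anchor_states]


# Toy model A: hidden size 2.
anchors_a = [[1.0, 0.0], [0.0, 1.0]]
direction_a = [1.0, 1.0]

# Toy model B: hidden size 3, a rotated and rescaled copy of model A's space.
anchors_b = [[0.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
direction_b = [0.0, 2.0, 2.0]

print(to_anchor_coordinates(direction_a, anchors_a))
print(to_anchor_coordinates(direction_b, anchors_b))
```

Both directions receive the same anchor coordinates even though the hidden sizes differ, which is what makes comparison across models possible in this toy setting.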

2605.09874 2026-05-12 cs.CV cs.AI cs.CL

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

Ziyang Wang, Yue Zhang, Shoubin Yu, Ce Zhang, Zengqi Zhao, Jaehong Yoon, Hyunji Lee, Gedas Bertasius, Mohit Bansal

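AI summary: EgoMemReason is a memory-driven reasoning benchmark for long-horizon egocentric video understanding. It evaluates a model's ability to accumulate, recall, and reason over visual information spanning multiple consecutive days. The benchmark covers three complementary memory types (entity memory, event memory, and behavior memory), which test recognition of object state changes, activity order, and long-term behavioral patterns. Experiments show that the best current model reaches only 39.6% overall accuracy, so long-horizon memory reasoning remains a major challenge.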

Comments The first two authors contributed equally. Project website: https://egomemreason.github.io/

Abstract

Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.

2605.09870 2026-05-12 cs.LG cs.AI

Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

Tsuyoshi Okita

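AI summary: This paper proposes SVAR-FM, an intervention-based framework for time series causal discovery that treats a physical simulator as an implementation of Pearl's do-operator and uses simulator-generated interventional data to learn nonlinear causal relations. It proves identifiability of the structural VAR model under certain conditions, and experiments across several scientific domains show that it outperforms traditional observational methods; notably, it correctly predicts sign reversals of causal effects even when the simulator is imprecise.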

Comments 54 pages, 6 figures

2605.09867 2026-05-12 cs.LG cs.AI

Continuous Latent Contexts Enable Efficient Online Learning in Transformers

Emile Anand, Abdullah Ateyeh, Xinyuan Cao, Max Dabagia

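AI summary: This work studies how to make Transformers better at online learning and proposes augmenting them with continuous latent context tokens. It constructs constant-depth Transformers that store algorithmic state as linear combinations, which lets them implement basic online decision procedures such as the weighted majority algorithm and Q-learning. Experiments show that lightweight Transformers with latent contexts outperform larger, more complex language models on long-sequence online prediction tasks, indicating that latent contexts are an effective state representation for implementing online learning algorithms.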

Comments 37 pages, 15 figures, 3 tables
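
For context, the weighted majority algorithm mentioned in this entry keeps one weight per expert and shrinks the weights of experts that err. The sketch below is the classic deterministic version written directly in Python, not the paper's Transformer construction; the weight vector is the kind of algorithmic state that, per the summary, the latent context tokens would carry. The example data is made up.

```python
from typing import List, Sequence, Tuple


def weighted_majority(expert_predictions: Sequence[Sequence[int]],
                      outcomes: Sequence[int],
                      beta: float = 0.5) -> Tuple[int, List[float]]:
    """Run deterministic weighted majority on binary predictions.

    Returns the number of mistakes made and the final expert weights.
    """
    weights = [1.0] * len(expert_predictions[0])
    mistakes = 0
    for predictions, outcome in zip(expert_predictions, outcomes):
        vote_one = sum(w for w, p in zip(weights, predictions) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, predictions) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        if guess != outcome:
            mistakes += 1
        # Multiply the weight of every expert that was wrong by beta.
        weights = [w * beta if p != outcome else w
                   for w, p in zip(weights, predictions)]
    return mistakes, weights


# Three experts over four rounds; expert 0 is always right.
rounds = [(1, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
outcomes = [1, 0, 1, 1]
print(weighted_majority(rounds, outcomes))
```

After two early mistakes the reliable expert dominates the vote, which is the behavior the algorithm's mistake bound formalizes.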

2605.09864 2026-05-12 cs.CV cs.LG

DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment

Kevin Zhu, William Tang, Raphael Hay Tene, Zesheng Liu, Nhut Le, Maryam Rahnemoonfar

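AI summary: This paper proposes DA-SegFormer, a semantic segmentation method for fine-grained disaster assessment that targets the difficulty of recognizing subtle damage in UAV imagery caused by texture degradation and class imbalance. Built on the SegFormer architecture, it adds a class-aware sampling strategy and online hard example mining combined with Dice loss to better learn rare damage features, and it uses a resolution-preserving inference protocol to retain original texture detail. On the RescueNet dataset, DA-SegFormer reaches 74.61% mIoU, significantly outperforming the baseline, with notable gains on key damage classes.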

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
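
The mIoU metric reported for DA-SegFormer is the per-class intersection over union averaged across classes. A minimal sketch on flattened label lists follows; the labels are made up, and averaging only over classes that appear in the prediction or the target is an assumption, since benchmark conventions vary.

```python
from typing import List, Sequence


def mean_iou(prediction: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in prediction or target."""
    ious: List[float] = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(prediction, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(prediction, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Six pixels, three classes (hypothetical labels).
prediction = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(prediction, target, num_classes=3))
```

Because every class counts equally, rare damage classes affect mIoU as much as common ones, which is why class imbalance matters for this metric.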