arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09945 2026-05-12 cs.LG

Selection of the Best Policy under Fairness Constraints for Subpopulations

Tingyu Zhu, Yuhang Wu, Zeyu Zheng

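AI summary: This paper studies how to select the best policy under fairness constraints across subpopulations, requiring the chosen policy to perform at or above a threshold in every pre-specified subpopulation. The authors propose an algorithm called T-a-S-CS that efficiently identifies the policy with the best average performance while guaranteeing fairness, and they give a lower bound on the sample complexity of the problem. Experiments show significant efficiency gains over existing policy allocation methods.

Abstract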

Many high-stakes decisions in health care, public policy, and clinical development require committing to a single policy that will be applied uniformly across a heterogeneous population. Regulatory and fairness standards sometimes require that the chosen policy perform adequately in every pre-specified subpopulation, not only on average. We formalize this as a Selection of the Best with Fairness Constraints (SBFC) problem, in which the goal is to identify the policy with the highest average performance among those policies that meet a minimum per-subpopulation threshold. We establish an instance-specific lower bound on the sample complexity of the SBFC problem. We then develop a Track-and-Stop with Constraints on Subpopulation (T-a-S-CS) algorithm that achieves the lower bound asymptotically. We extend the framework to general closed-set and penalty-based fairness specifications with matching guarantees. Numerical experiments and a case study using the International Stroke Trial demonstrate substantial efficiency gains over policy-level allocation baselines.
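
The selection target described in this abstract can be illustrated with a minimal sketch. It assumes the true per-subpopulation mean performances are already known, whereas the paper's actual problem is to learn them from samples; the T-a-S-CS sampling and stopping rules are not reproduced here. The function name, the `weights` argument (subpopulation shares used for the average), and the toy numbers are all hypothetical.

```python
from typing import Optional, Sequence


def select_best_fair_policy(
    means: Sequence[Sequence[float]],  # means[p][g]: mean performance of policy p in subpopulation g
    weights: Sequence[float],          # assumed population share of each subpopulation
    threshold: float,                  # minimum acceptable per-subpopulation performance
) -> Optional[int]:
    """Return the index of the feasible policy with the highest weighted average,
    or None when no policy meets the threshold in every subpopulation."""
    best_idx, best_avg = None, float("-inf")
    for p, row in enumerate(means):
        if min(row) < threshold:
            continue  # fails the fairness constraint in at least one subpopulation
        avg = sum(w * m for w, m in zip(weights, row))
        if avg > best_avg:
            best_idx, best_avg = p, avg
    return best_idx


means = [
    [0.95, 0.45],  # policy 0: highest average, but below threshold in subpopulation 1
    [0.70, 0.65],  # policy 1
    [0.60, 0.62],  # policy 2
]
print(select_best_fair_policy(means, weights=[0.5, 0.5], threshold=0.5))  # prints 1
```

Policy 0 would win on average alone; the per-subpopulation threshold removes it from the feasible set, so policy 1 is selected.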

2605.09944 2026-05-12 cs.RO

Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion

Jianguo Zhang, Wentai Xu, Shusheng Ye, Yuxiang He, Weimin Qi, Qinbo Sun, Ning Ding, Liguang Zhou

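AI summary: To improve the robustness of humanoid robots walking on complex stairs, this paper proposes a control framework conditioned on explicit stair geometry. The method extracts interpretable geometric parameters such as step height, step depth, and yaw angle and feeds them directly to the policy network, enabling proactive adjustment of gait parameters. Experiments show strong generalization and stability in both simulation and the real world, and an outdoor test of 33 consecutive steps in particular supports its practical value.

Comments 8 pages, 7 figures, 4 tables

Abstract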

Robust humanoid stair climbing remains challenging due to geometric discontinuities, sensitivity to step height variations, and perception uncertainty in real-world environments. Existing learning-based locomotion policies often rely on implicit terrain representations or blind proprioceptive feedback, limiting their ability to generalize across varying stair geometries and to anticipate required gait adjustments. This paper proposes an explicit stair geometry conditioning framework for robust humanoid stair climbing. Instead of encoding terrain as high-dimensional latent features, we extract a compact set of interpretable geometric parameters, including step height, step depth, and current yaw angle relative to the robot heading. These explicit stair parameters directly condition a Proximal Policy Optimization (PPO)-based locomotion policy, enabling proactive modulation of swing-foot clearance and stride characteristics according to stair structure. Simulation experiments demonstrate improved generalization across unseen stair heights beyond the training distribution. Real-world experiments on the Unitree G1 humanoid validate reliable indoor and outdoor stair traversal. In challenging outdoor scenarios, the robot successfully ascends 33 consecutive steps without failure, demonstrating robustness and practical deployability.

2605.09942 2026-05-12 cs.AI

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

Dongming Jiang, Yi Li, Guanpeng Li, Qiannan Li, Bingzhe Li

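AI summary: This paper proposes HAGE, a reinforcement-learning-based weighted multi-relational memory framework for memory retrieval in agentic large language model systems. HAGE recasts memory retrieval as sequential, query-conditioned graph traversal, organizes memory as relation-specific graph views over shared memory nodes, and encodes multiple relational signals in trainable relation feature vectors. A routing network dynamically modulates the dimensions of the edge embeddings, and traversal scores combine semantic similarity with query-conditioned edge representations so that high-utility relational paths are prioritized. Experiments show higher accuracy on long-horizon reasoning tasks and a better balance between accuracy and efficiency.

Abstract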

Memory retrieval in agentic large language model (LLM) systems is often treated as a static lookup problem, relying on flat vector search or fixed binary relational graphs. However, fixed graph structures cannot capture the varying strength, confidence, and query-dependent relevance of relationships between events. In this paper, we propose HAGE, a weighted multi-relational memory framework that reconceptualizes retrieval as sequential, query-conditioned traversal over a unified relational memory graph. Memory is organized as relation-specific graph views over shared memory nodes, where each edge is associated with a trainable relation feature vector encoding multiple relational signals. Given a query, an LLM-based classifier identifies the relational intent, and a routing network dynamically modulates the corresponding dimensions of the edge embedding. Traversal scores are computed via a learned combination of semantic similarity and these query-conditioned edge representations. This allows memory traversal to prioritize high-utility relational paths while softly suppressing noisy or weakly relevant connections. Beyond adaptive traversal, HAGE further introduces a reinforcement learning-based training framework that jointly optimizes routing behavior and edge representations using downstream tasks. Finally, empirical results demonstrate improved long-horizon reasoning accuracy and a favorable accuracy-efficiency trade-off compared to state-of-the-art agentic memory systems. Our code is available at https://github.com/FredJiang0324/HAGE_MVPReview.

2605.09939 2026-05-12 cs.RO

Neural Distance-Guided Path Integral Control for Tractor-Trailer Navigation

Peng Wei, Chen Peng, Stavros Vougioukas

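AI summary: This paper studies safe autonomous navigation of tractor-trailer systems in complex agricultural environments. To handle their complex geometry and nonlinear dynamics, it proposes a real-time obstacle avoidance method based on a geometric neural encoder. A neural network quickly and accurately estimates the distance between the tractor-trailer and the LiDAR-perceived environment, enabling dynamic geometric reasoning without a prior map, and the learned distances are incorporated into a Model Predictive Path Integral (MPPI) controller to improve navigation safety and responsiveness in complex environments. Simulation results validate the method's effectiveness at generating dynamically feasible and safe trajectories.

Abstract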

Autonomous and safe navigation of tractor-trailer systems requires accurate, real-time collision avoidance and dynamically feasible control, particularly in cluttered and complex agricultural environments. This is challenging due to their articulated, deformable geometries and nonlinear dynamics. Traditional methods oversimplify vehicle geometry or rely on precomputed distance fields that assume a known map, limiting their applicability in dynamic, partially unknown environments. To address these limitations, we propose a geometric neural encoder that provides fast and accurate distance estimates between the full tractor-trailer body and raw LiDAR perception, enabling real-time, map-free geometric reasoning. These learned distances are integrated into a Model Predictive Path Integral (MPPI) controller, allowing the system to incorporate true articulated geometry directly into its cost evaluation and enabling more responsive navigation in challenging agricultural settings. Simulation results demonstrate that the proposed framework generates dynamically feasible and safe trajectories for navigating tractor-trailer systems in cluttered and complex environments.
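
To make the role of a distance estimate inside an MPPI cost concrete, here is a minimal sketch of one generic MPPI update on a toy 1-D single integrator. It is not the paper's controller: the dynamics, cost terms, `safe_margin`, `collision_weight`, and the obstacle position are all assumptions, and `distance_fn` is a hypothetical stand-in for the learned tractor-trailer distance estimator.

```python
import math
import random
from typing import Callable, List


def mppi_weights(costs: List[float], lam: float) -> List[float]:
    """Normalized importance weights proportional to exp(-cost / lam)."""
    c_min = min(costs)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def rollout_cost(controls: List[float], x0: float, goal: float,
                 distance_fn: Callable[[float], float],
                 safe_margin: float = 0.5, collision_weight: float = 100.0) -> float:
    """Toy rollout (x += u) that penalizes goal error and insufficient clearance."""
    x, cost = x0, 0.0
    for u in controls:
        x += u
        cost += (x - goal) ** 2
        clearance = distance_fn(x)  # stand-in for a learned distance estimate
        if clearance < safe_margin:
            cost += collision_weight * (safe_margin - clearance)
    return cost


def mppi_step(nominal: List[float], noises: List[List[float]], x0: float, goal: float,
              distance_fn: Callable[[float], float], lam: float = 1.0) -> List[float]:
    """One MPPI update: score perturbed control sequences, then add the
    cost-weighted average of the perturbations to the nominal controls."""
    costs = [rollout_cost([u + e for u, e in zip(nominal, eps)], x0, goal, distance_fn)
             for eps in noises]
    weights = mppi_weights(costs, lam)
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises))
            for t, u in enumerate(nominal)]


rng = random.Random(0)
horizon, samples = 5, 64
noises = [[rng.gauss(0.0, 0.3) for _ in range(horizon)] for _ in range(samples)]
# Hypothetical obstacle at x = -1.0; clearance is the distance to that point.
updated = mppi_step([0.0] * horizon, noises, x0=0.0, goal=2.0,
                    distance_fn=lambda x: abs(x + 1.0))
print(updated)
```

Because the clearance penalty enters the rollout cost directly, trajectories that pass close to the obstacle receive exponentially smaller weights in the update.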

2605.09936 2026-05-12 cs.CV cs.IR cs.LG

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini

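AI summary: This paper presents Urban-ImageNet, a large-scale multi-modal dataset and evaluation framework for perceiving urban space from social media images. The dataset contains 2 million public images with paired text from Weibo, covering 61 urban sites in 24 Chinese cities, and supports training and evaluation at scales from 1K to 2M. Built on a hierarchical taxonomy grounded in urban theory, Urban-ImageNet supports three tasks (urban scene semantic classification, cross-modal image-text retrieval, and instance segmentation) and aims to evaluate how well AI models understand the social, functional, and spatial characteristics of urban space.

Abstract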

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities between 2019 and 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.

2605.09934 2026-05-12 cs.CL

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

Bihui Yu, Caijun Jia, Jing Chi, Xiaohan Liu, Yining Wang, He Bai, Yuchen Liu, Jingxuan Wei, Junnan Zhu

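AI summary: TRACER is a verifiable generative provenance framework for multimodal tool-using agents. It targets the "provenance gap" in current tool use, in which generated claims lack explicit dependencies on their supporting evidence. Alongside each answer, TRACER generates a structured provenance record that identifies the supporting tool call, evidence unit, and semantic relation, verifies the record along several dimensions to ensure its reliability, and then uses it for traceability constraints and local credit assignment in reinforcement learning. Experiments show that TRACER performs strongly on the TRACE-Bench benchmark and significantly outperforms existing methods, indicating that reliable multimodal tool reasoning depends on provenance-aware use of observations rather than simply on more tool calls.

Abstract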

Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.

2605.09932 2026-05-12 cs.CL

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

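AI summary: Current large language models still struggle to make effective use of information in long contexts. This paper proposes FocuSFT, a fine-tuning method based on bilevel optimization that optimizes attention allocation during training, reducing the way positional bias and attention sinks starve content-relevant tokens of attention. The inner loop introduces lightweight fast-weight parameters that form a parametric memory and steer the model toward semantically relevant content, and the outer loop performs supervised fine-tuning conditioned on this, improving performance on long-context tasks. Experiments show significant gains on several benchmarks.

Abstract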

Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K-32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
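
The masking pattern mentioned in this abstract (bidirectional attention over context tokens, causal masking for responses) can be sketched as a boolean mask. This is one plausible reading, not the authors' implementation: it assumes the context tokens come first and the response tokens follow, and the function name is hypothetical.

```python
from typing import List


def build_attention_mask(n_context: int, n_response: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed layout: positions [0, n_context) are context tokens and the
    remaining positions are response tokens.
    """
    n = n_context + n_response
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_context:
                # Context tokens see every context token (bidirectional),
                # but never look ahead into the response.
                mask[i][j] = j < n_context
            else:
                # Response tokens see all context plus earlier response tokens (causal).
                mask[i][j] = j <= i
    return mask


for row in build_attention_mask(n_context=2, n_response=2):
    print(["x" if allowed else "." for allowed in row])
```

With two context and two response tokens, the first two rows allow both context columns, while the last two rows form the usual lower-triangular causal pattern.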

2605.09931 2026-05-12 cs.CL cs.AI

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

Luan Zhang, Dandan Song, Zhijing Wu, Zhengyu Chen, Chen Zhang, Yuhang Tian, Huipeng Ma, Chenhao Li, Changzhi Zhou, Xudong Li, Shuhao Zhang

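AI summary: PruneTIR is an inference-time method for improving the effectiveness and efficiency of tool-integrated reasoning (TIR) in large language models that can already use tools. By pruning erroneous tool-call trajectories, resampling tool calls, and suspending tool use when necessary, it reduces the negative impact of erroneous calls on reasoning and keeps the model from getting stuck in loops of repeated failure. Experiments show that PruneTIR significantly improves reasoning accuracy and efficiency while shortening the context length needed for inference.

Abstract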

Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how to further boost the reasoning ability of already tool-capable LLMs at inference time remains underexplored. Improving reasoning at inference time requires no additional training and can help LLMs better leverage tools to solve problems. We observe that, during tool-capable LLM inference, both the number and the proportion of erroneous tool calls are negatively correlated with answer correctness. Moreover, erroneous tool calls are typically resolved successfully within a few subsequent turns. If not, LLMs often struggle to resolve such errors even with many additional turns. Building on these observations, we propose PruneTIR, an effective yet efficient framework that enhances tool-integrated reasoning at inference time. During LLM inference, PruneTIR prunes trajectories, resamples tool calls, and suspends tool usage through three components: Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension. These three components enable PruneTIR to mitigate the negative impact of erroneous tool calls and prevent LLMs from getting stuck in repeated failed resolution attempts, thereby improving overall LLM performance. Extensive experimental results demonstrate the effectiveness of PruneTIR, which significantly improves Pass@1 and efficiency while reducing the working context length for tool-capable LLMs.

2605.09929 2026-05-12 cs.LG cs.SE

TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications

Pranshav Gajjar, Emmanuel Ojo, Vijay K Shah

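AI summary: This paper introduces TeleResilienceBench, which evaluates how well large language models in telecommunications recover from incomplete or flawed reasoning, a capability termed "reasoning resilience". The benchmark collects failure cases from a weak generator model, truncates the flawed reasoning, and asks the target model to continue and correct it, thereby quantifying recovery. Even the strongest model recovers in only 29.1% of cases, and larger models are not always more resilient; Nemotron-3-nano 4b offers the best resilience-to-cost ratio. The study also finds that difficulty labels in current telecom benchmarks reflect knowledge coverage more than reasoning depth.

Abstract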

Deploying large language models in telecommunications requires more than task accuracy. In realistic workflows, a model may inherit partially completed reasoning from a prior step, an upstream agent, or its own earlier generation, and must continue that reasoning even when it is already going wrong. We introduce TeleResilienceBench, a benchmark that quantifies this capability, which we term reasoning resilience, across seven telecom sub-domains drawn from the GSMA Open-Telco LLM suite. Instances are constructed by collecting failures from a weak generator model, truncating the flawed reasoning trace at its midpoint, and asking a target model to continue and correct it. We propose the Correct Flip Rate (CFR) as a direct measure of successful recovery and evaluate eight models spanning the Qwen3.5, Gemma4, and Nemotron-3 families. Our results show that even the strongest model achieves a macro-average CFR of only 29.1%, and scale does not reliably improve resilience within families. Nemotron-3-nano 4b outperforms all Qwen3.5 variants including the 27b model and leads the auxiliary TeleMath numerical evaluation at 23.4% CR%, offering the best resilience-to-cost ratio in the set. A difficulty-stratified analysis further reveals that existing telecom benchmark difficulty labels reflect factual specificity rather than reasoning depth, suggesting that current evaluations measure knowledge coverage more than reasoning ability.
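
The abstract does not spell out the formula for the Correct Flip Rate, so the sketch below rests on an assumed reading: every benchmark instance starts from a failed reasoning trace, CFR is the fraction of instances where the target model's continuation ends in the correct answer, and the macro-average is the unweighted mean of per-sub-domain CFRs. The function names, domain labels, and data are hypothetical.

```python
from typing import Dict, List


def correct_flip_rate(recovered: List[bool]) -> float:
    """Fraction of initially failed instances that the target model corrected.

    `recovered` must be non-empty; each entry marks one instance.
    """
    return sum(recovered) / len(recovered)


def macro_average_cfr(by_domain: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-sub-domain CFR values."""
    rates = [correct_flip_rate(flags) for flags in by_domain.values()]
    return sum(rates) / len(rates)


# Hypothetical outcomes for two sub-domains.
outcomes = {
    "domain_a": [True, False, False, False],  # CFR 0.25
    "domain_b": [True, True, False, False],   # CFR 0.50
}
print(macro_average_cfr(outcomes))  # prints 0.375
```

Macro-averaging gives each sub-domain equal weight regardless of how many instances it contains, which matters when failure counts differ across sub-domains.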

2605.09925 2026-05-12 cs.CV

Frequency Adapter with SAM for Generalized Medical Image Segmentation

Phuoc-Nguyen Bui, Van-Nguyen Pham, Duc-Tai Le, Junghyun Bum, Hyunseung Choo

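AI summary: Medical image segmentation matters for computer-aided diagnosis and treatment planning, but deep learning models often generalize poorly across datasets because of differences in imaging protocols, scanners, and patient populations. This paper proposes FSAM, a generalized medical image segmentation method based on frequency-domain adaptation that combines Low-Rank Adaptation (LoRA) with a frequency adapter module to extract domain-invariant high-frequency features and improve generalization from a single source domain. Experiments show that it outperforms traditional domain generalization methods and SAM-based domain generalization methods on fundus and prostate datasets.

Comments Under review, 10 pages, 1 figure, 2 tables

Abstract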

Medical image segmentation is a critical task in computer-aided diagnosis and treatment planning. However, deep learning models often struggle to generalize across datasets due to domain shifts arising from variations in imaging protocols, scanner types, and patient populations. Traditional domain generalization (DG) methods utilize causal feature learning, adversarial consistency, and style augmentation to improve segmentation robustness. While effective, these approaches rely on explicit feature alignment, adversarial objectives, or handcrafted augmentations, which may not fully exploit the capabilities of foundation models. Recently, the Segment Anything Model (SAM) has demonstrated strong generalization capabilities in segmentation tasks. SAM-based DG methods attempt to improve medical image segmentation. However, these approaches primarily operate in the spatial domain and overlook frequency-based discrepancies that significantly affect model robustness. In this work, we propose Frequency-based Domain Generalization with SAM (FSAM), a novel framework that integrates Low-Rank Adaptation (LoRA) for efficient fine-tuning and a frequency adapter to incorporate frequency-domain representations for single-source domain generalization. FSAM enhances SAM's segmentation robustness by extracting domain-invariant high-frequency features, mitigating frequency-related domain shifts. Experimental results on fundus and prostate datasets demonstrate that FSAM outperforms existing traditional DG and SAM-based DG approaches in domain generalization. Codes and pre-trained models will be made available on GitHub.

2605.09924 2026-05-12 cs.CL

Evolving Knowledge Distillation for Lightweight Neural Machine Translation

Xuewen Zhang, Haixiao Zhang, Xinlong Huang

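AI summary: This paper proposes Evolving Knowledge Distillation (EKD), a progressive knowledge distillation framework that addresses the difficulty of deploying large neural machine translation models on resource-limited devices. By having the student model learn step by step from a series of teachers of gradually increasing capacity, EKD narrows the performance gap between teacher and student. Experiments show significant improvements on several benchmark datasets, with the final student performing very close to the large teacher.

Abstract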

Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on resource-limited devices. Knowledge distillation (KD) is a promising approach for compressing models, but its effectiveness diminishes when there is a large capacity gap between teacher and student models. To address this issue, we propose Evolving Knowledge Distillation (EKD), a progressive training framework in which the student model learns from a sequence of teachers with gradually increasing capacities. Experiments on IWSLT-14, WMT-17, and WMT-23 benchmarks show that EKD leads to consistent improvements at each stage. On IWSLT-14, the final student achieves a BLEU score of 34.24, narrowing the gap to the strongest teacher (34.32 BLEU) to just 0.08 BLEU. Similar trends are observed on other datasets. These results demonstrate that EKD effectively bridges the capacity gap, enabling compact models to achieve performance close to that of much larger teacher models. Code and models are available at https://github.com/agi-content-generation/EKD.

2605.09922 2026-05-12 cs.CL cs.AI

Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

Wu Li, Yigeng Zhou, Zesheng Shi, Yequan Wang, Min Zhang, Jing Li

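AI summary: This paper proposes TPAW, a team-based self-play algorithm for improving the alignment of large language models in a fully self-supervised setting. The current policy model collaborates with and competes against historical checkpoints, which improves training stability and efficiency, and two adaptive weighting mechanisms adjust the importance of target responses and the contribution of each team member during training. Experiments show that TPAW outperforms existing methods across a range of base models and LLM benchmarks.

Comments Accepted by ACL 2026 Main

Abstract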

While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member's contribution during training. Initialized from an SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at https://github.com/lab-klc/TPAW.

2605.09920 2026-05-12 cs.LG cs.AI

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

Xuexiang Wen, Hang Yu, Linchao Zhu, Gaoang Wang

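AI summary: This paper proposes VIGOR, a verifier-free reinforcement learning method for post-training large language models. It uses the gradient norm that the policy model computes on its own generated text as an intrinsic reward signal, steering the model toward outputs that align better with the current policy. By correcting the length bias of the gradient and applying group-wise rank shaping, VIGOR makes the reward signal more stable and effective, and it outperforms existing methods on mathematical reasoning and code generation tasks.

Comments Accepted to Findings of ACL 2026

Abstract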

While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependency on the gold label or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller $\ell_2$ norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, lower gradient norms suggest the completion aligns better with the current policy, serving as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level gradients with a $\sqrt{T}$ scaling, and apply group-wise rank shaping to stabilize reward scales across prompts. Across mathematical reasoning benchmarks, VIGOR outperforms the state-of-the-art Reinforcement Learning from Internal Feedback (RLIF) baseline, and it also exhibits cross-domain transfer to code benchmarks when trained only on math data. For instance, on Qwen2.5-7B-Base post-trained on MATH, VIGOR improves the average math accuracy by +3.31% and the average code accuracy by +1.91% over this baseline, while exhibiting more stable training dynamics. The code is available at https://github.com/ZJUSCL/VIGOR.
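
The reward shaping described in this abstract can be sketched once the gradient norms are available. The sketch below takes them as given inputs and makes two assumptions that the abstract does not confirm: the length correction multiplies the norm of the token-averaged gradient by sqrt(T), and rank shaping maps within-group ranks linearly onto [-1, 1]. The function name and numbers are hypothetical, and the authors' repository remains the reference for the actual method.

```python
import math
from typing import List


def vigor_style_rewards(grad_norms: List[float], lengths: List[int]) -> List[float]:
    """Within-group rewards: a smaller length-corrected gradient norm gets a higher reward.

    grad_norms[i]: L2 norm of the token-averaged NLL gradient for completion i (given).
    lengths[i]:    number of tokens T in completion i.
    """
    # Assumed correction: averaged gradients shrink with length, so rescale by sqrt(T).
    scores = [g * math.sqrt(t) for g, t in zip(grad_norms, lengths)]
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # smallest score first
    n = len(scores)
    rewards = [0.0] * n
    for rank, i in enumerate(order):
        # Assumed rank shaping: best rank maps to +1, worst rank to -1.
        rewards[i] = 1.0 - 2.0 * rank / (n - 1) if n > 1 else 0.0
    return rewards


# One prompt, three sampled completions of very different lengths.
print(vigor_style_rewards(grad_norms=[0.5, 0.2, 0.4], lengths=[4, 100, 9]))
# prints [1.0, -1.0, 0.0]
```

In this toy group the 100-token completion has the smallest raw norm but the largest corrected score, which illustrates why an uncorrected reward would favor long outputs. Using ranks instead of raw scores keeps the reward scale comparable across prompts.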

2605.09918 2026-05-12 cs.LG cs.AI cs.CY

NaiAD: Initiate Data-Driven Research for LLM Advertising

Yihang Zhang, Zimeng Huang, Ren Zhai, Yipeng Kang, Tonghan Wang

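AI summary: This paper presents NaiAD, the first comprehensive dataset designed for large language model (LLM) advertising, containing 58,999 carefully constructed ad-embedded responses with their corresponding user queries. The dataset is built around theoretically grounded evaluation metrics that separately and comprehensively capture user and commercial utility, and a decoupled generation pipeline mitigates the dimensional collinearity of aligned LLMs to produce structurally diverse samples. The work also introduces a scoring framework based on variance-calibrated prediction-powered inference that aligns automated scores with human annotations, and it finds that successful ad integration relies on four semantic strategies, laying groundwork for future LLM-native advertising systems.

Comments 37 pages, 11 figures

Abstract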

Reconciling platform revenue with user experience in LLM advertising motivates a data-centric foundation. We introduce NaiAD, the first comprehensive dataset for LLM-native advertising comprising 58,999 carefully constructed ad-embedded responses paired with user queries. NaiAD is organized around theoretically grounded evaluation metrics that separately and comprehensively capture user and commercial utility. To mitigate the dimensional collinearity of aligned LLMs, we propose a decoupled generation pipeline that produces structurally diverse samples, ranging from responses that explicitly disentangle stakeholder utilities to responses that are uniformly strong or weak across dimensions. We further provide score labels calibrated by a Variance-Calibrated Prediction-Powered Inference (VC-PPI) framework, aligning automated scoring with human annotations. Mechanistic analyses reveal that successful ad integration relies on reasoning paths that cluster into four distinct semantic strategies. Models leveraging NaiAD internalize these strategies to simultaneously improve user and commercial utility, while enabling independent control over these distinct objectives via in-context learning. Together, these results position NaiAD as a foundational infrastructure for developing future LLM-native ad systems.

2605.09915 2026-05-12 cs.CL cs.AI cs.CY

Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

Rong Shan, Te Gao, Hang Zheng, Yunjia Xi, Jiachen Zhu, Zeyu Zheng, Yong Yu, Weinan Zhang, Jianghao Lin

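AI summary: This paper argues that top AI conferences, in keeping acceptance rates relatively stable, may face a new threat of "denominator gaming" driven by fully automated scientific agents. A malicious actor could deploy AI agents to submit large numbers of superficially plausible but low-quality papers, diluting reviewing resources and raising the acceptance probability of specific high-quality papers. The paper analyzes the feasibility and impact of the threat and argues that addressing it requires systemic policy and incentive reform rather than technical detection alone.

Comments Accepted by ICML'26 Position Track

Abstract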

The implicit policy of maintaining relatively stable acceptance rates at top AI conferences, despite exponentially growing submissions, introduces a critical structural vulnerability. This position paper characterizes a new systemic threat we term Agentic Denominator Gaming, in which a malicious actor deploys AI agents to generate and submit a large volume of superficially plausible but low-quality papers. Crucially, their objective is not the acceptance of low-quality papers, but rather to inflate the submission denominator and overwhelm reviewing capacity. Under a relatively stable acceptance rate, this dilution can systematically increase the publication probability of a small, targeted set of legitimate papers. We analyze the practical feasibility of this threat and its broader consequences, including intensified reviewer burnout, degraded review quality, and the emergence of industrialized automated agent mills. Finally, we propose and evaluate a range of mitigation strategies, and argue that durable protection will require system-level policy and incentive reforms, rather than relying primarily on technical detection alone.
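
The dilution arithmetic behind this argument can be shown with a deliberately simplified model. It assumes the conference accepts a fixed fraction of all submissions, that none of the filler papers are accepted, and that the freed slots are spread evenly over legitimate papers. The paper's own analysis concerns a small targeted set and may differ; the function name and numbers are hypothetical.

```python
def effective_acceptance_rate(legit: int, filler: int, target_rate: float) -> float:
    """Acceptance rate experienced by legitimate papers under a fixed overall rate.

    Simplifying assumptions: slots = target_rate * total submissions, and every
    slot goes to a legitimate paper because all filler papers are rejected.
    """
    slots = target_rate * (legit + filler)
    return min(1.0, slots / legit)


print(effective_acceptance_rate(legit=10_000, filler=0, target_rate=0.25))      # prints 0.25
print(effective_acceptance_rate(legit=10_000, filler=5_000, target_rate=0.25))  # prints 0.375
```

Under these assumptions, 5,000 filler submissions raise the number of accepted slots from 2,500 to 3,750 without adding any competitive papers, so the rate seen by legitimate submissions rises from 25% to 37.5%.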

2605.09908 2026-05-12 cs.LG cs.AI cs.SD

Voice Biomarkers for Depression and Anxiety

Oleksii Abramenko, Noah D. Stein, Colin Vaz

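AI summary: This paper studies how to detect depression and anxiety from speech and proposes a deep learning approach that models raw speech signals directly, avoiding the reliance of traditional methods on hand-engineered features. The model is trained on a large-scale dataset of about 65,000 utterances from 23,000 subjects representative of the United States population. It extracts content-agnostic biomarker information that, combined with lexical features from the speech, improves predictive performance in production use. On about 5,000 independent test subjects the model reaches 71% sensitivity and specificity, and it has been released openly to support further research.

Abstract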

Current approaches to detecting depression and anxiety from speech primarily rely on machine learning techniques that utilize hand-engineered paralinguistic features and related acoustic descriptors derived from time- and frequency-domain representations of speech signals. Applying deep learning methods directly to raw speech signals has the potential to produce biomarker representations with substantially greater predictive power. However, these approaches typically require large volumes of carefully annotated data to learn robust and clinically meaningful representations of the underlying biomarkers. In this paper, we describe our efforts toward developing a deep learning model trained on a large-scale proprietary dataset comprising ~65,000 utterances collected from more than 23,000 subjects representative of relevant United States demographics. We present the techniques employed and analyze their impact on model performance. Our results demonstrate that the proposed models can extract content-agnostic biomarker information, which, when combined with lexical features extracted from audio, yields improved predictive performance in production settings. Our models are evaluated on ~5000 unique subjects and achieve performance of 71% in terms of sensitivity and specificity. To foster further research in mental health assessment from speech, we release the best-performing model described in this paper on HuggingFace.
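
For readers less familiar with the two metrics reported in this abstract, the sketch below computes sensitivity and specificity from binary labels using their standard definitions. The labels are toy data, not results from the paper, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = positive, 0 = negative).

    Both classes must be present in y_true, otherwise a denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of true positives that were detected
    specificity = tn / (tn + fp)  # share of true negatives that were cleared
    return sensitivity, specificity


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Reporting both numbers guards against a classifier that looks accurate only because one class dominates the evaluation set.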

2605.09906 2026-05-12 cs.AI cs.SD

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang

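AI summary: This work addresses cross-modal interference during reasoning in audio-visual large language models and proposes a reasoning framework called Separate First, Fuse Later (SFFL). The method enforces modality-specific reasoning, generating separate audio and visual reasoning traces and integrating the information at a later stage to answer, which reduces interference between modalities. Experiments show significant improvements in accuracy and robustness on several benchmarks.

Abstract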

Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.

2605.09905 2026-05-12 cs.LG cs.AI

Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging

Guisong Liu, Xin Gao, Martin Dresler, Jiansong Zhang, Pengfei Wei

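AI summary: This paper re-examines the role of randomly initialized Transformers in sleep staging and points to an overlooked property of sleep signals: strong local temporal continuity. An untrained random Transformer already improves sleep staging performance substantially and outperforms traditional smoothing methods. By introducing a Random Attention Prior Kernel (RAPK), the paper shows that random self-attention adaptively balances global averaging and content similarity while preserving stage transitions, indicating that the gains come mainly from the architecture's inductive bias rather than from parameter learning. The finding suggests a route to efficient sleep monitoring systems suited to edge devices.

Abstract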

Automatic sleep staging commonly adopts Transformers under the assumption that they learn complex long-range dependencies. We challenge this view by revealing a neglected property of sleep sequences: strong local temporal continuity. We show that a randomly initialized Transformer, without any training, substantially improves sleep staging performance and consistently outperforms heuristic smoothing. We formalize this effect via a Random Attention Prior Kernel (RAPK), showing that random self-attention acts as an adaptive smoother by balancing global averaging and content-based similarity while preserving stage transitions. Using two metrics, the Local Smoothness Influence Index (LSII) and the Weighted Transition Entropy (WTE), we provide evidence that most performance gains in Transformer-based sleep staging arise from architectural inductive bias rather than parameter learning. Our results suggest that sleep staging can be effectively addressed with structure-driven smoothing mechanisms rather than complex dependency modeling, enabling more efficient and edge-deployable healthcare systems for large-scale physiological monitoring.
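To make the "random self-attention acts as an adaptive smoother" claim concrete, here is a minimal sketch, not the paper's RAPK formulation. It applies a single attention head with untrained Gaussian query and key matrices to a short sequence of per-epoch feature vectors. The function names, the toy `epochs` data, and the omission of value/output projections and positional encodings are all assumptions made for illustration.

```python
import math
import random


def random_matrix(rows, cols, scale, rng):
    """Draw a rows x cols matrix with i.i.d. Gaussian entries (never trained)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def project(X, W):
    """Multiply an n x d list-of-rows X by a d x k matrix W."""
    return [[sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
            for x in X]


def random_attention_smooth(X, d_k=4, seed=0):
    """Apply one head of randomly initialised self-attention to a sequence.

    X holds one feature vector per sleep epoch. Every output row is a convex
    combination of all input rows, which is what makes the layer a smoother.
    """
    rng = random.Random(seed)
    d = len(X[0])
    W_q = random_matrix(d, d_k, 1.0 / math.sqrt(d), rng)
    W_k = random_matrix(d, d_k, 1.0 / math.sqrt(d), rng)
    Q, K = project(X, W_q), project(X, W_k)

    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])

    smoothed = [[sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)]
                for w in weights]
    return smoothed, weights


# Hypothetical per-epoch stage probabilities with one isolated outlier epoch.
epochs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.85, 0.15], [0.9, 0.1]]
smoothed, weights = random_attention_smooth(epochs)
```

Because each softmax row is non-negative and sums to one, the output stays inside the range spanned by the inputs. How strongly the weights favour similar epochs over a plain global average depends on the scale of the random projections.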

2605.09902 2026-05-12 cs.CV

Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment

Haobo Wang, Xiaorong Ma, Weiqi Luo, Xiaojun Jia, Jiwu Huang

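AI summary This study addresses the security of multimodal large language models (MLLMs) and proposes PRAF-Attack, a new targeted transfer attack that uses adversarial examples to mislead a model's judgment of image content. The method introduces progressive resolution processing and an adaptive feature alignment strategy, uses intermediate-layer features to strengthen the transferability and robustness of the attack, and selects transferable hierarchical features via gradient consistency, markedly improving attack effectiveness. Experiments show that PRAF-Attack transfers better than existing methods across a range of black-box MLLMs.

English abstract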

Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.

2605.09900 2026-05-12 cs.AI cs.CL cs.CV

The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

Hao Liu, Jicheng Liu

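AI summary This paper introduces KnotBench, a new benchmark for evaluating how well vision-language models handle knot-diagram tasks. Using a large corpus of knot images with corresponding canonical signatures, it defines 14 tasks covering equivalence judgment, move prediction, identification, and cross-modal grounding, exposing the gap between perception and operation in current models. Experiments show that even state-of-the-art models such as Claude Opus 4.7 and GPT-5 perform close to random without thinking mode; thinking mode helps, but the models still struggle to simulate knot moves accurately.

Comments 41 pages, 18 figures

English abstract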

A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families, equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.

2605.09899 2026-05-12 cs.CV cs.AI

Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

Kanglin Ning, Wenrui Li, Houde Quan, Qifan Li, Xingtao Wang, Xiaopeng Fan

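AI summary This paper proposes HGC-Det, a cross-modal knowledge distillation method constrained by hyperbolic geometry, to improve multimodal 3D object detection. The method extracts semantic features through an image branch and a point cloud branch and introduces three core components: semantic-guided voxel optimization, hyperbolic-geometry-constrained cross-modal feature transfer, and feature-aggregation-based geometry optimization, which together mitigate modality heterogeneity, spatial misalignment, and the representation crisis. Experiments show a good trade-off between detection accuracy and computational cost on both indoor and outdoor datasets.

Comments The current version has been submitted to IEEE Transactions on Multimedia; the manuscript is under review.

English abstract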

Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, the modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficiency of these cross-modal distillation methods. To address these limitations in existing approaches, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO component compensates for potential spatial feature degradation introduced by the 2D semantic-guided voxel optimization component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.

2605.09893 2026-05-12 cs.CL cs.AI

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

Sushrita Rakshit, Hanwen Zhang, Hua Shen

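AI summary This study examines the "value-action gap" in large language models, the inconsistency between the values a model states and how it actually behaves. It identifies a new failure mode, "Pseudo-Deliberation", in which a model displays seemingly sound reasoning while its behavior remains unaligned with its values. The authors build the VALDI framework to systematically evaluate how well models adhere to values in dialogue generation, and find marked inconsistency between values and behavior in both proprietary and open-source models. They also propose VIVALDI, a multi-agent auditing system that intervenes during generation to improve alignment.

Comments 9 pages

English abstract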

Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed the "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call "Pseudo-Deliberation": the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.

2605.09887 2026-05-12 cs.LG cs.AI math.DG

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Eslam Zaher, Maciej Trzaskowski, Quan Nguyen, Fred Roosta

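AI summary This study investigates the geometric reasons why the reconstruction error of sparse autoencoders (SAEs) varies across network layers, arguing that differences in the curvature and intrinsic dimension of activation space produce behavior that existing single-layer scaling laws cannot explain. By analyzing the geometric features of layers across multiple models, it finds that the width-sparsity scaling of SAEs depends on each layer's manifold structure and proposes a geometric scaling law that transfers across models. Experiments show that manifold geometry determines each layer's width exponent and that higher curvature and intrinsic dimension correspond to a higher reconstruction-error floor, indicating that SAEs face a "geometric wall" set by manifold structure rather than a resource-limited ceiling.

English abstract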

Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE's width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model's per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.

2605.09886 2026-05-12 cs.RO

Network-Efficient World Model Token Streaming

Shatadal Mishra, Ahmadreza Moradipari, Nejib Ammar

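AI summary This study explores how to efficiently transmit and synchronize discrete world model state representations across distributed compute and connected vehicles. It proposes a network-efficient streaming method built on a VQ-U-Net tokenizer, together with a label-free, fully online algorithm that prioritizes changed parts of the state by cosine distance and adaptively triggers keyframes to cope with bandwidth limits and packet loss. Experiments show that, at matched bitrates, the method significantly reduces state-embedding distortion and improves downstream prediction performance, supporting its practical value in vehicular networks.

Comments Accepted at IEEE VNC 2026

English abstract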

Generative driving world models rely on compact latent state representations that must be efficiently transmitted and synchronized across distributed compute and connected vehicles. We study network-efficient streaming of a discrete world model state, where a stride-16 VQ-U-Net tokenizer (codebook size 8,192) maps each 288x512 frame to an 18x32 grid of token IDs (576 tokens/frame), equivalent to 936 bytes/frame under fixed-length coding. We consider a keyframe-delta protocol under strict per-message payload budgets and packet loss, and propose a fully online, label-free algorithm that prioritizes delta updates via cosine distance in codebook embedding space and triggers keyframes adaptively using a Hamming-drift threshold. The adaptive algorithm consistently improves the rate-distortion frontier over periodic keyframes at matched bitrates: at 0.024 Mb/s (200-byte budget) dynamic-only embedding distortion drops from 0.0712 to 0.0661 (7.2%), and at 0.036 Mb/s (400-byte budget) from 0.0427 to 0.0407 (4.8%). Under 10% delta packet loss at 200 bytes, dynamic-only distortion is 0.0757 versus 0.0789 for a matched periodic baseline. To connect state fidelity to world model usefulness, we train a lightweight next-token predictor and evaluate perplexity conditioned on streamed receiver states: at 0.024 Mb/s, dynamic-position perplexity improves from 206.0 to 193.1 (6.3%), and at 0.036 Mb/s from 158.9 to 155.6 (2.1%). These results support discrete token-state streaming as a practical systems layer for bandwidth-aware synchronization and improved downstream token-dynamics utility under vehicular networking constraints.
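The frame-size figures in the abstract follow from simple arithmetic, and the keyframe-delta idea can be sketched in a few lines. The code below is an illustrative sketch only. The helper names, the `max_deltas` budget expressed in updates rather than bytes, and the exact form of the drift test are assumptions, not the authors' protocol.

```python
import math


def tokens_per_frame(height, width, stride):
    """Number of token IDs produced by a tokenizer with the given spatial stride."""
    return (height // stride) * (width // stride)


def fixed_length_bytes(num_tokens, codebook_size):
    """Bytes needed when every token ID uses the same number of bits."""
    bits_per_token = (codebook_size - 1).bit_length()
    return math.ceil(num_tokens * bits_per_token / 8)


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def plan_update(receiver_tokens, current_tokens, codebook, max_deltas, drift_threshold):
    """Choose between a full keyframe and a prioritised list of delta updates.

    Hamming drift is taken here as the fraction of positions where the
    receiver's token differs from the sender's current token (an assumption).
    """
    changed = [i for i, (r, c) in enumerate(zip(receiver_tokens, current_tokens))
               if r != c]
    drift = len(changed) / len(current_tokens)
    if drift > drift_threshold:
        return ("keyframe", list(current_tokens))
    # Send the changes whose codebook embeddings moved the most first.
    changed.sort(
        key=lambda i: cosine_distance(codebook[receiver_tokens[i]],
                                      codebook[current_tokens[i]]),
        reverse=True,
    )
    return ("delta", [(i, current_tokens[i]) for i in changed[:max_deltas]])


frame_tokens = tokens_per_frame(288, 512, 16)          # 18 x 32 grid
frame_bytes = fixed_length_bytes(frame_tokens, 8192)   # 13 bits per token
```

With the abstract's numbers, 576 tokens at 13 bits each give 7,488 bits, or 936 bytes per frame. A 200-byte payload therefore cannot carry a full keyframe in one message, which is why prioritising delta updates matters.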

2605.09879 2026-05-12 cs.AI

M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

Junjian Wang, Xin Zhou, Qiran Xu, Kun Zhan

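AI summary This study proposes M2A, which aims to combine mathematical reasoning and agentic reasoning effectively in large language models, addressing the difficulty of getting the two to cooperate under multi-task learning. M2A merges models in parameter space, injecting mathematical reasoning capability only along a subspace that does not affect agent behavior, so that reasoning depth increases without disturbing existing behavior. Experiments show that M2A significantly improves reasoning in a real-world coding-agent task, for example raising the SWE-Bench Verified resolved rate of a Qwen3-8B model from 44.0% to 51.2%.

English abstract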

While reasoning has become a central capability of large language models (LLMs), the reasoning patterns required for different scenarios are often misaligned. Mathematical reasoning typically relies on intrinsic logic to solve closed-world problems in a single response, whereas agentic reasoning requires not only internal reasoning but also multi-turn interaction with external environments, interleaving thought and action. This misalignment prevents mathematical and agentic reasoning from effectively benefiting from each other, often yielding unstable reasoning behavior and only limited performance gains under multi-task learning. In this paper, we propose M2A, a novel paradigm that synergizes mathematical and agentic reasoning via model merging. To avoid overfitting to superficial reasoning patterns under joint training, M2A operates directly in parameter space: it identifies the feature subspace critical for agent behavior, and merges the mathematical reasoning task vector only along its null space, thereby injecting reasoning capability along directions that do not perturb agent behavior. Unlike SFT or RL, M2A requires no additional gradient updates and exposes the merging coefficient as a simple knob for controlling reasoning length. Experiments in a challenging real-world coding agent setting show that our method effectively extends agentic reasoning depth and delivers substantial performance improvements. Applied to a fine-tuned Qwen3-8B, M2A improves its SWE-Bench Verified resolved rate from 44.0% to 51.2% without retraining the model. Code is available at https://github.com/laplucky/M2A.git.
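The null-space merging idea can be illustrated on flat parameter vectors. This is a simplified sketch under stated assumptions. It treats the agent-critical subspace as the span of a few given direction vectors and removes that component from the task vector before adding it. How M2A actually identifies the subspace and applies the projection per layer is not specified in the abstract, and all names below are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(vec, basis):
    """Remove the component of `vec` that lies in the span of `basis`."""
    w = list(vec)
    for b in basis:
        coeff = dot(w, b)
        w = [wi - coeff * bi for wi, bi in zip(w, b)]
    return w


def merge_in_null_space(agent_weights, math_task_vector, agent_directions, alpha):
    """Add alpha times the part of the task vector orthogonal to the agent-critical directions."""
    basis = orthonormalize(agent_directions)
    safe_update = project_out(math_task_vector, basis)
    return [w + alpha * s for w, s in zip(agent_weights, safe_update)]


# Toy example: the first coordinate is assumed to be agent-critical.
merged = merge_in_null_space(
    agent_weights=[1.0, 1.0, 1.0],
    math_task_vector=[2.0, 3.0, 4.0],
    agent_directions=[[1.0, 0.0, 0.0]],
    alpha=0.5,
)
```

In the toy example the merged weights leave the first coordinate untouched, and `alpha` plays the role of the merging coefficient that the abstract describes as a knob.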

2605.09875 2026-05-12 cs.AI

Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

Su-Hyeon Kim, Yo-Sub Han

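AI summary Large language models from different families use different hidden dimensions, tokenizers, and training procedures, which makes behavioral directions hard to compare or transfer across models. This paper proposes an anchor-projection framework that maps each model's hidden representations into a shared anchor coordinate space (ACS), allowing behavioral directions to be extracted and aligned across models. Experiments show good alignment across multiple model families and behavioral axes and stable transfer on downstream tasks, offering a new perspective on cross-family interpretability.

English abstract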

Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
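One way to picture a shared anchor coordinate space is to describe each model's direction by how it scores against that model's own activations on a common set of anchor prompts. The sketch below assumes dot-product scoring, unit normalisation, and tiny hand-made activations. These are illustrative choices, not the paper's definition of ACS, and the reconstruction step into a new model's hidden space is omitted.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def to_anchor_coords(direction, anchor_activations):
    """Describe a hidden-space direction by its score on each anchor activation."""
    return unit([dot(a, direction) for a in anchor_activations])


def canonical_direction(coords_per_model):
    """Average per-model anchor coordinates into one canonical direction."""
    mean = [sum(column) / len(column) for column in zip(*coords_per_model)]
    return unit(mean)


# Hypothetical model A: hidden size 2.
anchors_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
direction_a = [1.0, 2.0]

# Hypothetical model B: hidden size 3, holding the same geometry in a
# permuted and zero-padded basis, so native vectors are not comparable.
anchors_b = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
direction_b = [2.0, 0.0, 1.0]

coords_a = to_anchor_coords(direction_a, anchors_a)
coords_b = to_anchor_coords(direction_b, anchors_b)
alignment = dot(coords_a, coords_b)  # cosine similarity of unit vectors
canonical = canonical_direction([coords_a, coords_b])
```

The two native directions have different lengths and cannot be compared directly, yet both map to the same anchor coordinates because the anchors were embedded with the same transformation as the direction.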

2605.09874 2026-05-12 cs.CV cs.AI cs.CL

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

Ziyang Wang, Yue Zhang, Shoubin Yu, Ce Zhang, Zengqi Zhao, Jaehong Yoon, Hyunji Lee, Gedas Bertasius, Mohit Bansal

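AI summary EgoMemReason is a memory-driven reasoning benchmark for long-horizon egocentric video understanding, designed to evaluate a model's ability to accumulate, recall, and reason over visual information spanning multiple consecutive days. The benchmark introduces three complementary memory types (entity memory, event memory, and behavior memory) to assess recognition of object state changes, activity order, and long-term behavior patterns. Experiments show that the best current model reaches only 39.6% overall accuracy on the benchmark, revealing that long-horizon memory reasoning remains a major challenge.

Comments The first two authors contributed equally. Project website: https://egomemreason.github.io/

English abstract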

Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate 17 methods spanning MLLMs and agentic frameworks on EgoMemReason, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.

2605.09870 2026-05-12 cs.LG cs.AI

Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

Tsuyoshi Okita

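AI summary This paper proposes SVAR-FM, an intervention-based framework for time series causal discovery that treats a physics simulator as a realization of Pearl's do operator and uses the simulator to generate interventional data from which nonlinear causal relationships are learned. It proves that the structural VAR model is identifiable under certain conditions and shows experimentally that the method outperforms traditional observational methods across several scientific domains; in particular, it correctly predicts the reversal of the causal-effect sign when simulator accuracy is insufficient.

Comments 54 pages, 6 figures

English abstract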

We propose SVAR-FM (Structural VAR with Flow Matching), a framework for time series causal discovery that treats a physics-based simulator as a mechanical realization of Pearl's do operator. Clamping a variable inside the simulator physically severs confounding paths, producing interventional data by construction. Conditional Flow Matching then learns the nonlinear interventional conditionals. Theoretically, we prove that the full structural VAR becomes identifiable under a coverage condition on the simulator-clampable variables, and derive an end-to-end error bound that decomposes into Monte Carlo, simulator fidelity, and Flow Matching terms. A sign-flip corollary predicts that when simulator accuracy falls below a threshold, the estimated causal effect reverses sign. Empirically, a benchmark across four scientific domains confirms that SVAR-FM recovers the correct causal sign where observational methods produce sign-reversed estimates due to confounding. A case study in ultrafast laser physics verifies the sign-flip prediction by physically varying the accuracy level of a first-principles quantum solver: the low-accuracy setting reverses the causal sign, while the high-accuracy setting recovers the correct direction (R-squared = 0.983, zero bias).
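The claim that clamping a variable inside a simulator severs confounding paths can be seen in a small toy simulator. The sketch below is a static, linear illustration with made-up coefficients. It is not a structural VAR and involves no Flow Matching, and the function names and sample size are assumptions. The true effect of `x` on `y` is +1, but a hidden confounder `z` drives both variables.

```python
import random


def simulate(rng, clamp_x=None):
    """One draw from a toy simulator with a hidden confounder z.

    Passing clamp_x overrides the natural mechanism for x, which is the
    simulator-level analogue of do(x = clamp_x).
    """
    z = rng.gauss(0.0, 1.0)
    if clamp_x is None:
        x = z + rng.gauss(0.0, 0.5)   # observational regime: x depends on z
    else:
        x = clamp_x                   # interventional regime: link z -> x is cut
    y = 1.0 * x - 3.0 * z + rng.gauss(0.0, 0.5)
    return x, y


def observational_slope(n, seed):
    """Least-squares slope of y on x from purely observational samples."""
    rng = random.Random(seed)
    samples = [simulate(rng) for _ in range(n)]
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    return cov / var


def interventional_effect(n, seed, low=-1.0, high=1.0):
    """Average change in y per unit change in x under do(x = low) vs do(x = high)."""
    rng = random.Random(seed)
    mean_low = sum(simulate(rng, clamp_x=low)[1] for _ in range(n)) / n
    mean_high = sum(simulate(rng, clamp_x=high)[1] for _ in range(n)) / n
    return (mean_high - mean_low) / (high - low)


obs = observational_slope(20000, seed=0)
do_effect = interventional_effect(20000, seed=1)
```

In this toy model the observational slope has expectation 1 - 3/1.25 = -1.4, so the sign is reversed by confounding, while the clamped estimate concentrates around the true value of +1.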

2605.09867 2026-05-12 cs.LG cs.AI

Continuous Latent Contexts Enable Efficient Online Learning in Transformers

Emile Anand, Abdullah Ateyeh, Xinyuan Cao, Max Dabagia

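AI summary This study explores how to make Transformers realize online learning more effectively, proposing continuous latent context tokens to strengthen the model's ability to adapt. It constructs constant-depth Transformers that store algorithmic state as linear combinations and thereby implement foundational online decision-making procedures such as the weighted majority algorithm and Q-learning. Experiments show that a lightweight Transformer with latent contexts outperforms larger, more complex language models on long-sequence online prediction tasks, demonstrating its potential as an effective state representation for implementing online learning algorithms.

Comments 37 pages, 15 figures, 3 tables

English abstract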

Large language models (LLMs) exhibit a strong capacity for in-context learning: Given labeled examples, they can generate good predictions without parameter updates. However, many interactive settings go beyond static prediction to online decision-making, in which effective behavior demands adaptation over long multi-turn horizons in response to feedback, and efficient algorithms in these domains must use compact representations of what they have learned. Recently, continuous transformer architectures with latent chain of thought have shown promise for offline iterative tasks such as directed graph-reachability. Motivated by this, we study whether continuous latent context tokens equip transformers to more effectively realize online learning. We give explicit constructions of constant-depth transformers that implement two foundational online decision-making procedures -- the weighted majority algorithm and $Q$-learning -- by storing their algorithmic state as linear combinations of feature embeddings, using a small number of latent context tokens. We further train a small GPT-2-style transformer with latent contexts using a multi-curriculum objective that does not directly supervise the latent states. On long synthetic online prediction sequences, this model outperforms larger and more complex LLMs, including Qwen-3-14B and DeepSeek-V3. Our results suggest that continuous latent contexts provide a simple and effective persistent state for transformers to implement online learning algorithms.
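For readers unfamiliar with the weighted majority algorithm mentioned above, here is the classical deterministic version in plain Python. It shows the compact state (one weight per expert) that the paper's construction stores in latent context tokens. The transformer construction itself is not reproduced, and the toy experts and the penalty factor `beta = 0.5` are assumptions for illustration.

```python
def weighted_majority(expert_predictions, labels, beta=0.5):
    """Run deterministic weighted majority on binary predictions.

    expert_predictions[t][i] is expert i's 0/1 prediction at round t.
    Returns the learner's mistake count and the final expert weights.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for preds, label in zip(expert_predictions, labels):
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        if guess != label:
            mistakes += 1
        # Multiply the weight of every expert that was wrong this round by beta.
        weights = [w * beta if p != label else w for w, p in zip(weights, preds)]
    return mistakes, weights


# Toy stream: expert 0 is always right, experts 1-3 are always wrong.
labels = [1, 0, 1, 1, 0, 1]
expert_predictions = [[y, 1 - y, 1 - y, 1 - y] for y in labels]
mistakes, weights = weighted_majority(expert_predictions, labels)
```

On this stream the learner errs in the first two rounds while the three wrong experts still outweigh the correct one, then follows the correct expert for the rest of the sequence.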

2605.09864 2026-05-12 cs.CV cs.LG

DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment

Kevin Zhu, William Tang, Raphael Hay Tene, Zesheng Liu, Nhut Le, Maryam Rahnemoonfar

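AI summary This paper proposes DA-SegFormer, a semantic segmentation method for fine-grained disaster assessment that targets the difficulty of recognizing subtle damage in UAV imagery caused by texture degradation and class imbalance. Built on the SegFormer architecture, it introduces a class-aware sampling strategy and combines Online Hard Example Mining with Dice loss to strengthen learning of rare damage features, and it adopts a resolution-preserving inference protocol that retains native texture detail. Experiments show that DA-SegFormer achieves 74.61% mIoU on the RescueNet dataset, clearly outperforming the baseline, with notable gains on key damage classes.

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

English abstract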

Rapid and accurate damage assessment following natural disasters is critical for effective emergency response. However, identifying fine-grained damage levels (e.g., distinguishing minor from major roof damage) in UAV imagery remains challenging due to the degradation of texture cues during resizing and extreme class imbalance. We propose DA-SegFormer, a damage-aware adaptation of the SegFormer architecture optimized for high-resolution disaster imagery. Our method introduces a Class-Aware Sampling strategy to guarantee exposure to rare damage features, and it integrates Online Hard Example Mining (OHEM) with Dice Loss to dynamically focus on underrepresented classes. In addition, we employ a resolution-preserving inference protocol that maintains native texture details. Evaluated on the RescueNet dataset, DA-SegFormer achieves 74.61% mIoU, outperforming the baseline by 2.55%. Notably, our improvements yield double-digit gains in critical damage classes: Minor Damage (+11.7%) and Major Damage (+21.3%).
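The two loss ingredients named in the abstract can be written down in their textbook forms. The sketch below works on flat lists of per-pixel values for a single class. The `keep_ratio`, the smoothing constant, and the function names are assumptions, and how DA-SegFormer weights or combines the two terms is not stated in the abstract.

```python
import math


def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient for one class.

    pred holds predicted probabilities and target holds 0/1 ground truth,
    both flattened over pixels. The loss depends on overlap rather than
    pixel count, which helps with rare classes.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def ohem_cross_entropy(true_class_probs, keep_ratio=0.25):
    """Mean cross-entropy over only the hardest fraction of pixels.

    true_class_probs[i] is the probability the model assigned to the
    correct class at pixel i.
    """
    losses = sorted((-math.log(p) for p in true_class_probs), reverse=True)
    keep = max(1, int(len(losses) * keep_ratio))
    return sum(losses[:keep]) / keep


perfect = soft_dice_loss([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])
disjoint = soft_dice_loss([1.0, 0.0], [0, 1])
hard_only = ohem_cross_entropy([0.9, 0.8, 0.1, 0.95], keep_ratio=0.25)
```

With four pixels and `keep_ratio=0.25`, only the single worst pixel (probability 0.1 for the true class) contributes, so easy background pixels do not dilute the gradient signal from rare damage pixels.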