arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12245 2026-05-13 cs.LG

SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization

Chengzhu Bao, Xianglong Yan, Zhiteng Li, Guangshuo Qin, Guanghua Yu, Yulun Zhang

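AI summary: SOAR is a post-training quantization framework for the NVFP4 format that aims to improve the reconstruction accuracy of large language models under 4-bit microscaling quantization. It introduces Closed-form Joint Scale Optimization (CJSO) and Decoupled Scale Search (DSS): CJSO jointly optimizes global and block-wise scales, and DSS decouples the quantization scale from the dequantization scale, which mitigates the inflexible scale selection and precision loss of conventional methods. Experiments show that SOAR outperforms existing NVFP4 quantization methods on multiple large language models, achieving higher accuracy at the same memory footprint.

Abstract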

NVFP4 has recently emerged as an efficient 4-bit microscaling format for large language models (LLMs), offering superior numerical fidelity with native hardware support. However, existing methods often yield suboptimal performance due to inflexible scale selection and the coupled treatment of quantization and dequantization scales. To address these issues, we propose Scale Optimization for Accurate Reconstruction (SOAR), a novel post-training quantization framework that improves the accuracy of NVFP4 quantization. At its core, SOAR features Closed-form Joint Scale Optimization (CJSO), which jointly optimizes global and block-wise scales via analytical solutions derived from reconstruction error minimization. Furthermore, it incorporates Decoupled Scale Search (DSS). DSS decouples the high-precision quantization scale from its constrained dequantization counterpart, and performs discrete search to mitigate precision loss from scale quantization. Extensive experiments across multiple LLMs show that our method consistently outperforms existing NVFP4 quantization baselines, achieving superior accuracy under the same memory footprint with no additional hardware overhead. The code and models will be available at https://github.com/steven-bao1/SOAR.
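The closed-form idea behind scale optimization can be illustrated on a single block. The sketch below is not the authors' CJSO. It assumes the commonly described NVFP4 element format (FP4 E2M1 values with a per-block scale) and ignores both the low-precision rounding of the stored block scale and the global scale. For fixed 4-bit codes q, the scale that minimizes the squared reconstruction error of a block w is s = <w, q> / <q, q>. All function names are hypothetical.

```python
# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(x):
    """Round x to the nearest signed value on the FP4 grid."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def quantize_block(block, scale):
    """Quantize one block to FP4 codes using the given scale."""
    return [round_to_fp4(v / scale) for v in block]


def recon_error(block, codes, scale):
    """Squared error between the block and its dequantized codes."""
    return sum((v - scale * q) ** 2 for v, q in zip(block, codes))


def closed_form_scale(block, codes):
    """Least-squares optimal scale for fixed codes: <w, q> / <q, q>."""
    denom = sum(q * q for q in codes)
    if denom == 0.0:
        return None  # all-zero codes: every scale gives the same error
    return sum(v * q for v, q in zip(block, codes)) / denom


# Toy block of weights (made-up values for illustration).
block = [0.11, -0.42, 0.95, 0.03, -0.27, 0.61, -0.88, 0.5]
absmax_scale = max(abs(v) for v in block) / 6.0  # map the largest value to 6
codes = quantize_block(block, absmax_scale)
refined_scale = closed_form_scale(block, codes)

err_absmax = recon_error(block, codes, absmax_scale)
err_refined = recon_error(block, codes, refined_scale)
print(f"absmax scale error:  {err_absmax:.6f}")
print(f"refined scale error: {err_refined:.6f}")
```

Because the refined scale minimizes a quadratic in s for the given codes, its error can never exceed that of the absmax scale. A fuller pipeline would also have to account for the constrained precision of the stored scale, which is the problem the abstract attributes to DSS.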

2605.12243 2026-05-13 cs.CL

PreScam: A Benchmark for Predicting Scam Progression from Early Conversations

Weixiang Sun, Shang Ma, Yiyang Li, Tianyi Ma, Zehong Wang, Colby Nelson, Xusheng Xiao, Yanfang Ye

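AI summary: PreScam is a benchmark dataset for predicting scam progression from early conversations, built to study how multi-turn conversational scams such as romance and investment scams evolve. Constructed from user-submitted scam reports, it contains 11,573 conversation instances covering 20 scam categories, structured according to the scam lifecycle and annotated with scammer psychological actions and victim responses. Models are evaluated on two tasks; results show that current models capture some scam cues but still struggle to track risk escalation and cross-turn manipulation.

Abstract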

Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in which scammers gradually manipulate victims using evolving psychological techniques. However, existing research mainly focuses on static scam detection or synthetic scams, leaving open whether language models can understand how real-world scams progress over time. We introduce PreScam, a benchmark for modeling scam progression from early conversations. Built from user-submitted scam reports, PreScam filters and structures 177,989 raw reports into 11,573 conversational scam instances spanning 20 scam categories. Each instance is hierarchically structured according to the scam lifecycle defined by the proposed scam kill chain, and further annotated at the turn level with scammer psychological actions and victim responses. We benchmark models on two tasks: real-time termination prediction, which estimates whether a conversation is approaching the termination stage, and scammer action prediction, which forecasts the scammer's subsequent actions. Results show a clear gap between surface-level fluency and progression modeling: supervised encoders substantially outperform zero-shot LLMs on real-time termination prediction, while next-action prediction remains only moderately successful even for strong LLMs. Taken together, these results show that current models can capture some scam-related cues, yet still struggle to track how risk escalates and how manipulation unfolds across turns.

2605.12242 2026-05-13 cs.CL cs.AI

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

Deepak Kumar, Baban Gain, Asif Ekbal

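AI summary: Automatic speech recognition (ASR) transcripts often contain disfluencies such as fillers, repetitions, and false starts, which hurt readability and downstream applications. This paper proposes an LLM-based multilingual disfluency correction method: a sequence tagger identifies disfluent tokens, and instruction fine-tuning combined with contrastive learning trains the model to remove disfluent content while preserving meaning and grammar. Experiments on Hindi, Bengali, and Marathi show consistent improvements over existing baselines.

Comments Accepted to ACL 2026 (Main)

Abstract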

Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent work has explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi, show consistent improvements over strong baselines, including multilingual sequence-to-sequence models. These results highlight that detection-only strategies are insufficient. Combining token-level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech-driven NLP systems. The code is publicly available at https://github.com/deepak-kumar-98/Mind-the-Pause.

2605.12240 2026-05-13 cs.AI

No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents

Zixu Yang, Hang Zheng, Nan Jiang, Zhiyang Tang, Situo Zhang, Xiaobao Wu, Lu Chen, Kai Yu

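AI summary: This paper proposes NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture for improving the reliability of service agents on long-horizon tasks. It tracks task progress explicitly through a structured global state and adds an independent Director agent that verifies and intervenes before critical actions, reducing policy violations, tool hallucinations, and user-intent misalignment. Experiments show that NOD outperforms existing methods in task success rate and critical action precision.

Abstract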

Large language model (LLM) agents have increasingly advanced service applications, such as booking flight tickets. However, these service agents suffer from unreliability in long-horizon tasks, as they often produce policy violations, tool hallucinations, and misaligned actions, which greatly impedes their real-world deployment. To address these challenges, we propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture for service agents. Instead of maintaining task state implicitly in dialogue context as in prior work, we externalize a structured Global State to enable explicit task state tracking and consistent decision-making by the Navigator. In addition, we introduce selective external oversight before critical actions, allowing an independent Director agent to verify execution and intervene when necessary. As such, NOD effectively mitigates error propagation and unsafe behavior in long-horizon tasks. Experiments on $\tau^2$-Bench demonstrate that NOD achieves higher task success rates and critical action precision than baselines. More importantly, NOD improves the reliability of service agents by reducing policy violations, tool hallucinations, and user-intent misalignment.
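The two ideas in the abstract, an externalized task state and a verification gate before critical actions, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the NOD implementation: the action names, the `GlobalState` fields, and the approval rule are all hypothetical, and the Navigator and Operator roles are collapsed into a single `execute` function.

```python
from dataclasses import dataclass, field

# Hypothetical set of actions that need sign-off before execution.
CRITICAL_ACTIONS = {"cancel_booking", "issue_refund"}


@dataclass
class GlobalState:
    """Explicit task state kept outside the dialogue context."""
    user_intent: str
    confirmed_by_user: bool = False
    history: list = field(default_factory=list)


def director_approves(state, action):
    """Toy check: the action must match the intent and be user-confirmed."""
    return state.confirmed_by_user and action == state.user_intent


def execute(state, action):
    """Run an action, routing critical ones through the director first."""
    if action in CRITICAL_ACTIONS and not director_approves(state, action):
        state.history.append(("blocked", action))
        return "blocked"
    state.history.append(("executed", action))
    return "executed"


state = GlobalState(user_intent="cancel_booking")
print(execute(state, "lookup_booking"))  # non-critical, runs directly
print(execute(state, "cancel_booking"))  # critical, user has not confirmed yet
state.confirmed_by_user = True
print(execute(state, "cancel_booking"))  # critical, now approved
```

Only critical actions pay the cost of the extra check, which mirrors the "selective" oversight described above; every decision is also recorded in the state so later steps can consult it.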

2605.12237 2026-05-13 cs.CV

UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

Shuo Ni, Tong Wang, Jing Zhang, He Chen, Haonan Guo, Ning Zhang, Bo Du

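AI summary: As ultra-high-resolution (UHR) Earth observation imagery becomes widespread, vision-language models (VLMs) face a "resolution illusion": higher-resolution input offers richer visual detail but does not yield reliable recognition of tiny targets. The study introduces the UHR-Micro benchmark, with 11,253 instructions grounded in 1,212 UHR images, to evaluate VLMs on micro-scale targets, and proposes Micro-evidence Active Perception (MAP), which actively locates and parses task-relevant micro-evidence to improve perception of small targets in high-resolution images. Together they provide a systematic platform for diagnosing and improving high-resolution reasoning in Earth observation VLMs.

Abstract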

Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.

2605.12236 2026-05-13 cs.RO cs.AI cs.LG

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

Matthew M. Hong, Jesse Zhang, Anusha Nagabandi, Abhishek Gupta

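AI summary: This paper proposes TMRL, which uses diffusion-timestep-modulated pre-training to address the limited exploration of behavior-cloning-pretrained policies during reinforcement learning fine-tuning. The approach combines Context-Smoothed Pre-training (CSP), which injects diffusion noise into policy inputs to broaden the action distribution, with Timestep-Modulated Reinforcement Learning (TMRL), which adjusts the diffusion timestep during fine-tuning to control exploration. The method improves sample efficiency across several kinds of policy input and enables efficient fine-tuning on complex real-world tasks.

Abstract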

Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that bridges BC pre-training and RL fine-tuning to provide the exploration needed for efficient robot policy fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code are available at https://weirdlabuw.github.io/tmrl/.
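To make "injects forward-diffusion noise into policy inputs" concrete, the sketch below applies the standard DDPM-style forward process, x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, to a toy observation vector. The linear beta schedule, the number of steps, and the function names are assumptions made for illustration; the abstract does not specify which schedule CSP uses or how the timestep is chosen.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) for i < t under a linear schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_observation(obs, t, num_steps, rng):
    """Sample x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps, eps ~ N(0, I)."""
    a = alpha_bar(t, num_steps)
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in obs
    ]


rng = random.Random(0)
obs = [0.5, -1.0, 0.25]  # toy state vector
for t in (0, 100, 500, 1000):
    noisy = noise_observation(obs, t, 1000, rng)
    print(t, round(alpha_bar(t, 1000), 4), [round(v, 3) for v in noisy])
```

At t = 0 the input passes through unchanged (precise imitation), and as t grows the observation carries less information, so a policy conditioned on it has to cover a broader set of actions. Treating t as something the agent can modulate is what the abstract describes as the exploration control in TMRL.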

2605.12233 2026-05-13 cs.LG cs.AI cs.CR

No More, No Less: Task Alignment in Terminal Agents

Sina Mavali, David Pape, Jonathan Evertz, Samira Abedini, Devansh Srivastav, Thorsten Eisenhofer, Sahar Abdelnabi, Lea Schönherr

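AI summary: This paper studies how terminal agents should interpret and selectively follow instructions found in the environment while executing complex tasks, rather than accepting them blindly or ignoring them entirely. The authors introduce TAB, a benchmark of 89 tasks, each containing a necessary cue and a distractor, so that an agent must tell them apart and use only the valid cue. Experiments reveal a systematic gap between task capability and task alignment in current frontier terminal agents.

Abstract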

Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor. Solving these tasks requires selectively using the cue while ignoring the distractor. Applying TAB to ten frontier agents reveals a systematic gap between task capability and task alignment. The strongest Terminal-Bench agent achieves high task completion but low task alignment on TAB. Evaluating six prompt-injection defenses further shows that suppressing distractor execution also suppresses the cues required for task completion. These results demonstrate that task-aligned agents require selective use of environmental instructions rather than blanket acceptance or rejection.

2605.12228 2026-05-13 cs.RO

Morphologically Equivariant Flow Matching for Bimanual Mobile Manipulation

Max Siebenborn, Daniel Ordoñez Apraez, Sophie Lueth, Giulio Turrisi, Massimiliano Pontil, Claudio Semini, Georgia Chalvatzaki

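AI summary: This paper studies coordinated control of bimanual mobile robots and proposes a flow matching method based on morphological symmetry to improve the efficiency and generalization of imitation learning. By formalizing the mirror-symmetry prior of bimanual systems, the authors design a $\mathbb{C}_2$-equivariant policy that enforces reflective symmetry and generalizes zero-shot to unseen mirrored configurations. Experiments show improved sample efficiency across several mobile manipulation tasks, with validation on a real robot platform.

Comments Preprint. 4 pages, 5 figures

Abstract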

Mobile manipulation requires coordinated control of high-dimensional, bimanual robots. Imitation learning methods have been broadly used to solve these robotic tasks, yet typically ignore the bilateral morphological symmetry inherent in such systems. We argue that morphological symmetry is an underexplored but crucial inductive bias for learning in bimanual mobile manipulation: knowing how to solve a task in one configuration directly determines how to solve its mirrored counterpart. In this paper, we formalize this symmetry prior and show that it constrains optimal bimanual policies to be ambidextrous and equivariant under reflections across the robot's sagittal plane. We introduce a $\mathbb{C}_2$-equivariant flow matching policy that enforces reflective symmetry either via a regularized training loss or an equivariant velocity network. Across planar and 6-DoF mobile manipulation tasks, symmetry-informed policies consistently improve sample efficiency and achieve zero-shot generalization to mirrored configurations absent from the training distribution. We further validate this zero-shot generalization capability on a real-world manipulation task with a TIAGo++ robot. Together, our findings establish morphological symmetry as an effective, generalizable, and scalable inductive bias for ambidextrous generative policy learning.
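Reflection equivariance means that mirroring the input and then applying the policy gives the same result as applying the policy and then mirroring the output. The sketch below shows one generic way to obtain this property, group averaging over the two elements of $\mathbb{C}_2$. It is a different construction from the regularized loss and the equivariant velocity network named in the abstract, and the 4-dimensional left/right layout and the `raw_policy` map are made up for illustration.

```python
def reflect(v):
    """Mirror a [left_x, left_y, right_x, right_y] vector.

    Assumed toy convention: swap the left and right arms and negate the
    lateral (y) components. Applying it twice returns the original vector.
    """
    lx, ly, rx, ry = v
    return [rx, -ry, lx, -ly]


def raw_policy(obs):
    """Arbitrary asymmetric map standing in for a learned velocity network."""
    lx, ly, rx, ry = obs
    return [0.3 * lx + ry, ly * ly - rx, 2.0 * rx + 0.1, lx - 0.5 * ry]


def symmetrized_policy(obs):
    """Group-average over C2: (f(x) + g f(g x)) / 2."""
    direct = raw_policy(obs)
    mirrored = reflect(raw_policy(reflect(obs)))
    return [(a + b) / 2.0 for a, b in zip(direct, mirrored)]


obs = [0.4, -0.2, 1.1, 0.7]
print(symmetrized_policy(reflect(obs)))
print(reflect(symmetrized_policy(obs)))
```

The two printed vectors agree, whereas the same comparison with `raw_policy` does not. This is why a policy with the symmetry built in can act on a mirrored configuration it never saw during training.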

2605.12227 2026-05-13 cs.CL

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins

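AI summary: This study aims to improve the long-context reasoning of large language models. To address the limits of existing methods in accuracy, stability, and sample efficiency, it proposes dGRPO, which combines policy optimization with knowledge distillation. Dense guidance from a teacher model strengthens performance on long-sequence tasks while preserving short-context capabilities. The study also builds LongBlocks, a synthetic dataset covering multi-hop reasoning, contextual grounding, and long-form generation, and validates the method experimentally on long-context alignment.

Abstract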

Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher via OPD. We also introduce LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablations comparing off-policy training, sparse-reward GRPO, and our combined approach, leading to an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective path to long-context reasoning, while preserving short-context capabilities.
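The sparse-reward problem mentioned for GRPO is easy to see from the group-relative advantage, which in its commonly used form standardizes rewards within a group of sampled completions. The sketch below computes that advantage and then shows one hypothetical way a dense distillation term could be added per token. The additive form and the `kd_weight` parameter are assumptions; the abstract does not give the exact dGRPO objective.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def combined_token_loss(advantage, logp_student, kl_to_teacher, kd_weight=0.5):
    """Toy per-token objective: policy-gradient term plus a distillation term."""
    return -advantage * logp_student + kd_weight * kl_to_teacher


# Mixed outcomes: the group provides a learning signal.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# All completions fail: every advantage is zero, so the reward term vanishes
# and only the dense teacher term still contributes to the loss.
failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(failed)
print(combined_token_loss(failed[0], logp_student=-1.2, kl_to_teacher=0.3))
```

On long-context tasks where most sampled completions receive the same reward, the second case is frequent, which is one way to read the abstract's motivation for adding token-level guidance from a teacher.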

2605.12225 2026-05-13 cs.CL

Mechanistic Interpretability of ASR models using Sparse Autoencoders

Dan Pluth, Zachary Nicholas Houghton, Yu Zhou, Vijay K. Gurbani

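AI summary: This paper studies the mechanistic interpretability of Transformer-based automatic speech recognition (ASR) models, using a sparse autoencoder (SAE) to map frame-level encoder representations from Whisper into a high-dimensional sparse latent space. The study finds diverse monosemantic features spanning linguistic and non-linguistic properties and demonstrates cross-lingual feature steering, showing that SAEs are viable for interpreting ASR models and that the Whisper encoder carries rich linguistic information.

Comments 10 pages + references and appendix

Abstract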

Understanding the inner workings of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in domains that affect the public at large, such as industry, academia, finance, and health. While these models have advanced rapidly, their internal mechanisms remain largely a mystery. Techniques such as Sparse Autoencoders (SAEs) have emerged to expose these mechanisms by projecting dense representations into a sparse vector. While existing research has demonstrated the viability of SAEs for interpreting text-based Large Language Models (LLMs), there are no equivalent studies that apply an SAE to audio processing models such as Automatic Speech Recognition (ASR) systems. In this work, an SAE is applied to Whisper, a Transformer-based ASR model, learning a high-dimensional sparse latent space from frame-level embeddings extracted from the Whisper encoder. Our work uncovers diverse monosemantic features across linguistic and non-linguistic boundaries, and demonstrates cross-lingual feature steering. This work establishes the viability of an SAE model and demonstrates that Whisper encodes a rich amount of linguistic information.
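A sparse autoencoder of the kind described above can be sketched as an overcomplete encoder with a non-negative activation, a linear decoder, and a sparsity penalty. The version below uses a ReLU encoder with an L1 penalty, which is one common variant and an assumption here; the abstract does not say which SAE variant, dictionary size, or loss the authors use. The weights are hand-picked toy values, not learned ones.

```python
def matvec(matrix, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode a dense embedding into sparse latents, then reconstruct it."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]  # ReLU keeps latents non-negative
    x_hat = [a + b for a, b in zip(matvec(w_dec, z), b_dec)]
    return z, x_hat


def sae_loss(x, x_hat, z, l1_weight=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_weight * sum(abs(v) for v in z)


# Toy setup: a 2-d "frame embedding" expanded into 4 latent features.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

frame = [0.5, -2.0]
latents, recon = sae_forward(frame, w_enc, b_enc, w_dec, b_dec)
print(latents, recon, sae_loss(frame, recon, latents))
```

Only two of the four latents fire for this input, and each active latent corresponds to one direction in the input space. Interpreting a trained SAE amounts to asking what inputs make each such latent fire, and steering amounts to editing a latent before decoding.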

2605.12224 2026-05-13 cs.LG

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

Rodney A Sanchez, Ferat Sahin, Alex Ororbia, Jamison Heard

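AI summary: This paper proposes intrinsic vicarious conditioning, an intrinsic reward mechanism intended to overcome the limits of direct conditioning in conventional reinforcement learning. Drawing on psychology and biology, it uses memory-based mechanisms to implement four steps (attention, retention, reproduction, and reinforcement) without access to the demonstrator's policy or reward function, which supports low-shot learning. Experiments show longer episode lengths in both evaluated environments and better handling of non-descriptive terminal states, offering a more cognitively plausible paradigm for settings such as single-life and continual learning.

Abstract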

Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents' capacity by restricting how they learn from the environment as well as from others. Off-policy or learn-by-example methods can learn from demonstrators' representations, but they require access to the demonstrating agent's policies or their reward functions. Our work overcomes this direct sampling limitation by introducing vicarious conditioning as an intrinsic reward mechanism. We draw from psychological and biological literature to provide a foundation for vicarious conditioning and use memory-based methods to implement its four steps: attention, retention, reproduction, and reinforcement. Crucially, our vicarious conditioning paradigms support low-shot learning and require neither the demonstrator agent's policy nor its reward functions. We evaluate our approach in the MiniWorld Sidewalk environment, one of the few public environments that features a non-descriptive terminal condition (no reward provided upon agent death), and extend it to Box2D's CarRacing environment. Our results across both environments demonstrate that vicarious conditioning enables longer episode lengths by discouraging the agent from non-descriptive terminal conditions and guiding the agent toward desirable states. Overall, this work emulates a cognitively plausible learning paradigm better suited to problems such as single-life learning or continual learning.

2605.12220 2026-05-13 cs.CV cs.AI cs.LG cs.RO

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

Mohammad Khoshkdahan, Alexey Vinel

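AI summary: This paper proposes TriBand-BEV, a real-time LiDAR-only 3D pedestrian detection method based on a height-aware bird's-eye-view (BEV) encoding and high-resolution feature fusion. It maps the 3D point cloud into a 2D BEV tensor with three height bands, recasts 3D detection as 2D detection, and reconstructs 3D boxes from the BEV outputs. A single network jointly detects cars, pedestrians, and cyclists using hierarchical bidirectional feature fusion and distribution focal learning; on KITTI it surpasses Complex-YOLO on pedestrian BEV AP while running at 49 FPS, making it suitable for real-time robotic deployment.

Comments Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract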

Safe autonomous agents and mobile robots need fast, real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's-eye-view (BEV) encoding, which maps the full 3D LiDAR point cloud into a lightweight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re-binning and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On the KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP (%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real-time robotic deployment. Our source code is publicly available on GitHub.
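Two steps from the abstract lend themselves to a small sketch: binning a point cloud into a BEV grid with three height bands, and IQR filtering of outlier values. The grid extents, cell size, band edges, and the choice to store point counts per cell are all assumptions; the abstract does not state what each channel holds or where the band boundaries lie.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if lo <= v <= hi]


def encode_bev(points, x_range, y_range, cell, z_edges):
    """Count points per (height band, row, col) cell of a BEV grid."""
    rows = int((x_range[1] - x_range[0]) / cell)
    cols = int((y_range[1] - y_range[0]) / cell)
    bands = len(z_edges) - 1
    grid = [[[0] * cols for _ in range(rows)] for _ in range(bands)]
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # outside the BEV footprint
        band = next(
            (b for b in range(bands) if z_edges[b] <= z < z_edges[b + 1]), None
        )
        if band is None:
            continue  # above or below all height bands
        r = min(int((x - x_range[0]) / cell), rows - 1)
        c = min(int((y - y_range[0]) / cell), cols - 1)
        grid[band][r][c] += 1
    return grid


# Toy points (x, y, z) in metres; the last one lies outside the grid.
points = [(0.5, 0.5, -1.0), (0.6, 0.4, 0.2), (1.5, 1.5, 1.2), (9.0, 9.0, 0.0)]
bev = encode_bev(points, (0.0, 2.0), (0.0, 2.0), 1.0, [-2.0, -0.5, 0.8, 2.0])
print(bev)

heights = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 9.0]  # 9.0 is a spurious return
print(iqr_filter(heights))
```

The result is a three-channel image-like tensor that a 2D detector can consume, while coarse height information survives in the channel index. The IQR filter then helps when a 2D detection is lifted back to a 3D box, since a single stray point would otherwise stretch the estimated height.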

2605.12218 2026-05-13 cs.CV

Learning Ego-Centric BEV Representations from a Perspective-Privileged View: Cross-View Supervision for Online HD Map Construction

Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf

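AI summary: This paper studies how to learn ego-centric bird's-eye-view (BEV) representations from multi-camera input for online HD map construction. To address the inconsistent structural reasoning caused by relying solely on ego-centric supervision, the authors propose Cross-View Supervision (CVS), which transfers geometric and topological priors from an overhead view into camera-based BEV encoders to improve structural coherence. Experiments show mAP gains in both the standard and extended regions, supporting its effectiveness for long-range map construction.

Abstract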

Bird's-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9 mAP in the standard $60\times30\,\mathrm{m}$ region and +9.9 mAP in the extended $100\times50\,\mathrm{m}$ setting, corresponding to a 44% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.

2605.12207 2026-05-13 cs.LG cs.AI cs.CL

Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

Arijit Sehanobish, Charles Lovering

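AI summary: This paper studies parameter placement in low-rank adaptation (LoRA): with a fixed number of trainable parameters, does the choice of which parameters to train matter? Under supervised fine-tuning, random and gradient-informed selections perform similarly, but under GRPO only gradient-informed selection improves performance. The authors propose an efficient scoring procedure that identifies the critical parameters at very low compute cost; these concentrate in projections that write to the residual stream and are consistent across model scales.

Comments Preprint. Comments welcome

Abstract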

We study the parameter placement problem: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B to 8B).
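The phrase "only elements with consistently signed gradients retain the learning signal" suggests a simple selection rule, sketched below: score each entry by how consistently its gradient keeps the same sign across steps, then keep the top k. This is a hypothetical scoring rule written for illustration; the abstract does not describe the authors' actual scoring procedure, and the gradient values are made up.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_consistency_scores(grad_history):
    """Score each entry by |mean sign| of its gradient across steps (0 to 1)."""
    steps = len(grad_history)
    n = len(grad_history[0])
    return [
        abs(sum(sign(g[i]) for g in grad_history)) / steps for i in range(n)
    ]


def select_top_k(scores, k):
    """Indices of the k highest-scoring entries (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


# Four steps of gradients for four flattened entries of B (toy values).
grad_history = [
    [0.2, -0.1, 0.5, -0.3],
    [0.1, 0.3, -0.4, -0.2],
    [0.3, -0.2, 0.6, -0.1],
    [0.2, 0.1, -0.5, -0.4],
]
scores = sign_consistency_scores(grad_history)
print(scores)
print(select_top_k(scores, k=2))
```

Entry 2 has the largest gradient magnitudes but flips sign every step, so its updates cancel; entries 0 and 3 are smaller but always push in the same direction. A magnitude-based or random choice would treat these cases very differently from a sign-consistency score.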

2605.12206 2026-05-13 cs.LG

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

Asad Bakija, Florent De Geeter, Julien Brandoit, Pierre Sacré, Guillaume Drion

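AI summary: In reinforcement learning, agents in partially observable Markov decision processes (POMDPs) rely on memory, usually implemented with recurrent neural networks, to integrate past observations. This paper studies generalization in long-horizon tasks, introduces the notion of temporal horizon generalization, and derives a necessary and sufficient condition for it. It finds that multistability is necessary for temporal horizon generalization, whereas modern parallelizable architectures such as state space models and gated linear RNNs are monostable by construction and cannot generalize across horizons; designing parallelizable architectures that combine multistability with transient dynamics is therefore a key direction for long-horizon RL.

Comments 23 pages, 6 figures

Abstract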

In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observation and the optimal action are separated by many time steps (called the horizon), are particularly challenging: training suffers from poor generalization, severe sample inefficiency, and prohibitive exploration costs. Ideally, an agent trained on short horizons would retain optimal behavior at arbitrarily longer ones, but no formal framework currently characterizes when this is achievable. To fill this gap, we formalized temporal horizon generalization, the property that a policy remains optimal for all horizons, derived a necessary and sufficient condition for it, and experimentally evaluated the ability of nonlinear and parallelizable RNN variants to achieve it. This paper presents the resulting theoretical framework, the empirical evaluation, and the dynamical interpretation linking RNN behavior to temporal horizon generalization. Our analyses reveal that multistability is necessary for temporal horizon generalization and, in simple tasks, sufficient; more complex tasks further require transient dynamics. In contrast, modern parallelizable architectures, namely state space models and gated linear RNNs, are monostable by construction and consequently fail to generalize across temporal horizons. We conclude that multistability and transient dynamics are two essential and complementary dynamical regimes for horizon generalization, and that no current parallelizable RNN exhibits both. Designing parallelizable architectures that combine these regimes thus emerges as a key direction for scalable long-horizon RL.
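The contrast between monostable and multistable memory can be seen in a one-dimensional toy, sketched below under simplifying assumptions: a stable linear recurrence stands in for the linear family, and a scalar tanh recurrence with gain above 1 stands in for a multistable nonlinear RNN. Neither is one of the architectures evaluated in the paper, and the decay and gain values are arbitrary.

```python
import math


def run_linear(pulse, steps, decay=0.9):
    """h <- decay * h + x with |decay| < 1: a single fixed point at 0."""
    h = decay * 0.0 + pulse  # the only non-zero input arrives at step 0
    for _ in range(steps):
        h = decay * h  # zero input afterwards
    return h


def run_bistable(pulse, steps, gain=2.0):
    """h <- tanh(gain * h + x) with gain > 1: two stable fixed points."""
    h = math.tanh(gain * 0.0 + pulse)  # the only non-zero input at step 0
    for _ in range(steps):
        h = math.tanh(gain * h)  # zero input afterwards
    return h


for horizon in (10, 100, 1000):
    print(
        horizon,
        run_linear(1.0, horizon),
        run_bistable(1.0, horizon),
        run_bistable(-1.0, horizon),
    )
```

The linear state decays toward its single fixed point, so the trace of the initial observation shrinks as the horizon grows. The bistable state settles near one of two attractors determined by the sign of the initial pulse and stays there for any horizon, which is the kind of behavior the abstract identifies as necessary for horizon generalization.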

2605.12200 2026-05-13 cs.LG

Investigating simple target-covariate relationships for Chronos-2 and TabPFN-TS

Gaspard Berthelier, Mariia Baranova, Andrei-Tiberiu Pantea, Etienne Le Naour, Adrien Petralia, Tahar Nabil, Themis Palpanas

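AI summary: This paper examines how the time series foundation models (TSFMs) Chronos-2 and TabPFN-TS handle simple target-covariate relationships. Using controlled experiments, the authors assess how well each model integrates covariates; TabPFN-TS captures these relationships more effectively, especially at short forecast horizons, indicating that the strong benchmark performance of Chronos-2 does not necessarily imply optimal modeling of simple covariate dependencies.

Abstract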

Time Series Foundation Models (TSFMs) have recently achieved state-of-the-art performance, often outperforming supervised models in zero-shot settings. Recent TSFM architectures, such as Chronos-2 and TabPFN-TS, aim to integrate covariates. In this paper, we design controlled experiments based on simple target-covariate relationships to assess this integration capability. Our results show that TabPFN-TS captures these relationships more effectively than Chronos-2, especially for short horizons, suggesting that the strong benchmark performance of Chronos-2 does not automatically translate into optimal modeling of simple covariate-target dependencies.

2605.12199 2026-05-13 cs.LG cs.AI

Overtrained, Not Misaligned

Joel Schreiber, Ariel Goldstein

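AI summary: This paper studies emergent misalignment (EM), where fine-tuning a large language model on a narrow task causes broad misalignment in unrelated domains. Systematic experiments on 12 open-source models show that EM is not universal and that model size correlates significantly with EM susceptibility. The study further shows that EM appears late in training and can be avoided through early stopping or careful learning rate selection, giving practical mitigations.

Comments Under review at CoLM 2026; companion to Nature Matters Arising (also under review). 25 pages, 6 figures

Abstract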

Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.
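The early-stopping mitigation can be pictured as choosing the first checkpoint at which the primary task is close to converged, instead of the last one. The sketch below is a minimal illustration with made-up scores; the tolerance rule and the function name are assumptions and do not reproduce the authors' checkpoint selection criterion.

```python
def pick_early_stop(task_scores, tolerance=0.05):
    """Index of the first checkpoint within `tolerance` of the best task score."""
    best = max(task_scores)
    for idx, score in enumerate(task_scores):
        if score >= best - tolerance:
            return idx
    return len(task_scores) - 1


# Toy task accuracy per saved checkpoint (invented numbers, not paper data).
task_scores = [0.40, 0.71, 0.88, 0.90, 0.91, 0.91]
chosen = pick_early_stop(task_scores)
print(chosen, task_scores[chosen])
```

Stopping at the chosen checkpoint gives up a small amount of task performance in exchange for skipping the later training steps, which is where the abstract reports that EM emerges.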

2605.12198 2026-05-13 cs.CV

Enhancing Domain Generalization in 3D Human Pose Estimation through Controllable Generative Augmentation

Xinhao Hu, Yiyi Zhang, Liqing Zhang, Jianfu Zhang

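AI summary: This study addresses domain generalization in 3D human pose estimation, where training and test distributions differ, by proposing a controllable generative augmentation framework that synthesizes diverse video data through systematic variation of poses, backgrounds, and camera viewpoints. By fusing indoor/real-world and outdoor/virtual datasets, it builds enriched training data suited to realistic deployment, significantly improving performance on unseen scenarios and datasets.

Abstract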

Pedestrian motion, due to its causal nature, is strongly influenced by domain gaps arising from discrepancies between training and testing data distributions. Focusing on 3D human pose estimation, this work presents a controllable human pose generation framework that synthesizes diverse video data by systematically varying poses, backgrounds, and camera viewpoints. This generative augmentation enriches training datasets, enhances model generalization, and alleviates the limitations of existing methods in handling domain discrepancies. By leveraging both indoor/real-world and outdoor/virtual datasets, we perform cross-domain data fusion and controllable video generation to construct enriched training data, tailored to realistic deployment settings. Extensive experiments show that the augmented datasets significantly improve model performance on unseen scenarios and datasets, validating the effectiveness of the proposed approach.

2605.12197 2026-05-13 cs.LG

A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning

Haibo Chen, Xin Wang, Jiaheng Chao, Ling Feng, Wenwu Zhu

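AI summary: This paper proposes UniGraphLM, a unified graph language model that targets representation alignment in multi-domain, multi-task graph alignment. It introduces a multi-domain, multi-task graph neural network encoder to learn graph representations that generalize across domains and tasks, and adaptively aligns them with the token space of a large language model. The approach addresses the limitations of existing graph language models in cross-domain and cross-task alignment, offering a new route to general-purpose language understanding of graph data.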

Abstract

Leveraging Graph Neural Networks (GNNs) as graph encoders and aligning the resulting representations with Large Language Models (LLMs) through alignment instruction tuning has become a mainstream paradigm for constructing Graph Language Models (GLMs), combining the generalization ability of LLMs with the structural modeling capacity of GNNs. However, existing GLMs that adopt GNNs as graph encoders largely overlook the problem of aligning GNN-encoded representations across domains and tasks with the LLM token space to obtain unified graph tokens, thereby limiting their ability to generalize across diverse graph data. To bridge this gap, we aim to incorporate a multi-domain, multi-task GNN encoder into GLMs and align its representations with LLMs to enable multi-domain, multi-task graph alignment instruction tuning. This alignment problem remains underexplored and poses two key challenges: 1) learning GNN-encoded representations that are simultaneously generalizable across domains and tasks and well aligned with textual semantics is difficult, due to substantial variations in graph structures, feature distributions, and supervision signals, together with the lack of textual-semantic alignment guidance in task-specific GNN training; 2) diverse graph data and task-specific instructions can exhibit different degrees of compatibility with the LLM token space during instruction tuning, leading to varying alignment difficulty and rendering a fixed alignment strategy suboptimal. To tackle these challenges, we propose UniGraphLM, a Unified Graph Language Model that incorporates a multi-domain, multi-task GNN encoder to learn generalizable graph representations aligned with textual semantics, and then adaptively aligns these representations with the LLM.

2605.12195 2026-05-13 cs.LG

Fair Conformal Classification via Learning Representation-Based Groups

Senrong Xu, Yanke Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Taolue Chen, Feng Xu, Xiaoxing Ma

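AI summary: This paper proposes a fair conformal prediction framework for classification, addressing the fact that conventional conformal prediction guarantees statistical coverage but ignores algorithmic bias. It adaptively identifies subgroups through learned representations and guarantees conditional coverage on them, balancing fairness against predictive utility. Experiments on synthetic and real-world datasets show that the method improves the fairness and reliability of predictions.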

Abstract

Conformal prediction methods provide statistically rigorous marginal coverage guarantees for machine learning models, but such guarantees fail to account for algorithmic biases, thereby undermining fairness and trust. This paper introduces a fair conformal inference framework for classification tasks. The proposed method constructs prediction sets that guarantee conditional coverage on adaptively identified subgroups, which can be implicitly defined through nonlinear feature combinations. By balancing effectiveness and efficiency in producing compact, informative prediction sets and ensuring adaptive equalized coverage across unfairly treated subgroups, our approach paves a practical pathway toward trustworthy machine learning. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the framework.
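To make the contrast between marginal and group-conditional coverage concrete, the following is a minimal sketch of standard split conformal classification calibrated separately per group. It is not the paper's method: the groups here are assumed to be given in advance, whereas the abstract describes subgroups that are identified adaptively. The function names, the score `1 - p_true`, and the toy data are illustrative assumptions.

```python
import math
from collections import defaultdict

def conformal_quantile(scores, alpha):
    """Finite-sample conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this level: include every label.
        return float("inf")
    return sorted(scores)[rank - 1]

def calibrate_by_group(probs, labels, groups, alpha):
    """One threshold per (pre-defined, hypothetical) group, from scores 1 - p(true label)."""
    by_group = defaultdict(list)
    for p, y, g in zip(probs, labels, groups):
        by_group[g].append(1.0 - p[y])
    return {g: conformal_quantile(s, alpha) for g, s in by_group.items()}

def prediction_set(p, group, thresholds):
    """All labels whose nonconformity score is within the group's threshold."""
    t = thresholds[group]
    return [label for label, prob in enumerate(p) if 1.0 - prob <= t]

if __name__ == "__main__":
    # Toy calibration data: predicted class probabilities, true labels, group ids.
    probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
    labels = [0, 0, 0, 0]
    groups = ["a", "a", "a", "a"]
    thresholds = calibrate_by_group(probs, labels, groups, alpha=0.5)
    print(prediction_set([0.75, 0.25], "a", thresholds))
```

Calibrating one threshold per group is the simplest way to move from marginal to group-wise coverage, at the cost of needing enough calibration points in every group.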

2605.12185 2026-05-13 cs.CL cs.AI

Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding

Yigeng Zhou, Wu Li, Yifan Lu, Yequan Wang, Xuebo Liu, Wenya Wang, Jun Yu, Min Zhang, Jing Li

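AI summary: This paper studies conflicts between in-context knowledge and parametric (memory) knowledge in large language models and proposes Dynamic Cognitive Reconciliation Decoding (DCRD), a two-stage decoding method that predicts and mitigates such conflicts. The method analyzes attention maps to assess context fidelity and, based on the prediction, routes the input to either greedy decoding or context-fidelity-based dynamic decoding, improving generation quality under conflict while staying efficient in conflict-free cases. The authors also build the ConflictKG benchmark; experiments show that DCRD outperforms existing methods on multiple question-answering tasks, achieving state-of-the-art performance.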

Comments Accepted by IEEE TASLP

Abstract

Large language models accumulate extensive parametric knowledge through pre-training. However, knowledge conflicts occur when outdated or incorrect parametric knowledge conflicts with external knowledge in the context. Existing methods address knowledge conflicts through contrastive decoding, but in conflict-free scenarios, static approaches disrupt the output distribution. Other dynamic decoding methods attempt to measure the degree of conflict but still struggle with complex real-world situations. In this paper, we propose a two-stage decoding method called Dynamic Cognitive Reconciliation Decoding (DCRD) to predict and mitigate context-memory conflicts. DCRD first analyzes the attention map to assess context fidelity and predict potential conflicts. Based on this prediction, the input is directed to one of two decoding paths: (1) greedy decoding, or (2) context fidelity-based dynamic decoding. This design enables DCRD to handle conflicts efficiently while maintaining high accuracy and decoding efficiency in conflict-free cases. Additionally, to simulate scenarios with frequent knowledge updates, we constructed ConflictKG, a knowledge conflict QA benchmark. Experiments on four LLMs across six QA datasets show that DCRD outperforms all baselines, achieving state-of-the-art performance.

2605.12183 2026-05-13 cs.LG cs.AI

DriftXpress: Faster Drifting Models via Projected RKHS Fields

Ali Falahati, Elliot Creager, Gautam Kamath, Shubhankar Mohapatra

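AI summary: DriftXpress is an accelerated drifting-model method based on projected reproducing kernel Hilbert space (RKHS) fields, aimed at improving the training efficiency of generative models. It approximates the drifting kernel in a low-rank feature space, preserving the attraction-repulsion structure of the original drifting field while lowering the cost of field evaluation. Experiments show that DriftXpress significantly reduces training time while maintaining image-generation quality, further improving the training-inference trade-off of drifting models.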

Abstract

Drifting Models have emerged as a new paradigm for one-step generative modeling, achieving strong image quality without iterative inference. The premise is to replace the iterative denoising process in diffusion models with a single evaluation of a generator. However, this creates a different trade-off: drifting reduces inference cost by moving much of the computation into training. We introduce DriftXpress, an accelerated formulation of drifting models based on projected RKHS fields. DriftXpress approximates the drifting kernel in a low-rank feature space. This preserves the attraction-repulsion structure of the original drifting field while reducing the cost of field evaluation. Across image-generation benchmarks, DriftXpress achieves comparable FID to standard drifting while reducing wall-clock training cost. These results show that the training-inference trade-off of drifting models can be pushed further without giving up their one-step inference advantage.

2605.12182 2026-05-13 cs.RO

DexTwist: Dexterous Hand Retargeting for Twist Motion via Mixed Reality-based Teleoperation

Dongmyoung Lee, Chengxi Li, Dongheui Lee

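AI summary: This paper proposes DexTwist, a mixed-reality-based dexterous hand teleoperation framework that addresses the shortcomings of conventional pose-mapping methods in rotational manipulation tasks. It detects a three-finger (tripod) pinch, estimates the operator's screw axis and rotation magnitude, and performs real-time optimization in joint space to improve stability and accuracy during rotation. Experiments show that DexTwist outperforms a vector-mapping baseline in turning-angle tracking and screw-axis stability.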

Comments 6 pages, 5 figures, 2 tables. Dongmyoung Lee and Chengxi Li contributed equally to this research

Abstract

Dexterous teleoperation via Mixed Reality (MR)-based interfaces offers a scalable paradigm for transferring human manipulation skills to dexterous robot hands. However, conventional retargeting approaches that minimize kinematic dissimilarity (e.g., joint angle or fingertip position error) often fail in contact-rich rotational manipulation, such as cap opening, key turning, and bolt screwing. This failure stems from the embodiment gap: mismatched link lengths, joint axes/limits, and fingertip geometry can cause direct pose imitation to induce tangential fingertip sliding rather than stable object rotation, resulting in screw axis drift, contact slip, and grasp instability. To address this, we propose DexTwist, a functional twist-retargeting framework for MR-based dexterous teleoperation. DexTwist detects a tripod pinch, estimates the operator's intended screw axis and twist magnitude, and applies a real-time residual joint-space refinement that tracks turning progress while regularizing the robot tripod geometry. The refinement minimizes a virtual-object objective defined by turning angle, screw axis consistency, fingertip closure, and tripod stability. Simulation and real-world experiments show that DexTwist improves turning angle tracking and screw axis stability compared with a vector-based retargeting baseline.

2605.12181 2026-05-13 cs.AI

MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification

Jueon Park, Wonjune Jang, Jiwoo Lee, Yein Park, Jaewoo Kang

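AI summary: This paper presents MolDeTox, a new benchmark for molecular detoxification that evaluates how well language models optimize molecular toxicity through stepwise fragment editing. The benchmark addresses problems in existing toxicity-repair work, including limited data diversity, low structural validity of generated molecules, and reliance on proxy models for toxicity assessment, and provides an interpretable evaluation framework through fine-grained task analysis. Experiments show that fragment-level molecular understanding and generation improve structural validity and molecule quality, pointing to a new research direction for drug-safety optimization.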

Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) have recently shown promising capabilities in various scientific domains. In particular, these advances have opened new opportunities in drug discovery, where the ability to understand and modify molecular structures is critical for optimizing drug properties such as efficacy and toxicity. However, existing models and benchmarks often overlook toxicity-related challenges, focusing primarily on general property optimization without adequately addressing safety concerns. In addition, even existing toxicity repair benchmarks suffer from limited data diversity, low structural validity of generated molecules, and heavy reliance on proxy models for toxicity assessment. To address these limitations, we propose MolDeTox, a novel benchmark for molecular detoxification, designed to enable fine-grained and reliable evaluation of toxicity-aware molecular optimization across stepwise tasks. We evaluate a wide range of general-purpose LLMs and VLMs under diverse settings, and demonstrate that understanding and generating molecules at the fragment level improves structural validity and enhances the quality of generated molecules. Moreover, through detailed task-level performance analysis, MolDeTox provides an interpretable benchmark that enables a deeper understanding of the detoxification process. Our dataset is available at: https://huggingface.co/datasets/MolDeTox/MolDeTox

2605.12179 2026-05-13 cs.CV

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

Xin Cheng, Xihua Wang, Ying Ba, Yuyue Wang, Kaisi Guan, Yinbo Wang, Wenpu Li, Ruihua Song

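AI summary: SyncDPO is a post-training framework that uses preference learning to improve temporal synchronization in joint video-audio generation. By introducing rule-based, on-the-fly negative-sample construction strategies, it increases the model's sensitivity to temporal misalignment while avoiding the high sampling-and-ranking cost of conventional approaches. Experiments show that SyncDPO significantly improves temporal alignment on multiple benchmarks and generalizes better on out-of-distribution data.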

Comments Preprint. Under review

Abstract

Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. Post-training for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization (DPO) offers an alternative by introducing explicit misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at https://syncdpo.github.io/syncdpo/.

2605.12178 2026-05-13 cs.AI cs.CL cs.LG

Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

Jishnu Sethumadhavan Nair, Patrice Bechard, Rishabh Maheshwary, Surajit Dasgupta, Sravan Ramachandran, Aakash Bhagat, Shruthan Radhakrishna, Pulkit Pattnaik, Johan Obando-Ceron, Shiva Krishna Reddy Malay, Sagar Davasam, Seganrasan Subramanian, Vipul Mittal, Sridhar Krishna Nemala, Christopher Pal, Srinivas Sunkara, Sai Rajeswar

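AI summary: This paper asks whether enterprise systems need learned world models, noting that because enterprise dynamics are defined by tenant-specific business logic that changes over time, models trained on historical data perform poorly when the deployment changes. It proposes enterprise discovery agents, which obtain the dynamics rules by reading the system configuration at inference time, making predictions more robust. Experiments show that agents based on runtime discovery adapt better to changing dynamics than models that rely on offline training.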

Abstract

World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system's configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning-focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment-shift evaluation to show that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.

2605.12177 2026-05-13 cs.CL

Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

Andrea Morandi, Mahesh Viswanathan

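AI summary: This paper studies how to estimate the quality of large language models accurately when user feedback is sparse and subject to selection bias. The authors propose a multi-agent hierarchical Bayesian method that debiases feedback without ground-truth labels. Three cooperating agents, for topic clustering, bias modeling, and synthesis, correct the estimation bias caused by unevenly distributed user feedback, and the method outperforms conventional approaches in experiments.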

Abstract

[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment-stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat{\pi}_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar{Q} = \sum_c \hat{\pi}_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback ($N=10{,}232$ retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
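The synthesis step described above, reweighting per-topic quality by topic prevalence, can be sketched in a few lines. This is only an illustration of the formula $\bar{Q} = \sum_c \hat{\pi}_c q_c$ applied draw by draw. The posterior draws below are placeholders sampled from fixed Beta distributions rather than output of the hierarchical model, and all function names and numbers are assumptions made for the example.

```python
import random

def prevalence_weights(n_c):
    """pi_hat_c = n_c / N, where n_c counts all interactions in topic c."""
    total = sum(n_c)
    return [n / total for n in n_c]

def aggregate_draws(q_draws, n_c):
    """Map each posterior draw of (q_1..q_C) to one draw of Q_bar = sum_c pi_hat_c * q_c."""
    pi_hat = prevalence_weights(n_c)
    return [sum(p * q for p, q in zip(pi_hat, draw)) for draw in q_draws]

def summarize(q_bar_draws, level=0.95):
    """Posterior mean and a crude equal-tailed interval read off the sorted draws."""
    s = sorted(q_bar_draws)
    lo = s[round((1 - level) / 2 * (len(s) - 1))]
    hi = s[round((1 + level) / 2 * (len(s) - 1))]
    return sum(s) / len(s), lo, hi

if __name__ == "__main__":
    rng = random.Random(0)
    n_c = [6000, 3000, 1000]                      # hypothetical topic sizes
    beta_params = [(80, 20), (50, 50), (20, 80)]  # placeholder per-topic posteriors
    q_draws = [[rng.betavariate(a, b) for a, b in beta_params] for _ in range(2000)]
    mean, lo, hi = summarize(aggregate_draws(q_draws, n_c))
    print(f"Q_bar ~ {mean:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

Weighting by $n_c/N$ over all interactions, rather than by the number of feedback events per topic, is what keeps heavily rated topics from dominating the aggregate.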

2605.12176 2026-05-13 cs.LG

Multi-Task Representation Learning for Conservative Linear Bandits

Jiabin Lin, Shana Moothedath

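AI summary: This paper proposes a constrained multi-task representation learning (CMTRL) framework for conservative linear bandits. It assumes that multiple linear bandit tasks share a common low-dimensional representation and that action selection in each task is subject to safety or performance constraints. The authors design a new algorithm, Safe-AltGDmin, that learns the low-rank feature matrix while satisfying the constraints, and establish theoretical guarantees on regret and sample complexity. Experiments show that the method outperforms existing benchmark algorithms across multiple tasks.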

Abstract

This paper presents the Constrained Multi-Task Representation Learning (CMTRL) framework for linear bandits. We consider T linear bandit tasks in a d-dimensional space, which share a common low-dimensional representation of dimension r, where r is much smaller than the minimum of d and T. Furthermore, tasks are constrained so that only actions meeting specific safety or performance requirements are allowed, referred to as conservative (safe) bandits. We introduce a novel algorithm, Safe-Alternating projected Gradient Descent and minimization (Safe-AltGDmin), to recover a low-rank feature matrix while satisfying the given constraints. Building on this algorithm, we propose a multi-task representation learning framework for conservative linear bandits and establish theoretical guarantees for its regret and sample complexity bounds. We present experiments and compare the performance of our algorithm with benchmark algorithms.
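The shared-representation assumption can be written out as follows. This is one common way to formalize a shared low-dimensional representation in multi-task linear bandits; the symbols $B$, $w_t$, $\theta_t$, and $\Theta$ are illustrative and may not match the paper's notation.

```latex
% Each task parameter lies in a common r-dimensional subspace (notation is illustrative).
\theta_t = B\, w_t, \qquad B \in \mathbb{R}^{d \times r}, \quad w_t \in \mathbb{R}^{r}, \quad t = 1, \dots, T,
\qquad
\Theta = [\theta_1, \dots, \theta_T] = B W \in \mathbb{R}^{d \times T}, \quad \operatorname{rank}(\Theta) \le r \ll \min(d, T).
% Parameter count: dr + rT for the factorization, versus dT when the T tasks are learned independently.
```

Under this reading, the low-rank feature matrix to be recovered is $\Theta$, and the saving comes from estimating $dr + rT$ numbers instead of $dT$.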

2605.12174 2026-05-13 cs.LG math.PR

Expected Batch Optimal Transport Plans and Consequences for Flow Matching

Samuel Boïté, Julie Delon, Kimia Nadjahi

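AI summary: This paper studies the theoretical properties of solving optimal transport (OT) on random minibatches in large-scale learning, with a focus on flow matching (FM). The authors introduce the expected batch OT plan $\overline{\pi}_{k}$, a population-level coupling defined by averaging empirical OT plans over independent minibatches, and analyze its consistency for large batches. In the semidiscrete case relevant to generative modeling, they derive rates for the transport-cost bias and for the convergence of $\overline{\pi}_{k}$ to the true OT plan, giving flow matching firmer theoretical support, and they examine experimentally how batch size interacts with numerical integration.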

Abstract

Solving optimal transport (OT) on random minibatches is a common surrogate for exact OT in large-scale learning. In flow matching (FM), this surrogate is used to obtain OT-like couplings that can straighten probability paths and reduce numerical integration cost. Yet, the population-level coupling induced by repeated minibatch OT remains only partially understood. We formalize this coupling as the expected batch OT plan $\overline{\pi}_{k}$, obtained by averaging empirical OT plans over independent minibatches of size $k$. We then establish its large-batch consistency and, in the semidiscrete case relevant to generative modeling, derive rates for both the transport-cost bias and the convergence of $\overline{\pi}_{k}$ to the OT plan. For FM, this yields a population coupling whose induced velocity field is regular enough to define a unique flow from the source to the discrete target. We finally quantify how OT batch size interacts with numerical integration in a tractable two-atom model and in synthetic and image experiments.
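As a rough illustration of averaging empirical OT plans over independent minibatches of size $k$, the sketch below estimates such an averaged plan between two small point sets by Monte Carlo. It assumes uniform sampling with replacement and a squared Euclidean cost, and it solves each batch exactly by brute force over permutations, which is only practical for very small $k$. The function names are hypothetical and this is not the authors' code.

```python
import itertools
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def batch_ot_assignment(xs, ys):
    """Exact OT between two uniform k-point batches.

    With equal uniform weights an optimal plan can be taken to be a permutation,
    so we search all k! permutations (brute force, tiny k only).
    """
    k = len(xs)
    return min(
        itertools.permutations(range(k)),
        key=lambda perm: sum(sq_dist(xs[i], ys[perm[i]]) for i in range(k)),
    )

def expected_batch_ot_plan(x, y, k, n_batches, seed=0):
    """Monte Carlo average of minibatch OT plans, returned as an n x m matrix of masses."""
    rng = random.Random(seed)
    n, m = len(x), len(y)
    plan = [[0.0] * m for _ in range(n)]
    for _ in range(n_batches):
        i_idx = [rng.randrange(n) for _ in range(k)]  # source minibatch
        j_idx = [rng.randrange(m) for _ in range(k)]  # target minibatch
        perm = batch_ot_assignment([x[i] for i in i_idx], [y[j] for j in j_idx])
        for a in range(k):
            # Each matched pair carries mass 1/k in its batch plan.
            plan[i_idx[a]][j_idx[perm[a]]] += 1.0 / (k * n_batches)
    return plan

if __name__ == "__main__":
    x = [(0.0,), (1.0,), (2.0,)]
    y = [(0.1,), (2.1,)]
    for row in expected_batch_ot_plan(x, y, k=3, n_batches=500):
        print([round(v, 3) for v in row])
```

With $k = 1$ every batch pairs one random source point with one random target point, so the average tends to the independent coupling; larger $k$ moves mass toward OT-like pairings.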

2605.12171 2026-05-13 cs.LG

Lower bounds for one-layer transformers that compute parity

Daniel Hsu

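AI summary: This note studies whether a one-layer Transformer, consisting of self-attention followed by rational-function post-processing, can represent the parity function, and proves that sign-representation is impossible unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combined with rational approximation of ReLU networks, the result yields a margin-dependent lower-bound extension for self-attention layers post-processed by ReLU networks, providing theoretical grounding for understanding the expressive power of Transformers.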

Abstract

This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.