arXivDaily: daily arXiv digest, updated Monday through Friday
2605.07924 2026-05-11 cs.LG cs.AI cs.CL

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova

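AI summary: This paper proposes Trajectory-Shaped Discrete Flow Matching (TS-DFM), a method aimed at reducing the large number of steps that discrete flow matching needs to generate text. Its core idea is to replace the blind stochastic jumps of conventional trajectories with energy-guided navigation, so that higher-quality text can be generated in fewer steps. Experiments show that, with inference cost unchanged, the method substantially improves generation efficiency and quality and outperforms a range of existing generation baselines.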

Abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.

2605.07915 2026-05-11 cs.CV

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

Zhengrong Yue, Taihang Hu, Mengting Chen, Haiyu Zhang, Zihao Pan, Tao Liu, Zikang Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Yali Wang

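AI summary: This paper studies which properties make a latent manifold diffusion-friendly and finds that the key factors are coherent structure, local continuity, and global semantics of the latent space, rather than reconstruction fidelity alone. Building on this finding, the authors propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the organization of the latent manifold through refined priors and perturbation-based regularization. Experiments on ImageNet 256x256 show that PAE markedly improves training efficiency and generation quality over existing methods.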

Abstract

Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or inherit pretrained representations, leaving unclear what kind of latent space is truly friendly for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties are more consistent with downstream generation quality than reconstruction fidelity. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors derived from VFMs and perturbation-based regularization to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256x256, PAE improves both training efficiency and generation quality over existing tokenizers, reaching performance comparable to RAE with up to 13x faster convergence under the same training setup and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.

2605.07914 2026-05-11 cs.LG cs.CV

Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning

Aristotelis Ballas, Christos Diou

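AI summary: This paper studies the relationship between loss-landscape flatness and gradient alignment in multi-distribution learning, noting that prior methods typically target a single geometric property and overlook the interplay between the two. Through an excess-risk decomposition, the authors derive a theoretical framework with a curvature term and an alignment term and, based on it, design the SAGE algorithm, which targets both. Experiments show that SAGE sets a new state of the art on DomainBed and is competitive with, or better than, state-of-the-art methods on multi-task learning.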

Comments Preprint - Submitted to NeurIPS 2026

Abstract

Sharpness-aware and gradient-alignment methods have been shown to improve generalization; however, each family of methods targets a single geometric property of the loss landscape while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of $\bar{H}^{-1}\Sigma_g$, and (ii) a curvature term, controlled by $\bar{H}$, where $\bar{H}$ is the average Hessian and $\Sigma_g$ is the covariance of the gradient across distributions. Notably, $\bar{H}$ appears inverted in one and non-inverted in the other. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration), which targets both terms. The curvature component replaces SAM's gradient-scaled perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. The alignment component, in turn, injects isotropic noise at the descent step, the magnitude of which scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.
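The abstract mentions computing the polar factor of a gradient matrix via Newton-Schulz iteration. As a rough illustration only, the sketch below uses the classical cubic Newton-Schulz update on a small dense matrix in pure Python. The function names, the Frobenius-norm scaling, and the iteration count are assumptions made for this example; the paper's actual variant and coefficients are not given in the abstract.

```python
import math

def matmul(a, b):
    # Plain dense matrix product for small lists-of-lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def polar_factor(g, iters=30):
    """Approximate the orthogonal polar factor of a full-rank square matrix.

    Classical Newton-Schulz update: X <- 1.5 X - 0.5 X X^T X.
    Dividing by the Frobenius norm first puts all singular values in (0, 1],
    which lies inside the convergence region of the iteration.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# Hypothetical 2x2 "gradient" matrix; the result should be close to orthogonal.
G = [[2.0, 0.0], [1.0, 1.0]]
U = polar_factor(G)
UtU = matmul(transpose(U), U)
```

Because every singular value of the result is pushed toward 1, a perturbation built from `U` has similar magnitude in all directions, which matches the motivation stated in the abstract.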

2605.07903 2026-05-11 cs.SD cs.AI

BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

Hamze Hammami, Nidhal Abdulaziz

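AI summary: This paper proposes BeeVe, an unsupervised framework for discovering acoustic states in hive buzzing without relying on predefined semantic units or sound-production models. The method extracts features with a frozen self-supervised PaSST model and learns a discrete acoustic codebook from unlabelled data with a VQ-VAE; the learned tokens separate queenright from queenless colonies and further reveal several stable sub-states. Experiments show that the method captures non-random sequential structure in the acoustic signal and generalizes well to unseen recordings, opening a path toward non-invasive hive health monitoring.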

Abstract

Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p << 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
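The separation between queenright and queenless conditions is reported as a Jensen-Shannon Divergence between token distributions. A minimal sketch of that quantity follows, assuming base-2 logarithms (so the value lies in [0, 1]) and two hypothetical token-usage histograms; the abstract does not state the log base or how the histograms are built.

```python
import math

def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions given as equal-length lists."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 because b is the mixture.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical token-usage histograms over a four-entry codebook.
queenright = [0.70, 0.20, 0.05, 0.05]
queenless = [0.05, 0.15, 0.40, 0.40]
jsd = jensen_shannon_divergence(queenright, queenless)
```

Identical histograms give 0 and histograms with disjoint support give 1 under this convention, so values around 0.6 to 0.7 indicate largely, but not fully, distinct token usage.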

2605.07902 2026-05-11 cs.LG cs.DS

Curvature Beyond Positivity: Greedy Guarantees for Arbitrary Submodular Functions

Yixin Chen, Alan Kuhnle

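AI summary: This paper studies performance guarantees for the greedy algorithm when optimizing submodular functions that may be non-monotone or take negative values. Classically, greedy achieves a 63% approximation ratio on monotone non-negative submodular functions, but existing multiplicative guarantees do not apply once the function is negative or non-monotone. By extending the notion of curvature, the paper handles non-monotonicity and negativity in a unified way, gives an approximation-ratio analysis of greedy for arbitrary submodular functions, and beats the best known uniform ratio over a range of curvature values in the non-monotone regime. The approach also extends to general combinatorial constraints, and experiments support the theory.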

Comments 44 pages, 11 figures

Abstract

Submodular functions -- functions exhibiting diminishing returns -- are central to machine learning. When the objective is monotone and non-negative, the greedy algorithm achieves a tight $63\%$ approximation. But many practical objectives incorporate costs that make them negative on some inputs, and all existing multiplicative guarantees require non-negativity. Prior work handles negativity through additive bounds for the special class of decomposable functions and non-monotonicity through partial-monotonicity parameters, but these address each difficulty in isolation and neither extends the classical structural theory. We extend curvature -- a parameter measuring how far a function deviates from linearity -- to all submodular functions, handling both non-monotonicity and negativity through a single classical concept. A greedy algorithm with pruning achieves a curvature-controlled multiplicative ratio for any submodular function, including those taking negative values -- the first such guarantee beyond monotonicity and non-negativity. In the non-monotone regime $1 \le c_g < 2.2$, the bound strictly beats the best known uniform ratio of $0.401$ (for non-negative $f$), and it recovers the classical $(1-e^{-c_g})/c_g$ guarantee for monotone functions. A multilinear-extension variant extends the framework to general combinatorial constraints via multilinear relaxation. Experiments on cost-penalized experimental design, coverage, feature selection, and a curvature sweep on Multi-News passage selection support the theory.
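For orientation, the classical setting that the abstract starts from can be sketched as follows: plain greedy on a monotone, non-negative coverage objective, plus the curvature-dependent ratio $(1-e^{-c})/c$ quoted above. This is the textbook baseline, not the paper's pruned greedy algorithm, and the sets and helper names are made up for the example.

```python
import math

def greedy_max_coverage(sets, k):
    """Pick up to k sets, each time taking the one with the largest marginal coverage."""
    chosen, covered = [], set()
    for _ in range(k):
        candidates = [i for i in range(len(sets)) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

def curvature_ratio(c):
    """Classical greedy guarantee (1 - e^{-c}) / c for curvature c in (0, 1]."""
    return (1 - math.exp(-c)) / c

# Hypothetical instance: four candidate sets over the ground set {1, ..., 6}.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1}]
chosen, covered = greedy_max_coverage(sets, k=2)
worst_case = curvature_ratio(1.0)  # equals 1 - 1/e, the "63%" figure
```

Smaller curvature means the function is closer to linear, and the ratio approaches 1 as `c` goes to 0.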

2605.07897 2026-05-11 cs.CV cs.AI

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

Hang Wu, Sherin Mary Mathews, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang

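AI summary: This work addresses real-time memory management in streaming video understanding and proposes SAVEMem, a training-free dual-stage framework that brings semantic awareness into both visual memory generation and query retrieval. The first stage builds a three-tier streaming memory online, using a fixed pseudo-question bank so that long-term retention reflects semantic salience; the second stage adjusts the retrieval scope to the query, using an anchor-conditioned recency gate to retrieve adaptively from short-term through mid- and long-term memory. Experiments show substantial gains on several benchmarks together with lower peak GPU memory.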

Abstract

Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.

2605.07888 2026-05-11 cs.LG cs.CV

Enhancing Federated Quadruplet Learning: Stochastic Client Selection and Embedding Stability Analysis

Ozgu Goksu, Nicolas Pugeault

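AI summary: This paper studies how data heterogeneity degrades the generalization of the global model in federated learning and proposes FedQuad, a method that jointly minimizes distances between same-class samples and maximizes distances between different-class samples to mitigate representation misalignment during model aggregation. Experiments on several non-IID datasets show that it outperforms existing methods, and the paper also gives an in-depth analysis of metric-learning-based approaches in centralized and federated settings.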

Comments arXiv admin note: substantial text overlap with arXiv:2509.04107

Abstract

Federated Learning (FL) enables decentralised model training across distributed clients without requiring data centralisation. However, the generalisation performance of the global model is usually degraded by data heterogeneity across clients, particularly under limited data availability and class imbalance. To address this challenge, we propose FedQuad, a novel method that explicitly enforces minimising intra-class representations while enabling inter-class splits across clients. By jointly minimising distances between positive pairs and maximising distances between negative pairs, the proposed approach mitigates representation misalignment introduced during model aggregation. We evaluate our method on CIFAR-10, CIFAR-100, and Tiny-ImageNet under diverse non-IID settings and varying numbers of clients, demonstrating consistent improvements over existing baselines. Additionally, we provide a comprehensive analysis of metric learning-based approaches in both centralised and federated environments, highlighting their effectiveness in alleviating representation collapse under heterogeneous data distributions.

2605.07885 2026-05-11 cs.RO

AERO-VIS: Asynchronous Event-based Real-time Onboard Visual-Inertial SLAM

Yannick Burkhardt, Sebastián Barbas Laina, Simon Boche, Leonard Freißmuth, Stefan Leutenegger

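AI summary: This paper presents AERO-VIS, an asynchronous event-camera-based visual-inertial SLAM system designed to improve real-time localization in challenging environments. The system uses a data-driven, robust keypoint detector and processes the event stream asynchronously to adapt dynamically to runtime demands, achieving low latency and high throughput in real time. Deployment on a UAV demonstrates its accuracy, yielding what the authors describe as the first purely event-based inertial SLAM system that relies solely on onboard compute while demonstrating closed-loop control and large-scale state estimation.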

Comments 8 pages, 4 figures

Abstract

The robustness of event cameras to high dynamic range and motion blur holds the potential to improve visual odometry systems in challenging environments. Although their high temporal resolution does not require synchronous processing, most event-based odometry methods still run at fixed rates, which simplifies system design but restricts latency and throughput. In this work, we present AERO-VIS, a stereo event-inertial SLAM system with an integrated, data-driven, robust, and performance-optimized keypoint detector. By processing the event stream asynchronously, the system dynamically adapts to downstream runtime demands, ensuring low-latency and real-time performance. When deploying AERO-VIS on a UAV, we achieve unprecedented accuracy in onboard event-based SLAM. These unique characteristics enable us to present the first purely event-based inertial SLAM system that demonstrates closed-loop UAV control and large-scale state estimation while relying solely on onboard compute. A video of the experiments and the source code are available at ethz-mrl.github.io/AERO-VIS.

2605.07883 2026-05-11 cs.CL

Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement

Ying Zhang, Congyu Qiao, Xin Geng, Ning Xu

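AI summary: This paper studies the "rigid rejection" problem that commonly arises in safety alignment of large language models (LLMs), where the model answers requests with a uniform refusal template and interactions become less natural. To address it, the authors propose LANCE, which uses label enhancement, predicting a continuous distribution over multiple rejection categories via variational inference, to provide finer-grained feedback that guides the model toward responses that are both safe and natural. Experiments show that LANCE substantially alleviates rigid rejection while maintaining a high level of safety, improving the helpfulness and naturalness of responses.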

Abstract

Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to "rigid rejection," where a general template (e.g., "I cannot fulfill this request") indiscriminately triggers refusals and severely undermines the naturalness of interactions between humans and LLMs. To address this issue, LANCE is proposed in this paper to ensure safe yet flexible and natural responses via label enhancement. Specifically, LANCE employs variational inference to perform label enhancement, predicting a continuous distribution across multiple rejection categories. These fine-grained rejection distributions provide multi-way textual gradients for a refinement model to neutralize the hazardous elements in the prompt, so that the LLMs could generate safe responses that avoid rigid rejections while preserving the naturalness of interactions. Experiments demonstrate that LANCE significantly alleviates the rigid rejection problem while maintaining high security standards, significantly outperforming existing baseline models in terms of helpfulness and naturalness of responses.

2605.07878 2026-05-11 cs.LG stat.ML

Black-box model classification under the discriminative factorization

Hayden Helm, Merrick Ohata, Carey Priebe

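AI summary: This paper studies how a query set can be used to distinguish model properties in black-box model classification. The authors introduce the discriminative factorization to assess query-set quality and show that, under this framework, the probability of chance-level classification decays exponentially in the query budget. Experiments show that query sets selected using the estimated discriminative field reproduce the performance ordering of oracle query sets, providing a theoretical basis and a practical tool for black-box model analysis.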

Abstract

Access to modern generative systems is often restricted to querying an API (the "black-box" setting), and many properties of the system are unknown to the user at inference time. While recent work has shown that low-dimensional representations of models based on the relationship between their embedded responses to a set of queries are useful for inferring model-level properties, the quality of these representations is highly sensitive to the query set. We introduce the discriminative factorization to distinguish between high- and low-quality query sets in the context of black-box model-level classification. Under this framework, the probability of chance-level classification decays exponentially in the query budget. On three auditing tasks, estimated factorization parameters predict the empirical performance decay rate. We conclude by showing that query sets selected using the estimated discriminative field reproduce the empirical ordering of oracle query sets.

2605.07877 2026-05-11 cs.RO

Melding LLM and temporal logic for reliable human-swarm collaboration in complex scenarios

Junfeng Chen, Yuxiao Zhu, An Zhuo, Xintong Zhang, Shuo Zhang, Guanghui Wen, Xiwang Dong, Meng Guo, Zhongkui Li

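AI summary: This paper studies how to achieve reliable task planning for human-swarm collaboration in complex, dynamic environments. The authors propose a neuro-symbolic framework that combines verifiable temporal logic with context-grounded large language model (LLM) reasoning to generate executable subtask sequences that satisfy mission rules, and pair it with an uncertainty-aware scheduler and an event-triggered interaction protocol for efficient, robust collaboration across a heterogeneous robot swarm. The approach reduces the need for operator intervention and improves reliability and scalability in long-horizon missions.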

Abstract

Robot swarms promise scalable assistance in complex and hazardous environments. Task planning lies at the core of human-swarm collaboration, translating the operator's intent into coordinated swarm actions and helping determine when validation or intervention is required during execution. In long-horizon missions under dynamic scenarios, however, reliable task planning becomes difficult to maintain: emerging events and changing conditions demand continual adaptation, and sustained operator oversight imposes substantial cognitive burden. Existing LLM-based planning tools can support plan generation, yet they remain susceptible to invalid task orderings and infeasible robot actions, resulting in frequent manual adjustment. Here we introduce a neuro-symbolic framework for long-horizon human-swarm collaboration that tightly melds verifiable task planning with context-grounded LLM reasoning. We formalize mission goals and operational rules as temporal logic formulas and admissible task orderings as task automata. Conditioned on these formal constraints and live perceptual context, LLMs generate executable subtask sequences that satisfy mission rules and remain grounded in the current scene. An uncertainty-aware scheduler then assigns subtasks across the heterogeneous swarm to maximize parallelism while remaining resilient to disruptions. An event-triggered interaction protocol further limits operator involvement to sparse, high-level confirmation and guidance. Deployment on a heterogeneous robotic fleet yields similar results while remaining robust to hardware-specific actuation and communication uncertainties. Together, these results support a formal and scalable paradigm for reliable and low-overhead human-swarm collaboration in dynamic environments.

2605.07872 2026-05-11 cs.CV cs.AI

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

Yuancheng Wei, Linli Yao, Lei Li, Haojie Zhang, Hao Zhou, Fandong Meng, Xu Sun

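AI summary: To address the lack of research on reward models for video understanding, this paper proposes a unified framework covering benchmark design, data construction, and reward model training. It introduces VURB, a video understanding reward benchmark with 2,100 preference pairs, and builds the large-scale, high-quality VUP-35K preference dataset, which is used to train the VideoDRM and VideoGRM reward models; these substantially improve model performance and reasoning on video understanding tasks.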

Abstract

Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.

2605.07865 2026-05-11 cs.LG cs.AI cs.CL

KL for a KL: On-Policy Distillation with Control Variate Baseline

Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo

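AI summary: This paper proposes vOPD, a policy-gradient method for stabilizing on-policy distillation (OPD) of large language models. It introduces a control variate baseline from the reinforcement learning literature, using the per-token negative reverse KL divergence between the student and the teacher as the value function, which reduces gradient variance without additional compute. Experiments show that vOPD improves training stability while keeping the lightweight single-sample estimator and outperforms vanilla OPD on several mathematical and scientific reasoning tasks.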

Abstract

On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline -- canonically a value function -- from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
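The baseline idea in this abstract can be illustrated at a single token position. In the sketch below, the per-token reward for a sampled token is taken to be log teacher(y) minus log student(y), which is one common way to write the single-sample reverse-KL signal and is an assumption here; its expectation under the student is the negative reverse KL, which serves as the baseline. The toy distributions and function names are invented for the example and are not the authors' code.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) over a small vocabulary of strictly positive probabilities."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher))

def centered_rewards(student, teacher):
    """Per-token reward minus the baseline, for every token that could be sampled."""
    baseline = -reverse_kl(student, teacher)  # expected reward under the student
    rewards = [math.log(t / s) for s, t in zip(student, teacher)]
    return rewards, [r - baseline for r in rewards]

# Toy next-token distributions over a three-word vocabulary (hypothetical values).
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]
rewards, advantages = centered_rewards(student, teacher)

# Expected centered reward and second moments under the student's own sampling.
mean_advantage = sum(s * a for s, a in zip(student, advantages))
raw_second_moment = sum(s * r * r for s, r in zip(student, rewards))
centered_second_moment = sum(s * a * a for s, a in zip(student, advantages))
```

Because the baseline does not depend on which token was sampled, subtracting it leaves the policy-gradient expectation unchanged, while the centered reward has zero mean and a second moment no larger than the raw one.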

2605.07863 2026-05-11 cs.LG

ADKO: Agentic Decentralized Knowledge Optimization

Lucas Nerone Rillo, Zhanhong Jiang, Nastaran Saadati, Aditya Balu, Baskar Ganapathysubramanian, Chinmay Hegde, Soumik Sarkar

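AI summary: This paper presents ADKO, an agentic decentralized knowledge optimization framework for collaborative black-box optimization across multiple autonomous agents, offering sample efficiency, privacy preservation, heterogeneous-objective handling, and communication efficiency. Each agent maintains a private Gaussian process surrogate trained on local data and communicates through knowledge tokens containing directional signals, advantage scores, and optional language-model insights, without sharing raw data or model parameters. The method combines GP upper confidence bound, parallel Bayesian optimization, decentralized learning, and language-model-guided discovery; the theoretical analysis shows that cumulative regret decomposes into several controllable error terms, and a fidelity-aware token pruning strategy retains high-information tokens under a memory budget. Experiments on neural architecture search and scientific discovery tasks validate its effectiveness.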

Comments 31 pages

Abstract

We present Agentic Decentralized Knowledge Optimization (ADKO), a framework for collaborative black-box optimization across autonomous agents that achieves sample efficiency, privacy preservation, heterogeneous-objective handling, and communication efficiency. Each agent maintains a private Gaussian Process (GP) surrogate trained on local data and communicates only through knowledge tokens -- compact, lossy summaries containing directional signals, advantage scores, and optional language-model (LM) insights -- without sharing raw data or model parameters. ADKO unifies GP-Upper Confidence Bound (GP-UCB), parallel Bayesian optimization, decentralized learning, and LM-guided discovery. We provide the first formal analysis of dual information loss: token compression, quantified via mutual-information-based fidelity, and LM approximation error, decomposed into bias and stochastic noise. Our main result shows cumulative regret decomposes into GP error, LM bias, LM noise, and compression loss, with necessary and sufficient conditions for sublinear regret. We also propose fidelity-aware token pruning to preserve high-information tokens under a memory budget. Experiments on neural architecture search and scientific discovery validate the theory and show consistent improvements over strong baselines.

2605.07861 2026-05-11 cs.CV

From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

Yue Yu, Jiayu Wang, Jiajia Shi, Jingjing Chen, Yu-Gang Jiang

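AI summary: This work addresses makeup transfer from synthetic data to real-world scenarios, focusing on preserving the subject's identity and a realistic background. To overcome the weaknesses of existing methods in identity preservation and cross-domain generalization, the authors propose the ConsistentBeauty data curation pipeline and the RealBeauty post-training framework, which uses reinforcement learning with tailored rewards to improve real-world performance. The work also builds a diverse makeup transfer benchmark covering a range of skin tones, ages, genders, and makeup styles for a fuller evaluation under complex real-world conditions.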

Abstract

Makeup transfer aims to apply the makeup style of a reference portrait to a source portrait while preserving identity and background. Early methods formulate this task as unsupervised image-to-image translation, relying on surrogate objectives and often yielding limited performance. Recent diffusion- and flow-based approaches instead exploit synthetic data for supervised training, leading to significant improvements. However, these methods still face two critical challenges: synthetic supervision frequently fails to faithfully preserve identity, and the domain gap between synthetic and real data limits generalization, resulting in degraded performance in complex real-world scenarios. To address these issues, this paper first proposes ConsistentBeauty, a novel data curation pipeline that ensures makeup fidelity and strict identity consistency within the synthesized data. Second, we propose RealBeauty, a synthetic-to-real post-training framework. Beyond supervised learning on curated synthetic data, we further adapt the model to real-world scenarios through reinforcement learning and design novel verifiable rewards tailored to the makeup transfer task. It allows the model to further benefit from real makeup patterns beyond synthetic supervision. In addition, we establish a new diverse benchmark for makeup transfer, covering a wide range of skin tones, ages, genders, poses, and makeup styles, thereby enabling a more comprehensive evaluation of model performance under diverse real-world conditions. Extensive experiments show that our method achieves state-of-the-art performance on multiple benchmarks and demonstrates clear advantages in identity preservation and performance on complex real-world cases.

2605.07860 2026-05-11 cs.LG cs.AI

On the Tradeoffs of On-Device Generative Models in Federated Predictive Maintenance Systems

Usevalad Milasheuski, Piero Baraldi, Enrico Zio, Stefano Savazzi

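AI summary: This paper studies the trade-off between performance and communication overhead when on-device generative models (variational autoencoders, generative adversarial networks, and diffusion models) are used in federated predictive maintenance systems. By comparing models under full and partial federation, it proposes a new taxonomy of federated generative models that treats partial component sharing as a personalization mechanism. Experiments show that in heterogeneous, bandwidth-constrained federated learning settings the generative models differ markedly in utility, stability, and scalability, and that partial federation can outperform full federation in some scenarios.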

Abstract

Federated Learning (FL) has emerged as a promising paradigm for preserving client data ownership and control over distributed Internet of Things (IoT) environments. While discriminative models dominate most FL use cases, recent advances in generative models -- such as Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models (DM) -- offer new opportunities for unsupervised anomaly detection in time series analysis, with relevant applications in predictive maintenance (PdM) in critical industrial infrastructures. In this work, we present a comprehensive analysis of VAEs, GANs, and DMs in the context of federated PdM. We analyze their performance and communication overhead under both full and partial federation setups, where only subsets of model components are shared. Building on this analysis, the paper proposes a novel taxonomy for federated generative models that formalizes partial component sharing as a principled mechanism for model personalization. Our experiments over a real-world time series dataset reveal distinct trade-offs in model utility, stability, and scalability, especially in heterogeneous and bandwidth-constrained FL settings. For the evaluated GAN-based configurations, full federation improves training stability relative to independent local training, although the model remains less robust than the VAE- and DDPM-based alternatives. For DMs, however, partial federation -- especially decoder sharing -- can outperform full federation in bandwidth-constrained, non-IID settings.

2605.07859 2026-05-11 cs.CV

EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

Lang Zhang, JinYi Yoon, Matthew Corbett, Abhijit Sarkar, Bo Ji

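AI summary: Driver cognitive distraction is a major cause of road collisions but remains difficult to detect. This paper proposes EyeCue, a gaze-empowered egocentric video understanding framework for detecting it. The method fuses eye-gaze information with video content to model how the driver's attention is distributed over time and thereby capture the signature of cognitive distraction. The work also builds CogDrive, a multi-scenario dataset, on which EyeCue reaches 74.38% accuracy and clearly outperforms multiple baselines.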

Comments Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

Abstract

Driver cognitive distraction is a major cause of road collisions and remains difficult to detect. Unlike manual or visual distraction, cognitive distraction occurs when the driver's attention is diverted by thoughts unrelated to driving, even when the driver appears visually attentive and exhibits no explicit physical movements. In this work, we propose EyeCue, a gaze-empowered egocentric video understanding framework, to detect driver cognitive distraction. A key insight is that cognitive distraction manifests in the interaction between eye gaze and visual context. To capture this interaction, EyeCue integrates eye gaze with egocentric video to enable context-aware modeling of the driver's attention over time. Furthermore, to tackle the limited scale and diversity of existing datasets, we introduce CogDrive, a comprehensive multi-scenario dataset that augments four existing driving datasets with cognitive distraction annotations. Through extensive evaluations on CogDrive, we show that EyeCue achieves the highest accuracy of 74.38%, outperforming 11 baselines from 6 model families by over 7%. Notably, EyeCue can achieve an accuracy of over 70% across various driving scenarios (different road types, times of day, and weather conditions) with strong generalizability. These results highlight the importance of modeling gaze-context interactions and the effectiveness of cross-modal interaction modeling for multimodal cognitive distraction detection. Our code and CogDrive dataset resources are available at https://github.com/langzhang2000/EyeCue.

2605.07857 2026-05-11 cs.LG

Actor-Critic Algorithm for Dynamic Expectile and CVaR

Yudong Luo, Erick Delage

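AI summary: This paper studies the challenges of optimizing stochastic policies under dynamic risk measures and proposes a surrogate policy gradient that needs no transition perturbation, built on softmax policy parameterization. Leveraging elicitability, it develops model-free value learning methods for the dynamic expectile and conditional value-at-risk. Inspired by Expected SARSA and Expected Policy Gradient, it constructs a model-free off-policy actor-critic algorithm that outperforms existing methods in domains with verifiable risk-averse behavior.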

Abstract

Optimizing dynamic risk with stochastic policies is challenging in both policy updates and value learning. The former typically requires transition perturbation, while the latter may rely on model-based approaches. To address these challenges, we propose a surrogate policy gradient without transition perturbation under softmax policy parameterization. We further develop model-free value learning methods for dynamic expectile and conditional value-at-risk by leveraging elicitability. Finally, inspired by Expected SARSA and Expected Policy Gradient, a model-free off-policy actor-critic algorithm is constructed. Empirical results in domains with verifiable risk-averse behavior show that our algorithm can learn risk-averse policy and consistently outperforms other existing methods.
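Elicitability means a statistic can be written as the minimizer of an expected loss, which is what allows it to be learned from samples. For the expectile, the loss is an asymmetrically weighted squared error. The sketch below computes a static expectile of a batch of returns by iterating the first-order condition of that loss; it is a generic illustration with made-up data, not the paper's dynamic, actor-critic value update.

```python
def expectile(samples, tau, iters=100):
    """tau-expectile: minimizer of the mean of |tau - 1[x <= e]| * (x - e)^2 over e."""
    e = sum(samples) / len(samples)  # start from the mean, which is the 0.5-expectile
    for _ in range(iters):
        # Samples above the current estimate get weight tau, the rest get 1 - tau.
        weights = [tau if x > e else 1.0 - tau for x in samples]
        e = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return e

# Hypothetical sampled returns with one large outlier.
returns = [0.0, 1.0, 2.0, 10.0]
neutral = expectile(returns, tau=0.5)
risk_averse = expectile(returns, tau=0.1)
```

With `tau` below 0.5 the estimate leans toward the lower returns, which is the risk-averse direction when larger returns are better.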

2605.07850 2026-05-11 cs.CL cs.AI cs.LG

MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning

Ionut-Vlad Modoranu, Mher Safaryan, Dan Alistarh

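AI summary As large language models grow in parameter count, the computational cost of fine-tuning has become a significant barrier to deployment. This paper proposes MatryoshkaLoRA, a Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a carefully crafted diagonal matrix between the existing LoRA adapters, enabling dynamic rank selection and efficient embedding of gradient information across sub-ranks. The method keeps accuracy high while making rank adaptation more flexible, and the paper introduces the AURAC metric for evaluating hierarchical low-rank adapters. Experiments show better accuracy-efficiency trade-offs on the evaluated datasets.

English abstract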

With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
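The structural idea in this abstract, a diagonal matrix $P$ placed between the two LoRA adapters, can be sketched as the update $\Delta W = B\,\mathrm{diag}(p)\,A$, where a lower rank is selected by keeping only the leading columns of $B$, entries of $p$, and rows of $A$. The sketch below is an illustration under assumptions: the function name `lora_delta` and the example values of `p_diag` are hypothetical, and the abstract does not state how the paper actually constructs $P$.

```python
def lora_delta(B, A, p_diag, rank=None):
    """Return B[:, :rank] @ diag(p_diag[:rank]) @ A[:rank, :] as nested lists.

    B is d_out x r, A is r x d_in, p_diag holds the r diagonal entries of P.
    """
    r = len(p_diag) if rank is None else rank
    d_out, d_in = len(B), len(A[0])
    return [
        [sum(B[i][k] * p_diag[k] * A[k][j] for k in range(r)) for j in range(d_in)]
        for i in range(d_out)
    ]

# Hypothetical adapters with full rank r = 2.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2
A = [[1.0, 2.0], [3.0, 4.0]]               # 2 x 2

plain_lora = lora_delta(B, A, [1.0, 1.0])          # P = identity gives B @ A
rank_one = lora_delta(B, A, [2.0, 0.5], rank=1)    # keep only the first sub-rank
```

Because a rank-`k` slice is a prefix of the full sum, one set of trained adapters yields an update for every rank up to `r`, which is what makes post-training rank selection possible.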

2605.07847 2026-05-11 cs.CL

Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors

Shuhaib Mehri, Philippe Laban, Sumuk Shashidhar, Marwa Abdulhai, Sergey Levine, Michel Galley, Dilek Hakkani-Tür

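AI summary As user simulators are increasingly used for interactive training and evaluation of AI assistants, accurately simulating real user behavior becomes especially important. This paper proposes a method for measuring the distributional gap between real and simulated user behaviors, validated through a human study and ablations. It finds that 24 existing LLM-based user simulators differ substantially from real users in their behavior distributions, with the gap varying across model families, scales, and behavioral facets; combining behaviorally complementary simulators brings the generated distribution closer to that of real users.

English abstract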

As user simulators are increasingly used for interactive training and evaluation of AI assistants, it is essential that they represent the diverse behaviors of real users. While existing works train user simulators to generate human-like responses, whether they capture the broad and heterogeneous distribution of real user behaviors remains an open question. In this work, we introduce a method to measure the distributional gap between real and simulated user behaviors, validated through a human study and ablations. Given a dataset of real and simulated conversations, our method extracts representations of user behavior from each conversation, quantizes them into discrete distributions via clustering, then computes divergence metrics. We provide the first systematic evaluation of 24 LLM-based user simulators on coding and writing tasks, and reveal a large distributional gap from real users that varies across model families, scales, and behavioral facets. Pairwise comparisons show that most simulators behave similarly, while a few stand apart. Combining behaviorally complementary simulators brings the resulting distribution closer to real users compared to either simulator on its own. Finally, a TF-IDF analysis of the clusters surfaces interpretable patterns of behaviors that simulators capture, miss, and hallucinate.
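The last stage of the pipeline described here (discrete distributions over behavior clusters, then a divergence) can be sketched as follows. The sketch assumes each conversation has already been assigned a cluster id, so the representation-extraction and clustering steps are skipped. The abstract does not say which divergence metrics are used; Jensen-Shannon divergence is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def cluster_histogram(labels, num_clusters):
    """Turn a list of cluster ids into a probability vector over clusters."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_clusters)]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i = 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

# Hypothetical cluster assignments for real and simulated conversations.
real_labels = [0, 0, 1, 2]
simulated_labels = [0, 0, 0, 0]   # the simulator only ever shows behavior 0

real_dist = cluster_histogram(real_labels, 3)
sim_dist = cluster_histogram(simulated_labels, 3)
gap = js_divergence(real_dist, sim_dist)
```

A simulator that collapses onto a single cluster, as in this toy input, produces a gap strictly between 0 and 1; clusters it never visits correspond to behaviors it misses.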

2605.07844 2026-05-11 cs.LG

Distributional simplicity bias and effective convexity in Energy Based Models

Aurélien Decelle, Alfonso de Jesús Navas Gómez, Beatriz Seoane

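AI summary This paper studies non-convexity in the training of energy-based models (EBMs), analyzing the dynamics through the lens of the effective model. The authors find that, under sufficient expressivity, the gradient flow for learning strictly positive distributions has two types of fixed points, data-consistent points and spurious stationary points, and they characterize the stability of perturbations around data-consistent points. The study further shows that gradient dynamics learn lower-order interactions first, which explains the mechanism behind the distributional simplicity bias and clarifies why fixed points that are not data-consistent at low orders are hard to observe in practice.

Comments 13 pages, 2 figures

English abstract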

Energy-based learning is a powerful framework for generative modelling, but its training is inherently non-convex, leading potentially to sensitivity to initialisation, poor local optima, and unstable gradient dynamics. We present a dynamical analysis of energy-based learning through the lens of the effective model, which can be interpreted as either a generalised Ising model with higher-order interactions or the Fourier expansion of the energy. Under sufficient expressivity, we show that the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, we show that perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Finally, we show that gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones. This provides a mechanistic explanation for the distributional simplicity bias and clarifies why fixed points that are not data-consistent at low orders are not observed in practice.

2605.07841 2026-05-11 cs.LG cs.AI cs.DC

$\mathsf{VISTA}$: Decentralized Machine Learning in Adversary Dominated Environments

Hanzaleh Akbari Nodehi, Parsa Moradi, Soheil Mohajer, Mohammad Ali Maddah-Ali

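AI summary This paper studies how to train robustly in decentralized machine learning environments where adversarial nodes form a majority. It proposes an incentive-based framework that accepts and rewards worker reports only when they meet a consistency threshold, turning adversaries from pure saboteurs into rational agents that weigh estimation error against the risk of losing rewards. The authors design an adaptive algorithm, $\mathsf{VISTA}$, that adjusts the acceptance threshold dynamically based on the optimization history and improves convergence, and they prove that it can retain the asymptotic convergence behavior of standard SGD without an honest-majority assumption.

English abstract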

Decentralized machine learning often relies on outsourcing computations, such as gradient evaluations, to untrusted worker nodes. Existing robust aggregation methods can mitigate malicious behavior under honest-majority assumptions, but may fail when adversaries control a majority of the workers. We study this adversary-dominated setting through an incentive-oriented framework in which reports are accepted and rewarded only when they are mutually consistent up to a threshold. This turns the adversary from a pure saboteur into a rational agent that trades off increasing estimation error against the risk of rejection and loss of reward. We consider iterative optimization under this model. Unlike one-shot computation, iterative learning requires long-horizon decisions: permissive acceptance rules enable faster early progress but admit more adversarial corruption, while strict rules improve estimation accuracy but cause frequent rejections. We propose $\mathsf{VISTA}$, an adaptive algorithm that tunes the acceptance threshold using the optimization history. Numerical results show that $\mathsf{VISTA}$ improves convergence over static thresholds. We also provide a rigorous convergence analysis showing that, with suitable incentive-aware adaptation, adversary-dominated decentralized learning can retain the asymptotic convergence behavior of standard SGD without relying on an honest majority.
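The acceptance rule at the center of this framework, accepting worker reports only when they are mutually consistent up to a threshold, can be sketched as below. This is an illustration under assumptions, not the paper's algorithm: the function name `accept_reports`, the use of pairwise Euclidean distance as the consistency measure, and averaging the accepted reports are all hypothetical choices, and the adaptive threshold schedule is not shown because the abstract does not specify it.

```python
import math

def accept_reports(reports, threshold):
    """Return the average report if every pair of reports lies within
    `threshold` (Euclidean distance); otherwise return None (rejection)."""
    for i in range(len(reports)):
        for j in range(i + 1, len(reports)):
            if math.dist(reports[i], reports[j]) > threshold:
                return None  # inconsistent round: nothing accepted or rewarded
    dim = len(reports[0])
    return [sum(r[d] for r in reports) / len(reports) for d in range(dim)]

# Hypothetical gradient reports from two workers.
close_reports = [[1.0, 0.0], [1.0, 0.2]]
far_reports = [[1.0, 0.0], [-1.0, 0.0]]

accepted = accept_reports(close_reports, threshold=0.5)   # averaged estimate
rejected = accept_reports(far_reports, threshold=0.5)     # None
```

The trade-off described in the abstract is visible in `threshold`: a large value rarely rejects but lets an adversary shift the average further, while a small value bounds the error at the cost of more rejected rounds.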

2605.07840 2026-05-11 cs.LG

RelAgent: LLM Agents as Data Scientists for Relational Learning

Xingyue Huang, Louis Tichelman, Jinwoo Kim, Krzysztof Olejniczak, İsmail İlkan Ceylan

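AI summary RelAgent is an LLM-based autonomous data scientist for relational learning. The method has two phases: in the search phase, an LLM agent uses database, validation, and evaluation tools to construct SQL feature programs and select a predictive model; in the inference phase, the resulting program is executed directly with no further LLM calls. The final predictor consists of SQL queries and a classical model, making it fast, deterministic, intrinsically interpretable, and easy to deploy on standard database systems.

English abstract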

Relational learning is a challenging problem that has motivated a wide range of approaches, including graph-based models (e.g., graph neural networks, graph transformers), tabular methods (e.g., tabular foundation models), and sequence-based approaches (e.g., large language models), each with its own advantages and limitations. We propose RelAgent, an LLM-based autonomous data scientist for relational learning, which operates in two phases. In the search phase, an LLM agent uses database, validation, and evaluation workspace tools to construct SQL feature programs and select a predictive model. In the inference phase, the resulting program is executed without further LLM calls. The final predictor consists of SQL queries and a classical model, enabling fast, deterministic, and intrinsically interpretable predictions: features are human-readable queries, and predictions depend only on the resulting query-defined feature map, enabling scalable deployment using standard database systems.

2605.07839 2026-05-11 cs.AI

Exact Regular-Constrained Variable-Order Markov Generation via Sparse Context-State Belief Propagation

François Pachet

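AI summary This paper studies how to generate sequences exactly from variable-order Markov models under regular constraints. The core idea is to extend existing belief propagation to variable-order models by taking the product of the observed context states with the regular constraint automaton, which yields the exact constrained distribution. For a fixed trained context graph and automaton, inference is linear in the sequence length, and the construction avoids expanding to all K-tuples, improving efficiency while remaining exact.

English abstract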

Variable-order Markov models generate sequences over a finite alphabet by conditioning each symbol on the longest available suffix of the generated history. Regular constraints, by contrast, describe finite-horizon control requirements by an automaton: fixed positions, forced endings, metrical patterns, and forbidden copied fragments are all special cases. Existing exact methods already handle regular constraints with belief propagation for first-order Markov chains. The contribution here is the variable-order extension: identifying the state space on which the existing BP-regular machinery must be run when the generator is a variable-order/backoff model. A first-order constraint layer can enforce useful support conditions, but it computes future mass after merging histories that a variable-order generator deliberately keeps distinct. We formalize this mismatch and give the sparse construction obtained by replacing the first-order Markov state with the observed context state, then taking the standard product with the regular constraint automaton. For a fixed trained context graph and automaton, inference is linear in the sequence horizon; in general it is polynomial in the number of reachable product edges. This gives the correct variable-order distribution conditioned on regular constraints without expanding to all K-tuples. The same finite-source interface supports reversible data augmentation by inverse count lookup, matching materialized transposition augmentation without storing transformed corpora. We also separate exact BP inference from generation-time backoff policies, such as singleton avoidance, whose stochastic semantics must be made explicit if exactness is claimed.

2605.07837 2026-05-11 cs.LG cs.AI

Approximation-Free Differentiable Oblique Decision Trees

Subrat Prasad Panda, Blaise Genest, Arvind Easwaran

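AI summary Decision trees are widely used in safety-critical domains such as medical diagnosis because of their interpretability and effectiveness on tabular data. However, training accurate oblique decision trees is hampered by complex optimization and overfitting, and existing methods mostly rely on approximations. This paper proposes DTSemNet, a novel, semantically equivalent, and invertible neural-network representation of hard oblique decision trees that enables approximation-free end-to-end training, improves performance on classification and regression tasks, and extends the use of decision trees to reinforcement learning.

Comments Accepted for publication in JMLR, Vol. 27, 2026

English abstract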

Decision Trees (DTs) are widely used in safety-critical domains such as medical diagnosis, valued for their interpretability and effectiveness on tabular data. However, training accurate oblique DTs is challenging due to complex optimization landscapes and overfitting risks, particularly in regression. Recent advances have introduced differentiable formulations that enable gradient-based training and joint optimization of decision boundaries and leaf regressors. Yet, existing approaches typically rely on approximations, either through probabilistic softening of boundaries (soft DTs) or quantized gradients such as the Straight-Through Estimator (STE). To overcome these limitations, we propose DTSemNet, a novel, semantically equivalent, and invertible representation of hard oblique DTs as neural networks. DTSemNet enables end-to-end training with standard gradient descent, eliminating the need for approximations in both classification and regression. While classification aligns naturally with this formulation, regression remains challenging due to the joint optimization of internal nodes and leaf regressors. To address this, we analyze the limitations of STE and introduce an annealed Top-k method that provides accurate gradient signals without approximation. Extensive experiments on classification and regression benchmarks show that DTSemNet-trained oblique DTs outperform state-of-the-art differentiable DTs. Furthermore, we demonstrate that DTSemNet can serve as programmatic DT policies in reinforcement learning environments, thereby broadening their applicability.

2605.07835 2026-05-11 cs.RO cs.MA

Many-to-Many Multi-Agent Pickup and Delivery

Ethan Schneider, Jingkai Chen, Tianyi Gu, Kunlei Lian, Seth Hutchinson, Sonia Chernova

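AI summary This paper studies the many-to-many Multi-Agent Pickup and Delivery (MAPD) problem faced by multi-robot systems in automated warehouses. It is harder than the traditional one-to-one variant because each task may have several candidate pickup and delivery locations, which greatly increases complexity. The authors propose a new algorithm, M2M, which assigns tasks by minimizing estimated task durations or by additionally incorporating the SKU distribution. Experiments show that the method consistently matches or outperforms the prior state of the art across warehouse environments and inventory densities, completing up to 22,000 more tasks on average.

English abstract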

Multi-robot systems in automated warehouses must manage continuous streams of pickup-and-delivery tasks while ensuring efficiency and safety. Prior work on Multi-Agent Pickup-and-Delivery (MAPD) has largely focused on the one-to-one variant, where each task has a fixed pickup and delivery location. In contrast, real warehouses often present many-to-many MAPD scenarios, where items, tracked by stock keeping unit (SKU) identifiers, can be retrieved from or stored at multiple locations, resulting in an NP-hard four-dimensional assignment problem. To solve the many-to-many MAPD problem, we contribute our algorithm: Many-to-Many Multi-Agent Pickup and Delivery (M2M). We experiment with two variants of our algorithm: one that minimizes estimated task durations (M2M), and one which incorporates SKU distribution into the objective function (M2M-wSKU). Simulation results over 8-hour warehouse operations show that our method consistently matches or outperforms prior state of the art, with M2M completing up to 22,000 more tasks on average across different environments and warehouse inventory densities.

2605.07831 2026-05-11 cs.CV

Explainable Part-Based Vehicle Classifier with Spatial Awareness

Andreas Caduff, Klaus Zahn, Jonas Hofstetter, Martin Rechsteiner, Patrick Flaig

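AI summary In intelligent transportation systems, fine-grained vehicle classification is important for efficient traffic management. This paper proposes an explainable part-based vehicle classifier with spatial awareness. It builds on an earlier approach that decomposes a conventional end-to-end convolutional neural network into part detection, feature construction, and decision-tree classification, which improves interpretability. By introducing spatial probability maps for the parts, the method becomes more robust to false part detections and achieves accuracy comparable to a state-of-the-art end-to-end CNN, challenging the presumed trade-off between accuracy and explainability.

English abstract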

In the area of Intelligent Transportation Systems (ITS), fine-grained vehicle classification systems play an essential role. Recently, the authors have presented a novel vision-based classification approach in which standard end-to-end Convolutional Neural Networks (CNNs) have been decomposed into 1) a CNN-based detector for semantically strong vehicle parts, followed by 2) feature construction and 3) final classification by a decision tree. In contrast to conventional CNNs, this allows both easy extensibility to new vehicle categories - without the need to fully retrain the part detector - and an important step towards the interpretability of the model, removing partially the black-box nature inherent to CNNs. Here we present an important extension of this approach that now incorporates spatial awareness of the vehicle parts: while the feature construction 2) of the previous approach used a binary decision for each feature (present vs. absent), now a full spatial probability map is constructed to condition the presence of each individual part with respect to a given vehicle category. The classification is performed using a softmax regression approach for the overall vehicle probabilities. This method shows a considerably improved robustness against false (part-)detections, a point that is crucial for practical application. Comparative analyses with a state-of-the-art end-to-end CNN indicate that our part-based methods achieve comparable accuracy, effectively challenging the presumed trade-off between accuracy and explainability. This research represents a significant advance in vehicle classification for ITS and forms the basis for systems that combine high accuracy with intuitive interpretability.

2605.07823 2026-05-11 cs.CL

SCENE: Recognizing Social Norms and Sanctioning in Group Chats

Mateusz Jacniacki, Maksymilian Bilski

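AI summary This paper presents SCENE, a benchmark for evaluating the ability of large language models to recognize social norms and respond to social sanctioning in group chats. SCENE sets up hidden norms and creates opportunities for violations, testing a model's responsiveness to negative sanctioning and its ability to learn norms from peer behavior. Experiments show that some closed-source models adapt to implicit social norms better than open-weight models, and SCENE offers a new direction for dynamic evaluation of LLM social capabilities.

English abstract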

Online group chats are social spaces with implicit behavior patterns that, when broken, are often met with social sanctioning from the group. The ability and willingness of LLM-based agents to recognize and adapt to these norms remain mostly unexplored. We introduce SCENE, a social-interaction benchmark focused on implicit norms and social sanctioning in multi-party chat. SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. We further propose behavioral evaluation metrics for two functional adaptation abilities: responsiveness to negative sanctioning, and adopting norms from peers' behavior. We evaluate six frontier and open-weight models on SCENE. Our results show that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models. SCENE contributes a benchmark that responds to recent calls for dynamic, interactional evaluation of LLM social capabilities.

2605.07821 2026-05-11 cs.CV cs.AI

Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection

Boyang Dai, Chaoqi Chen, Yizhou Yu

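AI summary This paper studies how object co-occurrence information can mitigate simplicity bias in out-of-distribution (OOD) detection for deep learning models. The authors propose an object-centric OOD detection framework that learns object co-occurrence patterns within images and splits the detection task into three co-occurrence-based scenarios, identifying near-OOD data more effectively. By taking semantic context within images into account, the method is more robust to semantic and covariate shifts and achieves competitive results across several challenging OOD settings.

Comments This paper has been accepted by CVPR 2026

English abstract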

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

2605.07817 2026-05-11 cs.CV cs.AI cs.CL

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti

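AI summary This paper proposes GazeVLM, a multimodal model that imitates human active vision through internal attention control to improve vision-language reasoning. The model autonomously generates gaze tokens ($\texttt{<LOOK>}$) that exert top-down control over its attention mask, dynamically focusing on task-relevant details and suppressing irrelevant visual information, so it can switch between global and local views without external tools. Experiments show that GazeVLM performs strongly on high-resolution multimodal reasoning, outperforming state-of-the-art models in its parameter class and agentic systems built around thinking with images.

English abstract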

Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{<LOOK>}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.