arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.13780 2026-05-13 cs.LG

Principled Latent Diffusion for Graphs via Laplacian Autoencoders

Antoine Siraudin, Christopher Morris

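AI summary This paper proposes LG-Flow, a latent diffusion model for graphs built on a Laplacian autoencoder, to address the quadratic growth in computational cost that conventional graph diffusion models incur as the number of nodes increases. By encoding graph structure into a low-dimensional latent space, the model achieves near-lossless graph reconstruction and avoids spending capacity on modeling absent edges in sparse graphs. The method combines a permutation-equivariant autoencoder with a Diffusion Transformer, improving the efficiency and scale of graph generation; experiments show competitive generation quality with up to a 1000× speed-up.

Comments Preprint, under review

Abstract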

Graph diffusion models achieve state-of-the-art performance in graph generation but suffer from quadratic complexity in the number of nodes -- and much of their capacity is wasted modeling the absence of edges in sparse graphs. Inspired by latent diffusion in other modalities, a natural idea is to compress graphs into a low-dimensional latent space and perform diffusion in that space. However, unlike images or text, graph generation requires nearly lossless reconstruction, as even a single error in decoding an adjacency matrix can render the entire sample invalid. This challenge has remained largely unaddressed. We propose LG-Flow, a latent graph diffusion framework that directly overcomes these obstacles. A permutation-equivariant autoencoder maps nodes to fixed-dimensional embeddings that enable near-lossless reconstruction of both undirected graphs and DAGs. The dimensionality of this latent representation scales linearly with the number of nodes, thereby removing the quadratic adjacency-space bottleneck in the diffusion process and enabling the training of substantially larger generative backbones. In this latent space, we train a Diffusion Transformer with flow matching, enabling efficient and expressive graph generation. Our approach achieves competitive results against state-of-the-art graph diffusion models while delivering up to a $1000\times$ speed-up. Our code is available at https://github.com/asiraudin/LG-Flow .
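The scaling argument in this abstract can be made concrete with a small sketch. It compares how many entries a model must handle in adjacency space against a per-node latent table. The latent width `latent_dim=32` is an illustrative assumption, not a value from the paper, and the function names are hypothetical.

```python
def adjacency_entries(num_nodes: int) -> int:
    """Entries in a dense adjacency matrix: quadratic in the node count."""
    return num_nodes * num_nodes


def latent_entries(num_nodes: int, latent_dim: int = 32) -> int:
    """Entries in a per-node latent table: linear in the node count.

    latent_dim is a hypothetical embedding width chosen for illustration.
    """
    return num_nodes * latent_dim


# Doubling the node count quadruples the adjacency size but only doubles
# the latent size, so the gap widens as graphs grow.
for n in (100, 1_000, 10_000):
    ratio = adjacency_entries(n) / latent_entries(n)
    print(f"n={n}: adjacency={adjacency_entries(n)}, "
          f"latent={latent_entries(n)}, ratio={ratio:.1f}")
```

This only counts representation size; the actual speed-up reported in the abstract depends on the full architecture and is not reproduced here.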

2601.07473 2026-05-13 cs.LG

AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

Michael J. Clark

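AI summary As models grow more capable, humans can no longer reliably verify their outputs. This paper proposes AntiPaSTO, a self-supervised method that steers model honesty internally by separating representations along an antiparallel axis and adding coherence constraints. Training requires only two contrasting words inserted into template sentences, with no human labels. Experiments show that it outperforms prompting baselines on 5 of 6 tested value axes, with preliminary evidence of bidirectional control.

Comments Code is available at https://github.com/wassname/AntiPaSTO

Abstract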

As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.

2601.07384 2026-05-13 cs.LG

CompNO: A Novel Foundation Model approach for solving Partial Differential Equations

Hamda Hmida, Hsiu-Wen Chang Joly, Youssef Mesri

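AI summary This paper proposes CompNO, a new foundation-model approach for solving parametric partial differential equations (PDEs). It learns a library of Foundation Blocks, each a Fourier neural operator for a fundamental differential operator, and assembles them with lightweight Adaptation Blocks into task-specific solvers, avoiding the high pretraining cost and limited interpretability of a single monolithic model. Experiments show that CompNO achieves lower relative L2 error than existing methods on linear parametric PDEs, remains competitive on Burgers' equation, and satisfies boundary conditions exactly, indicating good generalization and physical interpretability.

Comments Under review at MDPI

Abstract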

Partial differential equations (PDEs) govern a wide range of physical phenomena, but their numerical solution remains computationally demanding, especially when repeated simulations are required across many parameter settings. Recent Scientific Foundation Models (SFMs) aim to alleviate this cost by learning universal surrogates from large collections of simulated systems, yet they typically rely on monolithic architectures with limited interpretability and high pretraining expense. In this work we introduce Compositional Neural Operators (CompNO), a compositional neural operator framework for parametric PDEs. Instead of pretraining a single large model on heterogeneous data, CompNO first learns a library of Foundation Blocks, where each block is a parametric Fourier neural operator specialized to a fundamental differential operator (e.g. convection, diffusion, nonlinear convection). These blocks are then assembled, via lightweight Adaptation Blocks, into task-specific solvers that approximate the temporal evolution operator for target PDEs. A dedicated boundary-condition operator further enforces Dirichlet constraints exactly at inference time. We validate CompNO on one-dimensional convection, diffusion, convection--diffusion and Burgers' equations from the PDEBench suite. The proposed framework achieves lower relative L2 error than strong baselines (PFNO, PDEFormer and in-context learning based models) on linear parametric systems, while remaining competitive on nonlinear Burgers' flows. The model maintains exact boundary satisfaction with zero loss at domain boundaries, and exhibits robust generalization across a broad range of Peclet and Reynolds numbers. These results demonstrate that compositional neural operators provide a scalable and physically interpretable pathway towards foundation models for PDEs.
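Two quantities in this abstract are easy to illustrate: the relative L2 error used for comparison, and exact Dirichlet satisfaction at the boundaries. The sketch below is a minimal illustration, not the paper's boundary-condition operator, whose construction the abstract does not describe. Overwriting the boundary values is one simple way to get zero boundary error and is an assumption here, as are the function names.

```python
import math


def relative_l2_error(pred, target):
    """Relative L2 error: ||pred - target||_2 / ||target||_2.

    Assumes target is not identically zero.
    """
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    return num / den


def enforce_dirichlet(u, left, right):
    """Return a copy of the 1D field u with its two boundary values
    overwritten by the prescribed Dirichlet values (hypothetical helper)."""
    out = list(u)
    out[0] = left
    out[-1] = right
    return out


predicted = [0.3, 0.5, 0.7]
constrained = enforce_dirichlet(predicted, left=0.0, right=1.0)
print(constrained)  # boundary entries now match the prescribed values
print(relative_l2_error(constrained, [0.0, 0.5, 1.0]))
```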

2601.05752 2026-05-13 cs.CL cs.SE

AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang

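AI summary This paper introduces AutoMonitor-Bench, the first benchmark for systematically evaluating the reliability of LLM-based misbehavior monitors. It covers question answering, code generation, and reasoning, with 3,010 carefully annotated test samples. Monitors are evaluated with two metrics, Miss Rate (MR) and False Alarm Rate (FAR), which reveal a trade-off between detection ability and sensitivity across models. The authors also build a large-scale training corpus and fine-tune Qwen3-4B-Instruction to examine whether training on known misbehavior data improves monitoring of unseen, more implicit misbehavior, highlighting the challenges of building reliable, scalable LLM misbehavior monitors.

Comments ACL 2026 Findings

Abstract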

We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware design and training strategies for LLM-based monitors.
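The two metrics can be sketched as follows. The abstract only describes MR and FAR informally, so the formulas below follow the standard definitions (the share of misbehavior instances that go unflagged, and the share of benign instances that are flagged) and are an assumption about the benchmark's exact scoring. The function name and the sample data are hypothetical.

```python
def miss_rate_and_false_alarm_rate(labels, flags):
    """Compute (MR, FAR) for a monitor.

    labels[i] is True when sample i is a misbehavior instance.
    flags[i] is True when the monitor flagged sample i.
    Assumes at least one misbehavior and one benign sample.
    """
    on_misbehavior = [f for is_bad, f in zip(labels, flags) if is_bad]
    on_benign = [f for is_bad, f in zip(labels, flags) if not is_bad]
    miss_rate = sum(1 for f in on_misbehavior if not f) / len(on_misbehavior)
    false_alarm_rate = sum(1 for f in on_benign if f) / len(on_benign)
    return miss_rate, false_alarm_rate


# Four misbehavior samples followed by four benign ones (made-up data).
labels = [True, True, True, True, False, False, False, False]
flags = [True, True, True, False, True, False, False, False]
mr, far = miss_rate_and_false_alarm_rate(labels, flags)
print(mr, far)  # one miss out of four, one false alarm out of four
```

A monitor that flags everything drives MR to 0 and FAR to 1, and one that flags nothing does the reverse, which is the trade-off the abstract refers to.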

2601.03627 2026-05-13 cs.CL cs.AI

Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

Jean Seo, Gibaeg Kim, Kihun Shin, Seungseop Lim, Hyunkyung Lee, Wooseok Han, Jongwon Lee, Eunho Yang

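AI summary This paper presents EPAG, a benchmark dataset and framework for evaluating the pre-consultation ability of large language models (LLMs). Models are evaluated directly by comparing the History of Present Illness (HPI) against diagnostic guidelines, and indirectly through disease diagnosis. The study finds that small open-source models fine-tuned on a well-curated, task-specific dataset can outperform frontier LLMs at pre-consultation, and that a larger amount of HPI does not necessarily improve diagnostic performance. It also shows that the language of pre-consultation influences the characteristics of the dialogue. The dataset and evaluation pipeline are open-sourced to support LLM applications in clinical settings.

Comments EACL 2026 Industry

Abstract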

We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline at https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.

2512.22933 2026-05-13 cs.AI cs.CL

RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli

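AI summary This paper presents RW-Post, an auditable benchmark dataset for multimodal fact-checking in real-world settings. Each instance links the original social-media post with reasoning traces and explicit evidence drawn from human fact-check articles. The dataset supports several evaluation regimes, enabling systematic analysis of visual grounding and evidence utilization. Experiments show that current models still have substantial headroom in evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness.

Comments Code and dataset will be released at https://github.com/xudanni0927/AgentFact

Abstract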

Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW-Post, a post-aligned text--image benchmark for real-world multimodal fact-checking with auditable annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide AgentFact as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness.

2512.22579 2026-05-13 cs.AI cs.NI

SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G

Yong Xiao, Xubo Li, Haoran Zhou, Yingyu Li, Yayu Gao, Guangming Shi, Ping Zhang, Marwan Krunz

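AI summary This paper proposes SANet, a semantic-aware agentic networking framework for cross-layer optimization in 6G wireless networks. The framework infers the user's semantic goal and automatically assigns agents at different network layers to fulfill it. For the resulting multi-agent multi-objective problem, it proposes optimization methods that seek Pareto-optimal solutions. It also introduces a model partition and sharing (MoPS) mechanism to use computational resources more efficiently, and experiments confirm gains in both performance and computational efficiency.

Comments Accepted at IEEE Transactions on Mobile Computing

Journal ref
IEEE Transactions on Mobile Computing, 2026
Abstract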

Agentic AI networking (AgentNet) is a novel AI-native networking paradigm in which a large number of specialized AI agents collaborate to perform autonomous decision-making, dynamic environmental adaptation, and complex missions. It has the potential to facilitate real-time network management and optimization functions, including self-configuration, self-optimization, and self-adaptation across diverse and complex environments. This paper proposes SANet, a novel semantic-aware AgentNet architecture for wireless networks that can infer the semantic goal of the user and automatically assign agents associated with different layers of the network to fulfill the inferred goal. Motivated by the fact that AgentNet is a decentralized framework in which collaborating agents may generally have different and even conflicting objectives, we formulate the decentralized optimization of SANet as a multi-agent multi-objective problem, and focus on finding the Pareto-optimal solution for agents with distinct and potentially conflicting objectives. We propose three novel metrics for evaluating SANet. Furthermore, we develop a model partition and sharing (MoPS) framework in which large models, e.g., deep learning models, of different agents can be partitioned into shared and agent-specific parts that are jointly constructed and deployed according to agents' local computational resources. Two decentralized optimization algorithms are proposed. We derive theoretical bounds and prove that there exists a three-way tradeoff among optimization, generalization, and conflicting errors. We develop an open-source RAN and core network-based hardware prototype that implements agents to interact with three different layers of the network. Experimental results show that the proposed framework achieved performance gains of up to 14.61% while requiring only 44.37% of FLOPs required by state-of-the-art algorithms.
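Pareto optimality, which the abstract uses as the solution concept for agents with conflicting objectives, can be illustrated with a generic filter. This is a textbook sketch over a finite set of candidate solutions, assuming every objective is minimized. It is not one of the paper's two decentralized algorithms, and the objective names in the example are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each tuple is (latency, energy) for one candidate configuration;
# the two objective names are made up for illustration.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
print(pareto_front(candidates))  # (3, 3) is dominated by (2, 2)
```

No point on the front can improve one objective without worsening another, which is why a single "best" configuration generally does not exist when objectives conflict.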

2512.12177 2026-05-13 cs.AI

Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation

Aydin Ayanzadeh, Tim Oates

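AI summary This paper proposes Floorplan2Guide, an LLM-guided floorplan parsing method intended to improve indoor navigation for blind and low-vision (BLV) people. It converts architectural floor plans into navigable knowledge graphs and generates human-readable navigation instructions, reducing the manual preprocessing that earlier methods require. Experiments show that few-shot prompting improves navigation accuracy over zero-shot prompting in both simulated and real-world evaluations, and that graph-based spatial reasoning achieves a higher success rate than direct visual reasoning.

Comments Accepted for publication in the proceedings of the IEEE International Conference on Big Data (IEEE BigData 2025)

Journal ref
IEEE International Conference on Big Data (IEEE BigData 2025), pp. 7477-7485
Abstract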

Indoor navigation remains a critical challenge for people with visual impairments. The current solutions mainly rely on infrastructure-based systems, which limit their ability to navigate safely in dynamic environments. We propose a novel navigation approach that utilizes a foundation model to transform floor plans into navigable knowledge graphs and generate human-readable navigation instructions. Floorplan2Guide integrates a large language model (LLM) to extract spatial information from architectural layouts, reducing the manual preprocessing required by earlier floorplan parsing methods. Experimental results indicate that few-shot learning improves navigation accuracy in comparison to zero-shot learning on simulated and real-world evaluations. Claude 3.7 Sonnet achieves the highest accuracy among the evaluated models, with 92.31%, 76.92%, and 61.54% on the short, medium, and long routes, respectively, under 5-shot prompting of the MP-1 floor plan. The success rate of graph-based spatial structure is 15.4% higher than that of direct visual reasoning among all models, which confirms that graphical representation and in-context learning enhance navigation performance and make our solution more precise for indoor navigation of Blind and Low Vision (BLV) users.

2512.12165 2026-05-13 cs.CV

Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi, Sagnik Majumder, Kristen Grauman

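AI summary This paper studies audio-visual camera pose estimation from passive scene sounds and in-the-wild video, targeting camera motion estimation under visually degraded conditions. The authors propose a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Experiments on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted.

Abstract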

Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

2512.12131 2026-05-13 cs.LG cs.DC

BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

Zhengyang Wang, Ziyue Liu, Ruijie Zhang, Avinash Maurya, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

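AI summary This paper proposes BOOST, an efficient training framework for large-scale large language models with low-rank bottleneck architectures. To address the excessive communication and poor GPU utilization of conventional tensor parallelism on low-rank models, BOOST introduces Bottleneck-aware Tensor Parallelism and combines it with online RMSNorm, linear layer grouping, and low-rank activation checkpointing. Experiments on several low-rank bottleneck architectures show speedups of 1.46-1.91× over full-rank model baselines and 1.87-2.27× over low-rank models with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.

Abstract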

The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimal impact on accuracy. Despite their algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46-1.91$\times$ speedup over full-rank model baselines and 1.87-2.27$\times$ speedup over low-rank models with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.
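The saving from a low-rank bottleneck can be seen from a parameter count. The sketch below assumes the generic factorization in which a dense `d_out x d_in` weight is replaced by a `d_out x rank` matrix times a `rank x d_in` matrix. The abstract does not specify the architectures evaluated, so the layer shape and the rank are illustrative assumptions.

```python
def full_rank_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense linear layer (bias omitted)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters when the layer is factored through a rank-sized
    bottleneck: (rank x d_in) followed by (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer: hidden size 4096, bottleneck rank 512.
dense = full_rank_params(4096, 4096)
factored = low_rank_params(4096, 4096, 512)
print(dense, factored, dense / factored)
```

The factored layer is cheaper whenever `rank < d_in * d_out / (d_in + d_out)`, but the narrow intermediate activation also changes which tensors must be communicated under tensor parallelism, which is the bottleneck the abstract targets.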

2512.11321 2026-05-13 cs.CV

KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes

Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang

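AI summary This paper proposes KeyframeFace, a language-driven facial animation method that controls facial expressions precisely through semantic keyframes. Unlike existing methods that generate continuous frames directly from text, it borrows the keyframe paradigm from animation production, represents animation as semantic keyframes in an interpretable ARKit-based control space, and uses a large language model to generate keyframes aligned with text descriptions and emotion cues. Experiments show improved expression fidelity and semantic alignment over methods that do not use facial action semantics, along with a clearer semantic control structure.

Abstract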

Facial animation is a core component for creating digital characters in the computer graphics (CG) industry. A typical production workflow relies on sparse, semantically meaningful keyframes to precisely control facial expressions. Enabling such animation directly from natural-language descriptions could significantly improve content creation efficiency and accessibility. However, most existing methods adopt a text-to-continuous-frames paradigm, directly regressing dense facial motion trajectories from language. This formulation entangles high-level semantic intent with low-level motion, lacks explicit semantic control structure, and limits precise editing and interpretability. Inspired by the keyframe paradigm in animation production, we propose KeyframeFace, a framework for semantic facial animation from language via interpretable keyframes. Instead of predicting dense motion trajectories, our method represents animation as a sequence of semantically meaningful keyframes in an interpretable ARKit-based facial control space. A language-driven model leverages large language model (LLM) priors to generate keyframes that align with contextual text descriptions and emotion cues. To support this formulation, we construct a multimodal dataset comprising 2,100 expression scripts paired with monocular videos, per-frame ARKit coefficients, and manually annotated semantic keyframes. Experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment compared to methods that do not use facial action semantics.

2512.05683 2026-05-13 cs.CV physics.optics

Physics-Informed Graph Neural Networks for Frequency-Aware Optical Aberration Correction

Yong En Kok, Bowen Deng, Alexander Bentley, Andrew J. Parkes, Michael G. Somekh, Amanda J. Wright, Michael P. Pound

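AI summary This paper proposes ZRNet, a physics-informed graph neural network for frequency-aware optical aberration correction. It jointly performs Zernike coefficient prediction and optical image restoration, using a Zernike Graph module and a Frequency-Aware Alignment (FAA) loss to model the physical relationships between Zernike polynomials explicitly and to align image features with coefficient predictions in the Fourier domain. Experiments show state-of-the-art aberration correction and image restoration across diverse microscopy modalities and complex biological samples, and validation on data from a physical microscope supports its robustness and generalization.

Abstract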

Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. We further validate on experimental PSF data from a physical microscope and demonstrate robustness to realistic sensor noise, confirming generalisation beyond simulated conditions. Code is available at https://github.com/janetkok/ZRNet.

2512.00775 2026-05-13 cs.RO cs.SY eess.SY

SAGAS: Semantic-Aware Graph-Assisted Stitching for Offline Temporal Logic Planning

Ruijia Liu, Ancheng Hou, Xiang Yin

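AI summary This paper studies robotic task planning and execution under Linear Temporal Logic (LTL) specifications in a strictly offline, model-free setting. The authors propose SAGAS, a framework that combines the compositionality of symbolic synthesis with the data-driven reachability structure learned from offline trajectories. It learns a reusable latent reachability graph and a frozen goal-conditioned executor, then performs semantic graph augmentation and Büchi product search for each new LTL formula to produce executable, cost-efficient plans, enabling zero-shot generalization to unseen LTL tasks.

Abstract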

Linear Temporal Logic (LTL) provides a rigorous framework for specifying long-horizon robotic tasks, yet existing approaches face a trade-off: model-based synthesis relies on accurate labeled transition systems, whereas learning-based methods often require online interaction, task-specific rewards, or specification-conditioned training. We study LTL-specified robotic planning and execution in a stricter offline, model-free setting, where the agent is given only fixed, task-agnostic trajectory fragments, with no dynamics model, task demonstrations, or online data collection. To address this setting, we propose SAGAS, a framework that combines the compositionality of symbolic synthesis with the data-driven reachability structure learned from offline trajectories. SAGAS first learns a reusable latent reachability graph and a frozen goal-conditioned executor from fragmented offline data. For each new LTL formula, it performs task-time semantic graph augmentation to ground state-defined propositions on the learned graph, and applies Büchi product search to synthesize a cost-aware accepting prefix--suffix waypoint plan executed by the frozen executor. By shifting formula-specific reasoning from policy learning to test-time graph augmentation and symbolic search, SAGAS enables zero-shot generalization to unseen, data-supported LTL specifications without task-specific reward design, policy retraining, or online interaction. Experiments on LTL task suites constructed from OGBench locomotion domains show that this design produces executable and cost-efficient prefix--suffix behaviors for diverse unseen LTL tasks from fragmented offline data.

2511.22475 2026-05-13 cs.LG cs.CV

Adversarial Flow Models

Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

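AI summary This paper presents adversarial flow models, a class of generative models that combines the strengths of adversarial learning and flow models, supports one-step and multi-step generation, and is trained with an adversarial objective. Unlike traditional GANs, the generator is encouraged to learn a deterministic noise-to-data mapping, which significantly stabilizes training. Unlike consistency-based methods, it learns one-step or few-step generation directly without learning the intermediate timesteps of the probability flow, which avoids error accumulation and preserves model capacity. On ImageNet-256px, the XL/2 model achieves a new best FID of 2.38 under the 1NFE setting.

Comments ICML 2026

Abstract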

We present adversarial flow models, a class of generative models that belongs to both the adversarial and flow families. Our method supports native one-step and multi-step generation and is trained with an adversarial objective. Unlike traditional GANs, in which the generator learns an arbitrary transport map between the noise and data distributions, our generator is encouraged to learn a deterministic noise-to-data mapping. This significantly stabilizes adversarial training. Unlike consistency-based methods, our model directly learns one-step or few-step generation without having to learn the intermediate timesteps of the probability flow for propagation. This preserves model capacity and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model achieves a new best FID of 2.38. We additionally demonstrate end-to-end training of 56-layer and 112-layer models without any intermediate supervision, achieving FIDs of 2.08 and 1.94 with a single forward pass and surpassing the corresponding 28-layer 2NFE and 4NFE counterparts with equal compute and parameters. The code is available at https://github.com/ByteDance-Seed/Adversarial-Flow-Models

2511.17038 2026-05-13 cs.AI eess.IV stat.ML

DAPS++: Rethinking Diffusion Inverse Problems with Decoupled Posterior Annealing

Hao Chen, Renzheng Zhang, Scott S. Howard

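AI summary This paper proposes DAPS++, a diffusion-based solver for inverse problems, motivated by the observation that the diffusion prior offers limited guidance in existing solvers. The method fully decouples diffusion-based initialization from likelihood-driven refinement, so that reconstruction is guided more directly by measurement consistency while remaining numerically stable. Experiments show that DAPS++ achieves high computational efficiency and robust image restoration with fewer function evaluations and measurement-optimization steps.

Abstract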

From a Bayesian perspective, score-based diffusion solves inverse problems through joint inference, embedding the likelihood with the prior to guide the sampling process. However, this formulation fails to explain its practical behavior: the prior offers limited guidance, while reconstruction is largely driven by the measurement-consistency term, leading to an inference process that is effectively decoupled from the diffusion dynamics. We show that the diffusion prior in these solvers functions primarily as a warm initializer that places estimates near the data manifold, while reconstruction is driven almost entirely by measurement consistency. Based on this observation, we introduce DAPS++, which fully decouples diffusion-based initialization from likelihood-driven refinement, allowing the likelihood term to guide inference more directly while maintaining numerical stability and providing insight into why unified diffusion trajectories remain effective in practice. By requiring fewer function evaluations (NFEs) and measurement-optimization steps, DAPS++ achieves high computational efficiency and robust reconstruction performance across diverse image restoration tasks.

2511.16520 2026-05-13 cs.LG cs.CV eess.IV eess.SP

Saving Foundation Flow-Matching Priors for Inverse Problems

Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun

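AI summary This paper proposes FMPlug, a plug-in framework for improving how foundation flow-matching models perform on inverse problems. It combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization, adding problem-specific guidance while preserving Gaussian structure. Experiments show that FMPlug performs well on both image restoration and scientific inverse problems with few samples, offering a practical route to using foundation flow-matching models in these settings.

Comments Accepted by ICML 2026

Abstract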

Foundation flow-matching (FM) models promise universal priors for solving inverse problems (IPs); yet today, they trail behind domain-specific and even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. For evaluation, we consider both simple image restoration tasks and scientific IPs with a few similar samples -- where the prohibitive cost of data collection and model training hinders the development of domain-specific generative models. Our superior experimental results confirm the effectiveness of FMPlug. Overall, FMPlug paves the way for making foundation FM models practical, reusable priors for IPs, especially scientific ones with few similar samples. More details are available at https://sun-umn.github.io/xm-plug/ .

2511.14715 2026-05-13 cs.LG cs.AI cs.CR cs.DC cs.MA

FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning

Abolfazl Younesi, Leon Kiss, Zahra Najafabadi Samani, Juan Aznar Poveda, Thomas Fahringer

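AI summary Federated learning (FL) enables collaborative training while preserving data privacy, but it is vulnerable to malicious clients that compromise model integrity through Byzantine attacks, data poisoning, and similar means. To address this, the paper proposes FLARE, a framework based on adaptive multi-dimensional reputation that assesses client reliability with a continuous, multi-dimensional reputation score and combines adaptive thresholding, reputation-weighted aggregation, and Local Differential Privacy to improve robustness. Experiments show that FLARE maintains high model accuracy and converges faster than existing methods under a range of attack scenarios.

Comments The authors want to withdraw this manuscript for further verification and revision. We may release a substantially revised version in the future

Abstract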

Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. https://github.com/Anonymous0-0paper/FLARE
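Item (iii), reputation-weighted aggregation with soft exclusion, can be sketched as a weighted average in which low-reputation clients are down-weighted instead of dropped. The abstract does not give FLARE's actual weighting rule, so the `threshold` and `soft_factor` parameters, their default values, and the function name are all assumptions made for illustration.

```python
def reputation_weighted_aggregate(updates, reputations,
                                  threshold=0.5, soft_factor=0.1):
    """Average client updates, weighting each by its reputation.

    Clients below the threshold are kept but scaled by soft_factor
    (soft exclusion) rather than removed. Both parameters are
    hypothetical and not taken from the paper.
    """
    weights = [r if r >= threshold else r * soft_factor
               for r in reputations]
    total = sum(weights)
    if total == 0:
        raise ValueError("all clients have zero weight")
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) / total
            for i in range(dim)]


# Two consistent clients and one outlier with low reputation (made-up data).
updates = [[1.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
reputations = [0.9, 0.9, 0.1]
print(reputation_weighted_aggregate(updates, reputations))
```

In this example a plain mean would give 4.0 per coordinate, while the weighted result stays close to 1.0 because the outlier's contribution is limited in proportion to its reputation.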

2511.12034 2026-05-13 cs.CV cs.LG cs.MM

Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua

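AI summary Multimodal representation learning aligns information from different modalities into a unified latent space, but existing methods usually require all modalities to be present and struggle with data that has missing modalities. Taking an anchor shift perspective, this paper explains theoretically how missing modalities bias alignment and proposes CalMRL, which uses the priors and inherent connections among modalities to impute missing modalities and calibrate alignment at the representation level. Experiments show that the method mitigates anchor shift and improves performance on data with missing modalities.

Comments Accepted by ICML 2026

Abstract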

Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which was previously unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.

2510.25609 2026-05-13 cs.LG cs.AI eess.SP

Revisiting GAN with Bayes-Optimal Discrimination

Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad, Ben Liang, Alfred O. Hero

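AI summary This paper proposes an alternative to standard generative adversarial network (GAN) training in which the discriminator's objective shifts from cross-entropy loss to directly minimizing the discrimination Bayes error rate (BER). To this end, the authors adopt the Bayes optimal learning threshold (BOLT) loss and train the generator to maximize a surrogate of the discrimination BER. The approach unifies different GAN training objectives, exposes their trade-off between smoothness and tightness, and shows that, under balanced class priors, maximizing the surrogate BER minimizes the total variation distance between the data and generator distributions, while also establishing a link to Wasserstein GAN. Experiments show improved sample quality and coverage on image-generation tasks.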

Details
Abstract

We propose an alternative to the standard GAN training approach, in which the discriminator is a binary classifier trained by cross-entropy to distinguish real samples from generated ones. Instead, we directly target the discrimination Bayes error rate (BER). To this end, we use the recently proposed Bayes optimal learning threshold (BOLT) loss and train the generator to maximize a surrogate of the discrimination BER. This viewpoint gives a unified perspective on GAN training: different objectives can be interpreted as parameterized bounds on the discrimination BER that describe a trade-off between smoothness and tightness. We show that, under balanced class priors, maximizing the surrogate BER with an unconstrained discriminator minimizes the total variation between the data and generator distributions. By constraining the discriminator to be $1$-Lipschitz, the proposed maximization objective defines a discrepancy that is upper-bounded by the Wasserstein-1 distance, thereby linking it to Wasserstein GAN. Experiments on several image-generation datasets under matched architectures and optimization settings show that GAN training using the surrogate BER improves sample quality and coverage over standard baselines. This analysis suggests that the proposed Bayesian viewpoint can achieve a better trade-off between training stability and convergence of the generator to the data distribution.
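The link between the Bayes error rate and total variation mentioned in this abstract rests on a standard identity: with balanced class priors, the Bayes error of distinguishing two distributions equals half of one minus their total variation distance. The sketch below checks that identity on two small discrete distributions. The distributions and function names are illustrative assumptions, and the code does not implement the BOLT loss or the paper's surrogate.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def bayes_error_balanced(p, q):
    """Bayes error of telling p from q when both classes have prior 1/2.

    The Bayes-optimal rule picks the more likely source at each outcome,
    so it errs with probability 0.5 * min(p(x), q(x)) at outcome x.
    """
    return sum(0.5 * min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical "data" and "generator" distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_gen = [0.2, 0.3, 0.5]

tv = total_variation(p_data, p_gen)
ber = bayes_error_balanced(p_data, p_gen)

# Identity under balanced priors: BER = (1 - TV) / 2.
print(tv, ber, 0.5 * (1.0 - tv))
```

Because the Bayes error falls as total variation grows, a generator that pushes the discrimination error up toward 1/2 is pulling its distribution toward the data distribution in total variation.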

2510.24570 2026-05-13 cs.CL

BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Raphaël Bagat, Irina Illina, Emmanuel Vincent

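AI summary This paper proposes BEARD, a new framework for domain adaptation of the Whisper speech recognition model in low-resource settings that lack labeled data. The method combines the BEST-RQ self-supervised objective with knowledge distillation, adapting Whisper's encoder on unlabeled data while keeping it complementary to the pre-trained decoder. In the air traffic control communications domain, which features non-native speech, noise, and specialized phraseology, the method uses about 5,000 hours of untranscribed speech and 2 hours of transcribed speech, outperforms the previous baseline and the fine-tuned model, and achieves a 12% relative improvement over the fine-tuned model. The authors describe it as the first work to apply a self-supervised learning objective to Whisper domain adaptation.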

Comments Accepted to ICASSP 2026

Details
Abstract

Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder with unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms the previous baseline and the fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
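For readers unfamiliar with BEST-RQ, its objective is usually described as masked prediction of discrete targets produced by a random-projection quantizer: a frozen random matrix projects each speech frame, and the index of the nearest entry in a frozen random codebook becomes the label. The sketch below illustrates only that target-generation step in plain Python. The dimensions, seed, and toy frames are assumptions, and it does not reproduce how BEARD combines these targets with distillation.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Build a frozen random projection and a frozen random codebook."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
                  for _ in range(code_dim)]
    codebook = [_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Return the index of the codebook entry nearest to the projected frame."""
    projected = _normalize([sum(w * x for w, x in zip(row, frame))
                            for row in projection])
    distances = [sum((a - b) ** 2 for a, b in zip(projected, code))
                 for code in codebook]
    return distances.index(min(distances))


# Toy 8-dimensional "frames"; real inputs would be speech features.
projection, codebook = make_quantizer(feat_dim=8, code_dim=4, num_codes=16)
frames = [[math.sin(0.3 * t + d) for d in range(8)] for t in range(5)]
targets = [quantize(f, projection, codebook) for f in frames]
print(targets)  # one discrete label per frame
```

Neither the projection nor the codebook is trained, so the labels cost almost nothing to compute and stay fixed while the encoder learns to predict them for masked frames.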

2510.06371 2026-05-13 cs.CL cs.AI

OASIS: A Multilingual and Multimodal Dataset for Culturally Grounded Spoken Visual QA

Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling

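AI summary OASIS is a large-scale multilingual, multimodal dataset designed to support culturally grounded spoken visual question answering. It contains a large volume of image, text, and speech data covering English and multiple Arabic varieties, and is suited to evaluating models on commonsense reasoning, cultural understanding, and real-world scenarios. The work presents EverydayMMQA, a scalable semi-automatic framework for building localized QA resources, and uses multi-stage human validation to ensure data quality, providing support for training and evaluating multimodal models.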

Comments Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Contextual Understanding, Culturally Informed

Details
Abstract

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they are often limited when queries require cultural and visual information and everyday knowledge, particularly in low-resource and underrepresented languages. We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech, and 20K hours of voice-cloned speech, from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and Arabic varieties across 18 countries, covering Modern Standard Arabic (MSA) as well as dialectal Arabic. It is designed to evaluate models beyond object recognition, targeting pragmatic, commonsense, and culturally grounded reasoning in real-world scenarios. We benchmark four closed-source models, three open-source models, and one fine-tuned model on OASIS. The framework and dataset will be made publicly available to the community. https://huggingface.co/datasets/QCRI/OASIS

2510.05408 2026-05-13 cs.CV cs.AI

See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

Kebin Contreras, Luis Toscano-Palomino, Mauro Dalla Mura, Jorge Bacca

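AI summary This work proposes a time-reversed reconstruction method based on thermal imaging and visual language models that aims to recover the scene state from a few seconds earlier using current thermal traces. The method couples visual language models with a constrained diffusion process, generating scene descriptions and guiding image reconstruction to keep semantics and structure consistent. Experiments show that it can reconstruct plausible scenes from up to 120 seconds earlier in controlled settings, a first step toward time-reversed imaging from thermal traces.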

Details
Abstract

Recovering the past from present observations is an intriguing challenge with potential applications in forensics and scene analysis. Thermal imaging, operating in the infrared range, provides access to otherwise invisible information. Since humans are typically warmer (37 °C, or 98.6 °F) than their surroundings, interactions such as sitting, touching, or leaning leave residual heat traces. These fading imprints serve as passive temporal codes, allowing for the inference of recent events that exceed the capabilities of RGB cameras. This work proposes a time-reversed reconstruction framework that uses paired RGB and thermal images to recover scene states from a few seconds earlier. The proposed approach couples Visual-Language Models (VLMs) with a constrained diffusion process, where one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency. The method is evaluated in three controlled scenarios, demonstrating the feasibility of reconstructing plausible past frames up to 120 seconds earlier, providing a first step toward time-reversed imaging from thermal traces.

2510.04265 2026-05-13 cs.AI cs.CL math.ST stat.ML stat.TH

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

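AI summary This paper proposes a Bayesian framework for evaluating large language models, addressing the unstable and potentially misleading rankings that the conventional Pass@k metric produces when the number of samples is limited. By estimating a model's underlying success probability together with credible intervals, the method yields more stable, statistically meaningful rankings and supports flexible weighting of scoring rubrics. Experiments show that the framework converges faster and ranks more stably than Pass@k, clearly separates statistically significant differences from noise, and applies to both binary and non-binary evaluation.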

Comments OpenReview (ICLR 2026): https://openreview.net/forum?id=PTXi3Ef4sT

Details
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR), 2026
Abstract

Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
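The closed-form posterior described in this abstract follows from standard Dirichlet-categorical conjugacy, and a minimal sketch makes it concrete. Given per-category outcome counts, rubric weights, and a Dirichlet prior, the code below returns the posterior mean and standard deviation of the weighted score. The function name and the example counts are assumptions, and the interval uses a normal approximation chosen here for brevity, which may differ from the paper's procedure and from the `scorio` package.

```python
import math


def posterior_summary(counts, weights, prior=None, z=1.96):
    """Posterior mean, std, and approximate interval of a weighted rubric score.

    counts[k]  : number of trials that landed in outcome category k
    weights[k] : rubric score assigned to category k
    prior[k]   : Dirichlet concentration (defaults to uniform, all ones)
    """
    if prior is None:
        prior = [1.0] * len(counts)
    alpha = [a + n for a, n in zip(prior, counts)]  # posterior concentrations
    total = sum(alpha)
    probs = [a / total for a in alpha]  # posterior mean of each category
    mean = sum(w * p for w, p in zip(weights, probs))
    second = sum(w * w * p for w, p in zip(weights, probs))
    # Variance of a linear functional of a Dirichlet-distributed vector.
    var = (second - mean * mean) / (total + 1.0)
    std = math.sqrt(max(var, 0.0))
    # Normal approximation to the credible interval (an assumption here).
    return mean, std, (mean - z * std, mean + z * std)


# Binary case: 7 correct out of 10 trials, scored 0 (wrong) or 1 (right).
mean, std, interval = posterior_summary(counts=[3, 7], weights=[0.0, 1.0])
print(mean, std, interval)
```

In the binary case with a uniform prior this reduces to the mean and variance of a Beta(8, 4) distribution. A graded rubric only needs more categories and weights such as `[0.0, 0.5, 1.0]`.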

2510.02043 2026-05-13 cs.CV cs.HC cs.LG

Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers

Sahil Bhandary Karnoor, Romit Roy Choudhury

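AI summary This paper studies zero-shot human pose estimation with a limited number of sensors. The authors formulate pose estimation as an inverse problem and propose a diffusion-based inverse solver that conditions generation on rotational measurements alone while guiding it with a likelihood term derived from location measurements. The method needs no per-user fine-tuning and generalizes zero-shot across users, offering a new approach to pose estimation with few sensors.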

Comments Published as a Conference Paper at The Fourteenth International Conference on Learning Representations

Details
Abstract

Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both location and rotation measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.

2509.19207 2026-05-13 cs.CV

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva

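AI summary This paper studies the challenges contrastive vision-language models (VLMs) face in understanding long, compositional captions and analyzes the relationship between compositional reasoning and long-caption understanding. Controlled experiments across training objectives, datasets, and architectural designs reveal a bidirectional but sensitive relationship between the two: high-quality long-caption data with strong visual grounding improves both capabilities, while certain architectural choices can limit compositional learning. The study offers practical guidance on data selection and model design for improving VLM generalization.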

Comments To be published in Findings of ACL 2026

Details
Abstract

Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.

2509.14933 2026-05-13 cs.LG

DAG: A Dual Correlation Network for Time Series Forecasting with Exogenous Variables

Xiangfei Qiu, Yuhan Zhu, Zhengyu Li, Xingjian Wu, Bin Yang, Jilin Hu

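AI summary Time series forecasting matters in many domains, and exogenous variables (covariates) provide additional predictive information that improves accuracy. Existing methods, however, fall short in exploiting exogenous variables, especially future exogenous variables and their correlation with endogenous variables. This paper proposes DAG, which builds a dual correlation network along the temporal and channel dimensions to capture how historical exogenous variables correlate with future exogenous variables and with historical endogenous variables, and injects these correlations into the forecasting of future endogenous variables to improve accuracy.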

Comments Accepted by ICML 2026

Details
Abstract

Time series forecasting is essential in various domains. Compared to relying solely on endogenous variables (i.e., target variables), considering exogenous variables (i.e., covariates) provides additional predictive information and often leads to more accurate predictions. However, existing methods for time series forecasting with exogenous variables (TSF-X) have the following shortcomings: 1) they do not leverage future exogenous variables, and 2) they fail to fully account for the correlation between endogenous and exogenous variables. In this study, to better leverage exogenous variables, especially future exogenous variables, we propose DAG, which utilizes a Dual correlAtion network along both the temporal and channel dimensions for time series forecasting with exoGenous variables. Specifically, we propose two core components: the Temporal Correlation Module and the Channel Correlation Module. Both modules consist of a correlation discovery submodule and a correlation injection submodule. The former is designed to capture the correlation effects of historical exogenous variables on future exogenous variables and on historical endogenous variables, respectively. The latter injects the discovered correlation relationships into the processes of forecasting future endogenous variables based on historical endogenous variables and future exogenous variables.

2509.10692 2026-05-13 cs.RO

STL-Based Motion Planning and Uncertainty-Aware Risk Analysis for Human-Robot Collaboration with a Multi-Rotor Aerial Vehicle

Giuseppe Silano, Amr Afifi, Martin Saska, Antonio Franchi

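AI summary This paper proposes a Signal Temporal Logic (STL)-based motion planning and risk analysis framework aimed at improving collaboration between multi-rotor aerial vehicles and humans. The method uses STL to encode key mission objectives such as safety, temporal requirements, and human comfort, combines them with optimization-based planning to generate feasible trajectories that respect the vehicle's dynamic constraints, and adds an uncertainty-aware risk analysis to handle uncertainty in human pose. Experimental validation shows that the framework achieves safe, efficient, and robust human-robot collaboration under realistic operating conditions.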

Comments 46 pages, 14 figures

Details
Journal ref
Journal of Intelligent & Robotic Systems, 2026
Abstract

This paper presents a motion planning and risk analysis framework for enhancing human-robot collaboration with a Multi-Rotor Aerial Vehicle. The proposed method employs Signal Temporal Logic to encode key mission objectives, including safety, temporal requirements, and human preferences, with particular emphasis on ergonomics and comfort. An optimization-based planner generates dynamically feasible trajectories while explicitly accounting for the vehicle's nonlinear dynamics and actuation constraints. To address the resulting non-convex and non-smooth optimization problem, smooth robustness approximations and gradient-based techniques are adopted. In addition, an uncertainty-aware risk analysis is introduced to quantify the likelihood of specification violations under human-pose uncertainty. A robustness-aware event-triggered replanning strategy further enables online recovery from disturbances and unforeseen events by preserving safety margins during execution. The framework is validated through MATLAB and Gazebo simulations on an object handover task inspired by power line maintenance scenarios. Results demonstrate the ability of the proposed method to achieve safe, efficient, and resilient human-robot collaboration under realistic operating conditions.
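The smooth robustness approximations mentioned in this abstract can be illustrated with a common choice, a log-sum-exp soft minimum. The robustness of an "always above threshold" requirement is the worst-case margin over time, which is non-smooth; the soft minimum is a differentiable lower bound that tightens as the sharpness parameter grows. The sketch below assumes this particular approximation and uses made-up distances. The abstract does not say which approximation the paper adopts.

```python
import math


def soft_min(values, k=10.0):
    """Smooth lower bound on min(values) via log-sum-exp with sharpness k."""
    shift = min(values)  # subtract the min for numerical stability
    total = sum(math.exp(-k * (v - shift)) for v in values)
    return shift - math.log(total) / k


def always_above(signal, threshold, k=None):
    """Robustness of 'always signal > threshold': worst-case margin over time.

    With k=None the exact (non-smooth) minimum is used; otherwise the
    smooth approximation with sharpness k is returned.
    """
    margins = [x - threshold for x in signal]
    return min(margins) if k is None else soft_min(margins, k)


# Hypothetical human-to-vehicle distances (m) along a trajectory; keep > 1.0 m.
distances = [2.4, 1.9, 1.5, 1.3, 1.6, 2.2]
exact = always_above(distances, 1.0)
smooth = always_above(distances, 1.0, k=20.0)
print(exact, smooth)  # the smooth value never exceeds the exact one
```

Because the smooth value under-approximates the true robustness, a trajectory optimized to keep it positive also satisfies the exact specification, which is one reason this form suits gradient-based planners.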

2509.09838 2026-05-13 cs.LG cs.AI

Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives

Reza Asad, Reza Babanezhad, Sharan Vaswani

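AI summary This paper studies the limitations of discrete Soft Actor-Critic (DSAC) in discrete action spaces and proposes principled alternatives. The authors find that the main cause of DSAC's poor performance is the entropy coupling between the policy and the value function, and that decoupling them improves performance significantly. Building on this, they introduce a flexible off-policy actor-critic framework that supports new objectives, and they demonstrate its strengths theoretically and empirically on Atari games, where it remains robust even without entropy regularization or explicit exploration mechanisms.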

Details
Abstract

While Soft Actor-Critic (SAC) is highly effective in continuous control, its discrete counterpart (DSAC) performs poorly on challenging discrete-action domains such as Atari. Consequently, starting from DSAC, we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that by merely decoupling these components, DSAC's performance significantly improves. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case and yields novel objectives. Our framework allows using an m-step Bellman operator for the critic update, and instantiates the actor objective by combining standard policy optimization methods with entropy regularization. Theoretically, we prove that the proposed methods can guarantee convergence to the optimal regularized value function in the tabular setting, generalizing the results in prior work. Empirically, we evaluate the proposed objectives on standard Atari games. Our ablations indicate that, unlike DSAC, these objectives, including novel ones, perform robustly even without entropy regularization or explicit exploration mechanisms.

2509.06701 2026-05-13 cs.LG cs.AI

Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

Su Hyeong Lee, Risi Kondor, Richard Ngo

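AI summary This paper proposes a theory of intelligent agency based on probabilistic modeling for understanding latent agentic substructures in deep neural networks. By defining agents' outcome distributions and their epistemic utility and using weighted logarithmic pooling, the study examines how compositions of agents form and proves that strict unanimity is possible under certain conditions. It also formalizes an agentic alignment phenomenon in large language models, showing that eliciting a benevolent agent can induce an adversarial counterpart, and thereby offers a new mathematical framework and insights for aligning agentic AI systems.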

Comments Accepted by ICML 2026

Details
Abstract

We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.
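As a concrete reference for the two pooling rules contrasted in this abstract, the sketch below computes a weighted logarithmic pool (a normalized weighted geometric mean of the member distributions) and a linear pool (a weighted arithmetic mean) for two hypothetical agents over three outcomes. The agent distributions and weights are illustrative assumptions, and the code does not attempt to reproduce the paper's welfare or unanimity results.

```python
import math


def log_pool(distributions, weights):
    """Weighted logarithmic pool: p(x) proportional to prod_i p_i(x) ** w_i.

    Assumes every probability is strictly positive and the weights sum to 1.
    """
    n_outcomes = len(distributions[0])
    unnorm = [math.exp(sum(w * math.log(p[x])
                           for p, w in zip(distributions, weights)))
              for x in range(n_outcomes)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def linear_pool(distributions, weights):
    """Weighted linear pool: p(x) = sum_i w_i * p_i(x)."""
    n_outcomes = len(distributions[0])
    return [sum(w * p[x] for p, w in zip(distributions, weights))
            for x in range(n_outcomes)]


# Two hypothetical agents that disagree about three outcomes.
agent_a = [0.7, 0.2, 0.1]
agent_b = [0.1, 0.2, 0.7]
weights = [0.5, 0.5]

pooled_log = log_pool([agent_a, agent_b], weights)
pooled_lin = linear_pool([agent_a, agent_b], weights)
print(pooled_log)
print(pooled_lin)
```

In this example the logarithmic pool shifts mass toward the middle outcome, where both agents assign the same moderate probability, whereas the linear pool simply averages the two distributions.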

2508.21260 2026-05-13 cs.RO eess.SP math.ST stat.TH

Remarks on stochastic cloning and delayed-state filtering

Tara Mina, Lindsey Marinello, John Christian

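AI summary This paper studies estimation problems in aerospace navigation and robotics that involve delayed-state measurements depending on prior states, focusing on stochastic cloning (SC) and a long-overlooked alternative, the delayed-state Kalman filter (DSKF). It finds that a properly derived DSKF achieves the same state and covariance update as SC without state augmentation, and presents two equivalent DSKF formulations that explain from different angles how prior-state measurement correlations can be handled within the generalized Kalman filter. The DSKF is comparable to SC in computational and memory complexity and can further reduce arithmetic and storage costs for certain problem dimensions, which corrects the misconception that Kalman filters cannot handle correlated delayed-state measurements.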

Details
Abstract

Many estimation problems in aerospace navigation and robotics involve measurements that depend on prior states. A prominent example is odometry, which measures the relative change between states over time. Accurately handling these delayed-state measurements requires capturing their correlations with prior state estimates, and a widely used approach is stochastic cloning (SC), which augments the state vector to account for these correlations. This work revisits a long-established but often overlooked alternative--the delayed-state Kalman filter--and demonstrates that a properly derived filter yields exactly the same state and covariance update as SC, without requiring state augmentation. Moreover, two equivalent formulations of the delayed-state Kalman filter (DSKF) are presented, providing complementary perspectives on how the prior-state measurement correlations can be handled within the generalized Kalman filter. These formulations are shown to be comparable to SC in asymptotic computational and memory complexity, while one DSKF formulation can offer reduced arithmetic and storage costs for certain problem dimensions. Our findings clarify a common misconception that Kalman filter variants are inherently unable to handle correlated delayed-state measurements, demonstrating that an alternative formulation achieves the same results without state augmentation.
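The equivalence claimed in this abstract can be seen in a scalar toy problem. Assume a linear model x_k = f·x_{k-1} + w and an odometry-like measurement z = h1·x_k + h0·x_{k-1} + v, with w and v independent zero-mean noises of variance q and r. The sketch below performs one update with an augmented (cloned) state and one with a textbook-style delayed-state update that uses the cross-covariance f·P directly. All numbers and names are illustrative assumptions, and this is not either of the paper's two DSKF formulations.

```python
def stochastic_cloning_update(m, P, f, q, h1, h0, r, z):
    """Update via an augmented state [x_k, x_{k-1}] (stochastic cloning)."""
    # Augmented prior after propagating the current state and keeping the clone.
    mean = [f * m, m]
    cov = [[f * f * P + q, f * P],
           [f * P, P]]
    H = [h1, h0]
    # cov @ H^T, innovation variance S, and gain for both augmented components.
    cH = [cov[i][0] * H[0] + cov[i][1] * H[1] for i in range(2)]
    S = H[0] * cH[0] + H[1] * cH[1] + r
    K = [c / S for c in cH]
    innov = z - (H[0] * mean[0] + H[1] * mean[1])
    # Only the current-state block is reported.
    return mean[0] + K[0] * innov, cov[0][0] - K[0] * S * K[0]


def delayed_state_update(m, P, f, q, h1, h0, r, z):
    """Same update without augmentation, using the cross-covariance f * P."""
    P_pred = f * f * P + q
    S = h1 * P_pred * h1 + 2.0 * h1 * f * P * h0 + h0 * P * h0 + r
    K = (P_pred * h1 + f * P * h0) / S
    innov = z - (h1 * f * m + h0 * m)
    return f * m + K * innov, P_pred - K * S * K


# Hypothetical numbers: z measures the change x_k - x_{k-1} (h1 = 1, h0 = -1).
args = dict(m=1.0, P=0.5, f=1.0, q=0.1, h1=1.0, h0=-1.0, r=0.05, z=0.2)
sc_mean, sc_var = stochastic_cloning_update(**args)
ds_mean, ds_var = delayed_state_update(**args)
print(sc_mean, sc_var)
print(ds_mean, ds_var)
```

Both routines return the same posterior mean and variance for the current state. The delayed-state version reaches it without storing or updating the 2×2 augmented covariance.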