arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.10047 2026-06-16 cs.SE cs.AI cs.HC

Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction

Brian Freeman, Adam Kicklighter, Matt Erdman, Zach Gordon

Affiliations * Trane Technologies

Comments 50 pages, 5 tables, 7 figures

Abstract

Hallucinations in large language models (LLMs) are outputs that are syntactically coherent but factually incorrect or contextually inconsistent. They are persistent obstacles in high-stakes industrial settings such as engineering design, enterprise resource planning, and IoT telemetry platforms. We present and compare five prompt engineering strategies intended to reduce the variance of model outputs and move toward repeatable, grounded results without modifying model weights or creating complex validation models. These methods include: (M1) Iterative Similarity Convergence, (M2) Decomposed Model-Agnostic Prompting, (M3) Single-Task Agent Specialization, (M4) Enhanced Data Registry, and (M5) Domain Glossary Injection. Each method is evaluated against an internal baseline using an LLM-as-Judge framework over 100 repeated runs per method (same fixed task prompt, stochastic decoding at tau = 0.7). Under this evaluation setup, M4 (Enhanced Data Registry) received "Better" verdicts in all 100 trials; M3 and M5 reached 80% and 77% respectively; M1 reached 75%; and M2 was net negative at 34% when compared to single-shot prompting with a modern foundation model. We then developed enhanced version 2 (v2) implementations and assessed them on a 10-trial verification batch; M2 recovered from 34% to 80%, the largest gain among the four revised methods. We discuss how these strategies help overcome the non-deterministic nature of LLM results for industrial procedures, even when absolute correctness cannot be guaranteed. We provide pseudocode, verbatim prompts, and batch logs to support independent assessment.

2508.12365 2026-06-16 cs.IR cs.AI cs.CL

TaoSR1: The Thinking Model for E-commerce Relevance Search

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng

Affiliations * Taobao & Tmall Group of Alibaba

Journal ref
KDD '26: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2026
Abstract

Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.

2505.04382 2026-06-16 eess.AS cs.LG cs.SD

Discrete Optimal Transport and Voice Conversion

Anton Selitskiy, Maitreya Kocharekar

Affiliations * The University of Texas at Austin * University of California, Berkeley

Comments 5 pages, 1 figure, 7 tables. 11th International Conference on Machine Learning Technologies (ICMLT), Berlin, Germany, May 2026

Abstract

We propose kDOT, a discrete optimal transport (OT) framework for voice conversion (VC) operating in a pretrained speech embedding space. In contrast to the averaging strategies used in kNN-VC and SinkVC, and the independence assumption adopted in MKL, our method employs the barycentric projection of the discrete OT plan to construct a transport map between source and target speaker embedding distributions. We conduct a comprehensive ablation study over the number of transported embeddings and systematically analyze the impact of source and target utterance duration. Experiments on LibriSpeech demonstrate that OT with barycentric projection consistently improves distribution alignment and often outperforms averaging-based approaches in terms of WER, MOS, and FAD. Furthermore, we show that applying discrete OT as a post-processing step can transform spoofed speech into samples that are misclassified as bona fide by a state-of-the-art spoofing detector. This demonstrates the strong domain adaptation capability of OT in embedding space, while also revealing important security implications for spoof detection systems.
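
The barycentric projection named in this abstract can be illustrated with a minimal sketch. It assumes a transport plan is already available as a matrix whose entry `plan[i][j]` is the mass moved from source embedding `i` to target embedding `j`; how kDOT computes that plan is not shown here. The function name and the toy data are hypothetical.

```python
from typing import List, Sequence


def barycentric_projection(
    plan: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Map each source point to the plan-weighted mean of the target points.

    plan[i][j] is the mass moved from source i to target j (assumed given;
    computing the plan is outside this sketch).
    """
    dim = len(targets[0])
    mapped = []
    for row in plan:
        mass = sum(row)
        if mass <= 0:
            raise ValueError("each source point must carry positive mass")
        mapped.append(
            [sum(w * y[d] for w, y in zip(row, targets)) / mass for d in range(dim)]
        )
    return mapped


# Toy example: two source points, two 2-D target embeddings.
plan = [[0.5, 0.0], [0.25, 0.25]]
targets = [[0.0, 0.0], [2.0, 2.0]]
converted = barycentric_projection(plan, targets)
```

In this toy case the first source point sends all of its mass to the first target and lands on it, while the second splits its mass evenly and lands on the midpoint of the two targets.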

2509.16370 2026-06-16 math.OC cs.MS cs.RO cs.SY eess.SY

Dual-Regularized Riccati Recursions for Interior-Point Optimal Control

João Sousa-Pinto, Dominique Orban

Affiliations * IMT School for Advanced Studies, Lucca

Abstract

We derive closed-form extensions of the sequential and parallel Riccati recursions for solving dual-regularized linear-quadratic regulator (LQR) problems, with $O(N)$ sequential time and $O(\log(N))$ parallel time, respectively. We show that these subproblems arise when using regularized primal-dual interior-point methods to solve smooth, constrained, non-convex, discrete-time optimal control problems via multiple-shooting, even in the presence of stagewise equality or inequality constraints, and without imposing any rank requirements on constraint Jacobians. We prove that, when certain inertia conditions on the Newton-KKT matrix are met, each nonzero primal step is a descent direction of an augmented barrier-Lagrangian merit function. We characterize these inertia conditions in terms of the positive-definiteness of the dual-regularized Riccati pivots (a weaker condition than the standard LQR positive-definiteness requirements), thereby yielding inexpensive certificates of the required inertia. We provide MIT-licensed implementations of our methods in C++ and in JAX, as well as a full formalization of our results in Lean. We benchmark our algorithm against leading optimal control and nonlinear programming solvers on complex trajectory optimization problems, establishing competitive performance on moderate problems and substantial gains as the horizon length, problem dimension, and constraint count increase.
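
For orientation, the sketch below shows the standard, unregularized backward Riccati recursion for a scalar LQR problem. It is not the dual-regularized extension derived in the paper; it only illustrates the single backward pass that gives the $O(N)$ sequential cost. The function name, the scalar setting, and the cost convention are assumptions made for illustration.

```python
from typing import List, Tuple


def scalar_lqr_riccati(
    a: float, b: float, q: float, r: float, q_final: float, horizon: int
) -> Tuple[List[float], List[float]]:
    """Backward Riccati recursion for x[k+1] = a*x[k] + b*u[k].

    Stage cost is q*x^2 + r*u^2 and terminal cost is q_final*x^2.
    Returns (gains, values) with u[k] = -gains[k] * x[k] and values[k] = P_k.
    """
    values = [0.0] * (horizon + 1)
    gains = [0.0] * horizon
    values[horizon] = q_final
    for k in range(horizon - 1, -1, -1):  # one backward pass: O(N) sequential time
        p_next = values[k + 1]
        gains[k] = (b * p_next * a) / (r + b * b * p_next)
        values[k] = q + a * a * p_next - a * b * p_next * gains[k]
    return gains, values


gains, values = scalar_lqr_riccati(a=1.0, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=50)
```

With all coefficients equal to one, the fixed point of the recursion satisfies $P^2 - P - 1 = 0$, so `values[0]` approaches the golden ratio as the horizon grows.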

2510.24987 2026-06-16 q-bio.QM cs.LG q-bio.GN

scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration

Jianle Sun, Chaoqi Liang, Ran Wei, Peng Zheng, Lei Bai, Wanli Ouyang, Hongliang Yan, Peng Ye

Affiliations * Shanghai Artificial Intelligence Laboratory * Carnegie Mellon University * The Chinese University of Hong Kong * Guangzhou Laboratory

Comments Accepted at NeurIPS 2025 (Spotlight)

Journal ref
Advances in Neural Information Processing Systems 38 (2025): 154538-154565
Abstract

Advances in single-cell sequencing have enabled high-resolution profiling of diverse molecular modalities, while integrating unpaired multi-omics single-cell data remains challenging. Existing approaches either rely on pair information or prior correspondences, or require computing a global pairwise coupling matrix, limiting their scalability and flexibility. In this paper, we introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration. Specifically, we disentangle each cell's latent representations into modality-shared and modality-specific components using a well-designed $β$-VAE architecture, which are augmented with isometric regularization to preserve intra-omics biological heterogeneity, an adversarial objective to encourage cross-modal alignment, and a masked reconstruction loss strategy to address the issue of missing features across modalities. Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation. Crucially, it scales effectively to large-scale datasets and supports integration of more than two omics, offering a powerful and flexible solution for large-scale multi-omics data integration and downstream biological discovery.

2502.05214 2026-06-16 eess.IV cs.AI cs.CV

CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

Affiliations * School of Informatics, University of Edinburgh * NHS Lothian

Abstract

Deep learning models for medical image classification tasks are becoming widely implemented in AI-assisted diagnostic tools, aiming to enhance diagnostic accuracy, reduce clinician workloads, and improve patient outcomes. However, their vulnerability to adversarial attacks poses significant risks to patient safety. Current attack methodologies use general techniques such as model querying or pixel value perturbations to generate adversarial examples designed to fool a model. These approaches may not adequately address the unique characteristics of clinical errors stemming from missed or incorrectly identified clinical features. We propose the Concept-based Report Perturbation Attack (CoRPA), a clinically-focused black-box adversarial attack framework tailored to the medical imaging domain. CoRPA leverages clinical concepts to generate adversarial radiological reports and images that closely mirror realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our evaluation reveals that deep learning models exhibiting strong resilience to conventional adversarial attacks are significantly less robust when subjected to CoRPA's clinically-focused perturbations. This underscores the importance of addressing domain-specific vulnerabilities in medical AI systems. By introducing a specialized adversarial attack framework, this study provides a foundation for developing robust, real-world-ready AI models in healthcare, ensuring their safe and reliable deployment in high-stakes clinical environments.

2410.20066 2026-06-16 eess.SP cs.AI

A Multi-Modal Non-Invasive Deep Learning Framework for Progressive Prediction of Seizures

Ali Saeizadeh, Douglas Schonholtz, Joseph S. Neimat, Pedram Johari, Tommaso Melodia

Affiliations * Institute for the Wireless Internet of Things, Northeastern University, Boston, MA, U.S.A. * University of Louisville, Louisville, KY, U.S.A.

Comments 4 pages, 5 figures, Proceedings of the IEEE 20th International Conference on Body Sensor Networks (BSN), October 2024

Journal ref
2024 IEEE 20th International Conference on Body Sensor Networks (BSN)
Abstract

This paper introduces an innovative framework designed for progressive (granular in time to onset) prediction of seizures through the utilization of a Deep Learning (DL) methodology based on non-invasive multi-modal sensor networks. Epilepsy, a debilitating neurological condition, affects an estimated 65 million individuals globally, with a substantial proportion facing drug-resistant epilepsy despite pharmacological interventions. To address this challenge, we advocate for predictive systems that provide timely alerts to individuals at risk, enabling them to take precautionary actions. Our framework employs advanced DL techniques and uses personalized data from a network of non-invasive electroencephalogram (EEG) and electrocardiogram (ECG) sensors, thereby enhancing prediction accuracy. The algorithms are optimized for real-time processing on edge devices, mitigating privacy concerns and minimizing data transmission overhead inherent in cloud-based solutions, ultimately preserving battery energy. Additionally, our system predicts the countdown time to seizures (with 15-minute intervals up to an hour prior to the onset), offering critical lead time for preventive actions. Our multi-modal model achieves 95% sensitivity, 98% specificity, and 97% accuracy, averaged among 29 patients.

2410.11861 2026-06-16 cs.HC cs.AI

Investigating Role of Big Five Personality Traits in Audio-Visual Rapport Estimation

Takato Hayashi, Ryusei Kimura, Ryo Ishii, Shogo Okada

Affiliations * Japan Advanced Institute of Science and Technology * Human Informatics Laboratories, NTT Corporation

Comments 9 pages, 5 figures

Journal ref
International Conference on Automatic Face and Gesture Recognition (FG2025)
Abstract

Automatic rapport estimation in social interactions is a central component of affective computing. Recent reports have shown that the estimation performance of rapport in initial interactions can be improved by using the participant's personality traits as the model's input. In this study, we investigate whether this finding applies to interactions between friends by developing rapport estimation models that utilize nonverbal cues (audio and facial expressions) as inputs. Our experimental results show that adding Big Five features (BFFs) to nonverbal features can improve the estimation performance of self-reported rapport in dyadic interactions between friends. Next, we demystify how BFFs improve the estimation performance of rapport through a comparative analysis between models with and without BFFs. We decompose rapport ratings into perceiver effects (people's tendency to rate other people), target effects (people's tendency to be rated by other people), and relationship effects (people's unique ratings for a specific person) using the social relations model. We then analyze the extent to which BFFs contribute to capturing each effect. Our analysis demonstrates that the perceiver's and the target's BFFs lead estimation models to capture the perceiver and the target effects, respectively. Furthermore, our experimental results indicate that the combinations of facial expression features and BFFs achieve the best estimation performance not only in estimating rapport ratings, but also in estimating the three effects. Our study is the first step toward understanding why personality-aware estimation models of interpersonal perception accomplish high estimation performance.

2409.06708 2026-06-16 cs.CY cs.AI cs.HC

Ensuring Fairness with Transparent Auditing of Quantitative Bias in AI Systems

Chih-Cheng Rex Yuan, Bow-Yaw Wang

Affiliations * Institute of Information Science, Academia Sinica

Journal ref
Proc. 2024 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC), Seoul, Republic of Korea, 2024, pp. 25-32
Abstract

With the rapid advancement of AI, there is a growing trend to integrate AI into decision-making processes. However, AI systems may exhibit biases that lead decision-makers to draw unfair conclusions. Notably, the COMPAS system used in the American justice system to evaluate recidivism was found to favor racial majority groups; specifically, it violates a fairness standard called equalized odds. Various measures have been proposed to assess AI fairness. We present a framework for auditing AI fairness, involving third-party auditors and AI system providers, and we have created a tool to facilitate systematic examination of AI systems. The tool is open-sourced and publicly available. Unlike traditional AI systems, we advocate a transparent white-box and statistics-based approach. It can be utilized by third-party auditors, AI developers, or the general public for reference when judging the fairness criterion of AI systems.
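
Equalized odds, the fairness standard named in this abstract, requires true-positive and false-positive rates to match across groups. The sketch below computes the per-group rates and the largest gap in each. It is a generic illustration, not the authors' auditing tool; the function names and the toy labels are hypothetical.

```python
from typing import Dict, Hashable, Sequence, Tuple


def group_rates(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Dict[Hashable, Tuple[float, float]]:
    """Return {group: (true_positive_rate, false_positive_rate)}."""
    counts: Dict[Hashable, Dict[str, int]] = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if truth == 1:
            c["tp" if pred == 1 else "fn"] += 1
        else:
            c["fp" if pred == 1 else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        tpr = c["tp"] / positives if positives else float("nan")
        fpr = c["fp"] / negatives if negatives else float("nan")
        rates[group] = (tpr, fpr)
    return rates


def equalized_odds_gaps(rates: Dict[Hashable, Tuple[float, float]]) -> Tuple[float, float]:
    """Largest between-group difference in TPR and in FPR (0, 0 means satisfied)."""
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Toy data: group A is classified worse than group B.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = group_rates(y_true, y_pred, groups)
gaps = equalized_odds_gaps(rates)
```

A nonzero gap in either rate indicates a violation of equalized odds on the audited sample; how large a gap is tolerable is a policy choice outside this sketch.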

2606.17049 2026-06-16 cs.CV New submission

BRDFusion: Physics Meets Generation for Urban Scene Inverse Rendering

Yi-Ruei Liu, Jie-Ying Lee, Zheng-Hui Huang, Yu-Lun Liu, Chih-Hao Lin

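AI summary: Proposes the BRDFusion framework, which combines physical modeling with generative priors for inverse rendering of urban scenes, repairing artifacts while preserving physical consistency and supporting novel-view relighting, night simulation, and dynamic object editing.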

Comments Project page: https://shigon255.github.io/brdfusion-page/

Abstract

Inverse rendering of urban scenes from captured videos enables numerous applications, including content creation and autonomous driving simulation. Physically-based rendering methods follow and control lighting physics, but suffer from reconstruction and rendering artifacts. While generative models produce realistic videos, they offer limited consistency and controllability. We present BRDFusion, a unified framework that combines two complementary models for inverse and forward rendering. Specifically, BRDFusion recovers explicit, consistent scene properties with physical modeling and alleviates optimization ambiguity with generative priors. During forward rendering, the physical model provides controllable rendering from the scene configuration, and the generative model denoises and fixes artifacts. Therefore, our method produces high-quality videos while allowing precise control, outperforming baselines in real and synthetic scenes. Moreover, BRDFusion supports novel-view relighting, night simulation, and dynamic object insertion/editing. Project page: https://shigon255.github.io/brdfusion-page/

2606.17037 2026-06-16 cs.CV cs.AI cs.LG New submission

The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test of Image Classifiers

Alper Yıldırım

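AI summary: Using internal phase-magnitude transplant experiments, finds that the predictions of image classifiers (such as PRISM2D, GFNet, and ViT-B/16) rely mainly on phase/sign information, while image-specific magnitude contributes little to the readout; ResNet-50 carries a latent sign code before its ReLUs, revealing a mechanism behind the texture-shape gap between CNNs and attention models.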

Abstract

Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture-shape gap between CNNs and attention models.
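
The phase transplant described here can be illustrated at the signal level. The sketch below swaps Fourier phase between two short real-valued 1-D signals using a naive DFT from the standard library. It assumes the classic Oppenheim-Lim setup on raw signals, not the paper's hidden-layer intervention, and all function names and sample values are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (fine for short signals)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: Sequence[complex]) -> List[complex]:
    """Inverse of dft above."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def transplant_phase(
    magnitude_donor: Sequence[float], phase_donor: Sequence[float]
) -> List[float]:
    """Combine the Fourier magnitude of one signal with the phase of another."""
    mag_spec = dft(magnitude_donor)
    phase_spec = dft(phase_donor)
    hybrid = [
        abs(m) * cmath.exp(1j * cmath.phase(p)) for m, p in zip(mag_spec, phase_spec)
    ]
    # Both inputs are real, so the hybrid spectrum stays conjugate-symmetric
    # and the imaginary part of the inverse transform is numerical noise.
    return [value.real for value in idft(hybrid)]


x = [1.0, 2.0, 0.5, -1.0]
y = [0.0, 1.0, 3.0, 1.0]
hybrid_signal = transplant_phase(x, y)  # magnitude from x, phase from y
```

Transplanting a signal's phase onto its own magnitude returns the original signal, which is a convenient sanity check before comparing hybrids built from two different inputs.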

2606.17035 2026-06-16 cs.LG cs.CR New submission

Your Privacy My Cloak: Backdoor Attacks on Differentially Private Federated Learning

Xiaolin Li, Ning Wang, Ninghui Li, Wenhai Sun

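AI summary: Targeting differentially private federated learning, proposes the RING attack, which exploits the masking effect of differential privacy to bypass defenses and reaches an average attack success rate of 90.3% under a moderate privacy budget.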

Abstract

Prior research suggests that differential privacy (DP) inherently enhances the robustness of federated learning (FL) against backdoor attacks. In this paper, we challenge this assumption. Through an empirical analysis of two baseline attack strategies, we uncover a fundamental tension in DP-FL: while bypassing DP allows state-of-the-art defenses to detect and filter malicious updates, complying with DP inadvertently masks their distinguishing statistical characteristics. Consequently, existing defenses become ineffective as DP reduces the raw backdoor signal. Building on this masking effect, we propose RING, a novel attack that explicitly exploits DP to conceal malicious contributions while maximizing attack impact. By collaboratively crafting adversarial perturbations, compromised clients reconstruct a strong backdoor signal during aggregation without triggering anomaly detection. RING operates as a perturbation layer that is agnostic to the underlying backdoor technique, making it broadly applicable and composable with existing attacks -- a property that significantly amplifies the threat it poses to DP-FL. Extensive evaluations across four image and text datasets under non-iid distributions show that RING achieves an average attack success rate of 90.3% against six state-of-the-art defenses under a moderate privacy budget, an improvement of up to 26.08x over baseline strategies. Finally, we evaluate potential countermeasures and find that mitigating this threat incurs significant utility trade-offs, exposing a fundamental security gap in the deployment of differentially private FL.

2606.17030 2026-06-16 cs.CV New submission

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Chenxu Lv, Deqing Li, Gengze Zhou, Hang Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zhixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Xiong-Hui Chen, Chenfei Wu

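AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, achieving the best results on several benchmarks.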

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.17028 2026-06-16 cs.LG cs.AI cs.AR New submission

HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting

Alper Yıldırım

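AI summary: Proposes HAMON, a passive diffractive optical forecasting core that replaces digital sequence-mixing layers with optical propagation; it outperforms or approaches the strongest digital baselines on several benchmarks, reducing MSE by up to 14%.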

Abstract

Simple linear and frequency-domain models remain surprisingly competitive in long-horizon time-series forecasting, and recent mechanistic evidence suggests that standard forecasting benchmarks may not require the dense superposed representations that make transformers powerful in other domains. This raises a substrate-level question: if the core forecasting operator is often low-complexity and approximately linear, does it need to be implemented as learned digital temporal mixing? We introduce HAMON, a passive diffractive optical forecasting core in which historical values are encoded onto an optical aperture, future positions are left dark, and cascaded trainable phase masks with free-space diffraction shape the forecast directly in the output field. At inference, prediction is performed by a single passive optical propagation pass with no trainable digital sequence-mixing layer. Across standard benchmarks, HAMON outperforms the strongest digital baselines considered on ETTm2 at all horizons and on ETTh2 at all but the longest horizon, improving MSE by up to 14% and doing so consistently across horizons rather than at isolated points. It is competitive on Weather and trails the strongest baselines on the remaining ETT settings and on the high-channel-count Traffic and Electricity datasets. Phase encoding, intensity-compatible readout, and phase-scrambling ablations, together with a TorchOptics cross-simulator check, indicate that the forecasts arise from the data-bearing optical field rather than from a digital forecasting head. Because the passive core uses standard Fourier optics, HAMON defines a concrete target for optical hardware and for passive physical sequence mixing.

2606.17005 2026-06-16 cs.AI stat.ME New submission

Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

Yanan Long

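AI summary: Uses Bayesian inference and audit methods to analyze selective reporting and missing data in public AI evaluation archives, finding that a single terminal record is compatible with multiple historical paths and verifying that audit gates filter out unsupported claims.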

Abstract

Public AI evaluations are often read as terminal leaderboards, yet the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. Repeated public archives for LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record; LMArena provides a preference stress test; and GAIA and tau-bench contribute limited agentic pilots. Together, these archives instantiate a Bayesian inference problem: under a fixed reporting convention, one constructed terminal-only example over 1,000 systems is compatible with two pre-terminal histories, yielding times of 23.03 or 75.13 to reach within 0.05 of the ceiling under the same terminal-tail model. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes. The candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration; correspondingly, fixed audit gates reject its stronger claims. An archive-and-adjudication protocol reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.

2606.16995 2026-06-16 cs.AI cs.LG New submission

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

Nathan Gavenski, Juarez Monteiro, Francisco Galuppo, Adriano Veloso, Odinaldo Rodrigues

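AI summary: Proposes PACT, a hybrid architecture that combines a fast reactive reinforcement learning policy with a slow small language model planner, asynchronously generating and validating candidate action plans to improve policy performance in unfamiliar environments.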

Comments LM4Plan Workshop at ICML 2026

Abstract

Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. PACT invokes the SLM asynchronously to generate and validate candidate action plans. Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it. Evaluated on three FrozenLake configurations of increasing difficulty, PACT outperforms all baselines while relying on a 2B-parameter SLM backbone, suggesting that deliberative planning and reactive execution are more powerful in concert than either is alone in these settings.
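
The simulation check described in this abstract (a plan is executed only once verified as safe, feasible, and complete) can be sketched on a small grid written in FrozenLake notation (`S` start, `F` frozen, `H` hole, `G` goal). The sketch assumes deterministic, non-slippery moves and treats stepping off the grid as infeasible; both are simplifying assumptions, and `verify_plan` and the map are hypothetical, not PACT's actual verifier.

```python
from typing import Sequence

# Action letters map to (row, column) offsets; the letters are an assumption.
MOVES = {"L": (0, -1), "D": (1, 0), "R": (0, 1), "U": (-1, 0)}


def verify_plan(grid: Sequence[str], plan: str) -> bool:
    """Return True only if the plan is safe, feasible, and complete."""
    row, col = next(
        (r, c)
        for r, line in enumerate(grid)
        for c, cell in enumerate(line)
        if cell == "S"
    )
    for action in plan:
        d_row, d_col = MOVES[action]
        row, col = row + d_row, col + d_col
        if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
            return False  # infeasible: the move leaves the grid
        if grid[row][col] == "H":
            return False  # unsafe: the move enters a hole
    return grid[row][col] == "G"  # complete only if the plan ends on the goal


grid = ["SFFF", "FHFH", "FFFH", "HFFG"]
accepted = verify_plan(grid, "DDRRDR")  # reaches the goal without touching a hole
rejected = verify_plan(grid, "RD")      # second move steps into a hole
```

A planner's candidate would be committed only when `verify_plan` returns `True`; otherwise control would stay with the reactive policy.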

2606.16993 2026-06-16 cs.CV New submission

DreamX-World 1.0: A General-Purpose Interactive World Model

DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou, Bingjie Gao, Qiwen Gu, Siyu Hong, Jiachen Lei, Geng Li, Jifan Li, Ruimin Lin, Qingfeng Shi, Bingze Song, Lei Sun, Jing Tang, Ruitian Tian, Jun Wang, Jiahong Wu, Pengfei Zhang, Shen Zhang, Jiashu Zhu

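AI summary: Proposes DreamX-World 1.0, a general-purpose interactive text/image-to-video world model that achieves controllable long-horizon generation through E-PRoPE camera control, causal-forcing autoregressive generation, memory-conditioned scene persistence, and event instruction tuning, surpassing existing methods on several metrics.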

Comments Project page: https://amap-ml.github.io/DreamX_World, Code: https://github.com/AMAP-ML/DreamX-World

Abstract

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.

2606.16939 2026-06-16 cs.LG cs.AI 新提交

Scalable Circuit Learning for Interpreting Large Language Models

可扩展的电路学习用于解释大型语言模型

Naiyu Yin, Dennis Wei, Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Yue Yu

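AI summary Proposes CircuitLasso, a method based on sparse linear regression that efficiently learns sparse circuits in LLMs with SAE features as units, preserving structural accuracy while greatly reducing computational cost, and reveals how semantic features propagate.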

Comments Accepted to the Mechanistic Interpretability Workshop at ICML 2026

详情
AI中文摘要

机械可解释性中的一个重要研究方向是学习LLM组件上的稀疏电路,以揭示它们如何共同产生模型行为。然而,原始神经元具有多语义性,使得学习到的电路难以解释。稀疏自编码器(SAE)特征缓解了这一问题,但其高维度使得现有的基于干预的电路学习方法在计算上变得不可行。我们提出了CircuitLasso,一种基于稀疏线性回归的可扩展电路学习方法。CircuitLasso恢复的电路在基准数据上的结构准确性与最先进的基于干预的方法相匹配,而计算成本仅为后者的一小部分。为了可解释性,CircuitLasso高效地揭示了SAE特征之间的关系,展示了人类可解释的语义特征如何通过模型传播并影响其预测。最后,我们通过利用所学电路的见解,在领域泛化任务上以显著更低的成本实现了相当的性能,从而验证了所学电路的实用性。

英文摘要

A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but their high dimensionality makes existing intervention-based circuit learning methods computationally prohibitive. We propose CircuitLasso, a scalable circuit-learning approach based on sparse linear regression. CircuitLasso recovers circuits whose structural accuracy matches that of state-of-the-art intervention-based methods on the benchmark data, at a fraction of the computational cost. For interpretability, CircuitLasso efficiently uncovers relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. Finally, we validate the utility of our learned circuits by leveraging their insights to achieve comparable performance at substantially lower cost on a domain-generalization task.
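The abstract describes circuit learning as sparse linear regression over SAE features. A minimal sketch of that general idea follows, assuming each downstream feature is regressed on the upstream features with an L1 penalty and that nonzero weights are read as candidate edges. The function names `lasso_ista` and `learn_edges`, the penalty `lam`, the ISTA solver, and the toy data are all illustrative assumptions, not CircuitLasso's actual implementation.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||y - Xw||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size 1/L, where L = sigma_max(X)^2 / n bounds the gradient's Lipschitz constant.
    step = n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        z = w - step * grad
        # Soft-thresholding sets small weights exactly to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

def learn_edges(upstream, downstream, lam=0.2, tol=1e-6):
    """Map each downstream feature index to the upstream indices with nonzero weight."""
    edges = {}
    for j in range(downstream.shape[1]):
        w = lasso_ista(upstream, downstream[:, j], lam)
        edges[j] = [int(i) for i in np.flatnonzero(np.abs(w) > tol)]
    return edges

# Toy data (hypothetical): downstream feature 0 depends only on upstream features 3 and 7.
rng = np.random.default_rng(0)
upstream = rng.normal(size=(400, 20))
signal = 2.0 * upstream[:, 3] - 1.5 * upstream[:, 7]
downstream = signal[:, None] + 0.01 * rng.normal(size=(400, 1))

edges = learn_edges(upstream, downstream, lam=0.2)
print(edges)
```

Unlike intervention-based approaches, a regression of this kind needs only recorded activations, which is one plausible reading of the cost reduction the abstract reports.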

2606.16934 2026-06-16 cs.CL cs.LG 新提交

Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

探索代码解释器有效推理的外在属性与内在属性

Patomporn Payoungkhamdee, Napat Laosaengpha, Jenta Wonglertsakul, Pittawat Taveekitworachai, Pume Tuchinda, Panjapong Poobanchuen, Ekapol Chuangsuwanich, Can Udomcharoenchaikit, Samuel Cahyawijaya, Peerat Limkonchotiwat, Sarana Nutanong

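AI summary Studies code-interpreter reasoning from two perspectives, extrinsic properties (crucial tokens) and intrinsic properties (code-specific cognitive behaviors); finds that stronger models exhibit crucial tokens and behaviors such as verification and backtracking more often, and uses these properties to improve performance at inference and training time.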

详情
AI中文摘要

使用代码解释器(CI)进行推理已成为一种有效范式,通过可执行计算和迭代验证增强大型语言模型(LLM)的推理能力。尽管其应用日益广泛,但有效代码推理的行为属性仍未被充分探索。在本工作中,我们受自然语言推理研究的启发,从两个不同视角研究代码推理:外在属性(由关键token表示)和内在属性(由代码特定的认知行为表示)。在多个LLM上,我们发现更强的CI推理模型一致地表现出更高比例的关键token和认知行为,特别是验证、回溯和反向链。基于这些观察,我们研究了如何在推理和训练期间利用这些属性。在推理时,附加代码特定的关键token在数学、排序和优化等若干推理能力上提升了性能,但在其他方面收益有限。在训练时,用代码特定的认知行为增强最先进的框架,在三个评估模型中的两个上提升了监督微调和强化学习性能。进一步分析表明,这些行为减少了错误回答中的过度思考,提高了token效率,同时也揭示了限制某个模型收益的因素。我们的发现首次系统性地描述了有效CI推理的特征,并展示了利用关键属性改进CI推理的潜力和局限性。

英文摘要

Reasoning with a Code Interpreter (CI) has emerged as an effective paradigm for enhancing the reasoning capabilities of large language models (LLMs) through executable computation and iterative verification. Despite its growing adoption, the behavioral properties underlying effective code reasoning remain largely underexplored. In this work, we investigate code reasoning from two distinct perspectives inspired by prior studies of natural language reasoning: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors. Across multiple LLMs, we find that stronger CI reasoning models consistently exhibit a higher prevalence of crucial tokens and cognitive behaviors, particularly verification, backtracking, and backward chaining. Building on these observations, we examine how these properties can be leveraged during both inference and training. At inference time, appending code-specific crucial tokens improves performance on several reasoning capabilities, including mathematical reasoning, ordering, and optimization, while yielding limited benefits elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, while also revealing factors that limit gains in one model. Our findings provide the first systematic characterization of effective reasoning with CI and demonstrate both the potential and limitations of leveraging key properties to improve CI-based reasoning.

2606.16917 2026-06-16 cs.RO 新提交

Unified Motion-Action Modeling for Heterogeneous Robot Learning

统一运动-动作建模用于异构机器人学习

Yunhao Cao, Shitong Liu, Chao Feng, Meryl Zhang, Xuanchen Lu, Andrew Owens, Kuan Fang

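AI summary Proposes the UMA model, which uses 3D object motion trajectories as a shared interface and a masked generative objective to unify visuomotor control and dynamics modeling, enabling multi-task pretraining across heterogeneous data sources and supporting multiple inference modes at deployment.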

Comments https://uma-manipulation.github.io/

详情
AI中文摘要

我们提出了统一运动-动作(UMA)模型,该方法使用3D物体运动轨迹作为共享接口,以桥接视觉运动控制和动力学建模。UMA将物体运动和机器人动作视为在掩码生成目标下共同演化的变量,其中掩码模式决定了预训练期间的监督机制和部署时的推理模式。通过使用事后重标记的运动上下文和对比目标(将任务意图与场景几何解耦),UMA能够在无需手动标注任务指令的情况下,跨异构数据源进行多任务预训练。在部署时,相同的预训练参数支持运动条件视觉运动控制、基于运动的动力学建模以及从少量示范中进行的任务适应。在机器人演示、人类视频和模拟数据的混合数据集上预训练后,UMA在每种推理模式下均持续优于专门针对该模式的最先进基线。

英文摘要

We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.

2606.16902 2026-06-16 cs.RO cs.AI 新提交

Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models

基于开放视觉语言模型的空间问答与导航的二值追踪

Dongbin Na, Chanwoo Kim, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong

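AI summary Proposes BinTrack, a fully open-source spatial-localization agent that performs a binary search over trajectory segments, improving accuracy by up to 22.8% over other open-source implementations on the SpaceLocQA benchmark with more than a 1.5x inference speedup, and releases GangnamLoop, a multi-trip outdoor dataset.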

Comments 21 pages, 4 figures, 15 tables. Project page: https://ndb796.github.io/BinaryTracking ; Code and dataset: https://github.com/ndb796/BinaryTracking

详情
AI中文摘要

本工作针对服务机器人在长距离自我中心路线上的空间问答问题。给定诸如“在回家的路上哪里可以找到干洗店?”的查询,系统返回一个度量坐标,下游导航组件可以据此行动。先前的空间问答方法利用基于闭源模型(如GPT-4o)的检索增强代理进行路径探索。然而,在现实世界中运行的机器人通常无法可靠地依赖在线闭源模型,因为网络不稳定、通信延迟和部署成本。这需要能够在机器人上运行的开源空间问答方法,但先前在这方面的研究仍然有限。本工作提出BinTrack,一种简单而有效的全开源空间定位代理,它利用机器人轨迹的时间顺序。BinTrack对查询中识别的两个锚点地标之间的轨迹段进行二值搜索。与其他开源实现相比,它将整体准确率提高了22.8%,甚至在SpaceLocQA基准的全局类别上匹配了报告的闭源模型结果,这是迄今为止需要强大推理代理(如GPT-4o)的最具挑战性的设置。此外,其优化的推理策略始终比先前方法提供超过1.5倍的推理加速。最后,本工作发布了GangnamLoop,这是一个新颖且实用的多行程室外基准,通过在实际公共街道上部署真实四足机器人并采用匿名化策略收集而成。它在不同室外条件下重新访问相同位置,并将机器人的低视角与人类主人的视角配对。源代码和数据集可在https://github.com/ndb796/BinaryTracking公开获取。

英文摘要

This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior Spatial Question Answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. This creates a need for open-source Spatial Question Answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting that has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets under an anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source code and datasets are publicly available at https://github.com/ndb796/BinaryTracking
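The binary search over a time-ordered trajectory can be sketched as follows. This is only an illustration of the search pattern: the abstract does not say how BinTrack decides which half to keep, so the `passed_target` oracle (a stand-in for a model judgment), its assumed monotonicity along the route, and the names `binary_track`, `lo_anchor`, and `hi_anchor` are all hypothetical.

```python
from typing import Callable, List, Tuple

Pose = Tuple[float, float]  # metric (x, y) coordinate of one trajectory frame

def binary_track(
    poses: List[Pose],
    lo_anchor: int,
    hi_anchor: int,
    passed_target: Callable[[int], bool],
) -> Tuple[int, Pose]:
    """Find the first frame between two anchors at which the target has been reached.

    Assumes passed_target is False before the target, True from the target onward,
    and True at hi_anchor. Each call stands in for one (expensive) model query.
    """
    lo, hi = lo_anchor, hi_anchor
    while lo < hi:
        mid = (lo + hi) // 2
        if passed_target(mid):
            hi = mid        # target is at mid or earlier
        else:
            lo = mid + 1    # target is strictly after mid
    return lo, poses[lo]

# Toy route (hypothetical): 100 frames along a straight line, target seen at frame 37.
poses = [(float(i), 0.0) for i in range(100)]
calls = []

def oracle(i: int) -> bool:
    calls.append(i)
    return i >= 37

index, coordinate = binary_track(poses, lo_anchor=10, hi_anchor=90, passed_target=oracle)
print(index, coordinate, len(calls))
```

The appeal of the pattern is the query count: an 81-frame segment needs at most 7 oracle calls, where a linear scan could need up to 81.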

2606.16878 2026-06-16 cs.LG 新提交

Integrated Marketing Attribution: A Bayesian Framework for Privacy-Safe Granular Measurement Anchored in MMM

集成营销归因:基于贝叶斯框架的隐私安全粒度测量,锚定于MMM

Meghana R. Bhat, Ankit Umare, Utsav Aggarwal, Richard Vecsler, Arunkumar Mani, Karthik Nair, Chandhu Nair

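AI summary Proposes the Integrated Marketing Attribution (IMA) framework, which combines Marketing Mix Modeling (MMM) with Bayesian attribution models to derive campaign-level effects from aggregated data, giving privacy-safe, granular attribution.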

详情
AI中文摘要

零售营销测量日益需要精细的活动级洞察,而无需依赖用户级跟踪。然而,两种主流方法——营销组合模型(MMM)和多触点归因(MTA)——常常产生碎片化的洞察。MMM在渠道级规划中隐私安全且稳健,但对于活动优化过于粗糙;而MTA提供精细归因,但在日益增加的隐私限制下变得不太可靠。我们提出集成营销归因(IMA),一个统一框架,将MMM与特定渠道的贝叶斯归因模型相结合,从聚合数据中推导活动级效果。通过利用MMM信息先验,IMA提供精细、隐私安全的归因,同时保持与MMM的一致性。

英文摘要

Retail marketing measurement increasingly requires granular campaign-level insights without relying on user-level tracking. However, the two dominant approaches, Marketing Mix Modeling (MMM) and Multi-Touch Attribution (MTA), often produce fragmented insights. MMM is privacy-safe and robust for channel-level planning but is too coarse for campaign optimization, while MTA provides granular attribution but has become less reliable under increasing privacy restrictions. We propose Integrated Marketing Attribution (IMA), a unified framework that combines MMM with channel-specific Bayesian attribution models to derive campaign-level effects from aggregated data. By leveraging MMM-informed priors, IMA delivers granular, privacy-safe attribution while preserving consistency with MMM.

2606.16856 2026-06-16 cs.RO 新提交

Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning

基于视频的最优传输用于反馈高效的离线偏好强化学习

Tung M. Luu, Hwanhee Kim, Younghwan Lee, Chang D. Yoo

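AI summary Proposes the VOTP framework, which uses video foundation models and optimal transport to generate pseudo-labels, learning effective reward functions from only a small amount of human feedback and substantially reducing labeling cost.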

Comments ICML 2026 (Oral)

详情
AI中文摘要

向强化学习智能体传达复杂目标通常需要精心的奖励工程。偏好强化学习(PbRL)通过从人类反馈中学习奖励函数提供了一种有前景的替代方案,但其可扩展性受到高标注成本的阻碍。受视频基础模型(ViFMs)进展的启发,我们提出了基于视频的最优传输偏好(VOTP),这是一个半监督框架,仅需少量标签即可学习有效的奖励函数。通过利用最优传输在ViFMs的丰富表示空间中对齐视觉轨迹,VOTP有效地为大量未标注数据生成高保真伪标签,大幅减少了人类监督。在运动控制和操作基准上的大量实验证明了VOTP的优越性,在有限的反馈预算下,其性能优于最先进的离线PbRL方法。我们还展示了VOTP在视觉干扰存在时的鲁棒性,并在真实机器人任务上验证了其实用性,其中它以最少的人类输入学习了有意义的奖励。

英文摘要

Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.
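To make the optimal-transport alignment step concrete, here is a minimal sketch that compares two trajectories of per-frame embeddings with entropic OT (Sinkhorn iterations). It assumes uniform frame weights, a cosine cost, and random vectors in place of real ViFM features; the abstract does not specify how VOTP turns alignments into pseudo-labels, so that step is deliberately left out. All function names and parameter values are illustrative.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, eps=0.1, n_iter=200):
    """Entropic OT plan between uniform distributions over the two frame sets."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def ot_distance(x, y, eps=0.1):
    """Transport cost of aligning trajectory x to trajectory y."""
    cost = cosine_cost(x, y)
    return float(np.sum(sinkhorn_plan(cost, eps) * cost))

# Toy embeddings (hypothetical): 8 frames, 16 dimensions each.
rng = np.random.default_rng(0)
reference = rng.normal(size=(8, 16))
similar = reference + 0.05 * rng.normal(size=(8, 16))
different = rng.normal(size=(8, 16))

plan = sinkhorn_plan(cosine_cost(reference, similar))
d_similar = ot_distance(reference, similar)
d_different = ot_distance(reference, different)
print(d_similar, d_different)
```

A trajectory that nearly repeats the reference gets a much lower transport cost than an unrelated one, which is the kind of signal a pseudo-labeling rule could build on.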

2606.16813 2026-06-16 cs.AI 新提交

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

GIST-CMTF:LLM代理中因果最小工具过滤的目标状态推断

Rahul Suresh Babu, Rohit Shukla

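AI summary Proposes GIST-CMTF, a layer that predicts candidate symbolic goal states and estimates ambiguity to address wrong-goal execution caused by ambiguous user requests in tool-augmented LLM agents, reaching 97.0% task success on 120 tasks.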

详情
AI中文摘要

工具增强的LLM代理依赖运行时过滤来决定每个步骤中哪些工具应可见。因果最小工具过滤(CMTF)通过仅暴露下一个因果必要的工具前沿来减少工具选择混淆,但它假设用户请求已映射到符号目标状态。实际上,诸如“处理我的预约”或“处理这封邮件”之类的请求可能对应多个可能的目标。这会导致错误目标执行,即代理为意外目标遵循有效的因果工具路径。我们引入GIST-CMTF,一个目标状态推断层,它预测在CMTF使用的相同状态转换词汇上的候选符号目标,估计歧义性,并要么应用CMTF,要么将澄清暴露为产生缺失目标或状态变量的因果动作。我们在七个模型后端、六种过滤方法和120个受控工具使用任务上评估GIST-CMTF。GIST-CMTF实现了97.0%的任务成功率,而top-goal CMTF为80.1%,semantic-goal CMTF为82.9%。它将错误目标执行从top-goal CMTF下的19.4%降低到2.5%,同时保留了因果过滤的单工具暴露,并且使用的令牌数远少于全工具暴露。这些结果表明,可靠的工具增强代理在暴露外部动作之前应验证目标状态,而不仅仅是工具相关性。

英文摘要

Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier, but it assumes that the user request has already been mapped to a symbolic goal state. In practice, requests such as "handle my appointment" or "take care of this email" may correspond to multiple possible goals. This creates wrong-goal execution, where an agent follows a valid causal tool path for an unintended objective. We introduce GIST-CMTF, a goal-state inference layer that predicts candidate symbolic goals over the same state-transition vocabulary used by CMTF, estimates ambiguity, and either applies CMTF or exposes clarification as a causal action that produces missing goal or state variables. We evaluate GIST-CMTF across seven model backends, six filtering methods, and 120 controlled tool-use tasks. GIST-CMTF achieves 97.0% task success, compared with 80.1% for top-goal CMTF and 82.9% for semantic-goal CMTF. It reduces wrong-goal execution from 19.4% under top-goal CMTF to 2.5%, while preserving the one-tool exposure of causal filtering and using substantially fewer tokens than all-tools exposure. These results suggest that reliable tool-augmented agents should validate goal state, not only tool relevance, before exposing external actions.
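The decision the abstract describes (infer candidate goals, estimate ambiguity, then either filter tools or ask for clarification) can be sketched as a small gate. Everything specific here is an assumption made for illustration: the normalised-entropy ambiguity score, the 0.5 threshold, the `ask_clarification` action name, and the toy appointment plans are not taken from the paper.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

def ambiguity(goal_probs: Dict[str, float]) -> float:
    """Normalised Shannon entropy of the goal distribution, in [0, 1]."""
    total = sum(goal_probs.values())
    ps = [p / total for p in goal_probs.values() if p > 0]
    if len(ps) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in ps)
    return entropy / math.log(len(ps))

def next_exposure(
    goal_probs: Dict[str, float],
    state: Set[str],
    plans: Dict[str, List[Tuple[str, str]]],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the single action to expose at this step.

    plans maps a goal to an ordered list of (tool, state variable the tool produces).
    If the goal is too ambiguous, clarification is exposed instead of any tool.
    """
    if ambiguity(goal_probs) > threshold:
        return "ask_clarification"
    goal = max(goal_probs, key=goal_probs.get)
    for tool, produces in plans[goal]:
        if produces not in state:
            return tool  # first tool whose output is still missing
    return None  # goal state already reached

# Hypothetical plans for the request "handle my appointment".
plans = {
    "reschedule": [("lookup_appointment", "appointment_id"),
                   ("find_slot", "new_slot"),
                   ("book_slot", "confirmation")],
    "cancel": [("lookup_appointment", "appointment_id"),
               ("cancel_appointment", "cancellation")],
}
state = {"appointment_id"}

clear = next_exposure({"reschedule": 0.95, "cancel": 0.05}, state, plans)
unclear = next_exposure({"reschedule": 0.5, "cancel": 0.5}, state, plans)
print(clear, unclear)
```

In this toy setup a confident goal exposes exactly one tool, while an even split between goals exposes only the clarification action, so no external tool can run for the wrong objective.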

2606.16807 2026-06-16 cs.CL 新提交

Connecting Speech to Words through Images

通过图像连接语音与文字

Gabriel Pirlogeanu, Dan Oneata, Horia Cucu, Herman Kamper

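AI summary Proposes a visually grounded method that builds a vocabulary of spoken words from images and their spoken descriptions without text supervision, outperforming a neural baseline in spoken word retrieval and keyword spotting.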

Comments Accepted at EUSIPCO 2026 - 5 pages, 3 figures, 2 tables

详情
AI中文摘要

在没有明确文本监督的情况下,我们如何学习书面单词与其口语对应词之间的映射?我们提出了一种基于视觉的方法,仅使用图像及其口语描述来构建口语词汇表。首先,图像字幕系统用于构建代表图像中显著视觉概念的书面词汇表。对于每个单词,我们找到其图像字幕包含该单词的话语。然后,我们使用无监督词发现技术对齐这些话语,以定位目标单词的实例。结果是口语单词片段与书面单词相关联——所有这些都在没有任何文本监督的情况下完成。在口语单词检索和关键词检测实验中,所提出的方法在更具可解释性的同时,优于强大的神经基线。这些结果证明了该方法在英语中的可行性,并激励了未来在缺乏转录的低资源语言上的工作。

英文摘要

How can we learn the mapping between written words and their spoken counterparts in the absence of explicit textual supervision? We present a visually grounded method for building a vocabulary of spoken words using only images and their spoken descriptions. First, image captioning systems are used to build a vocabulary of written words representing salient visual concepts in the images. For each word, we then find utterances whose image captions contain that word. Then we use an unsupervised word discovery technique to align these utterances to locate instances of the target word. The result is spoken word segments that are linked to written words -- all accomplished without any text supervision. In spoken word retrieval and keyword spotting experiments, the proposed approach outperforms a strong neural baseline while being more interpretable. These results demonstrate the feasibility of the approach in English and motivate future work on low-resource languages without transcripts.

2606.16799 2026-06-16 cs.CV cs.AI 新提交

Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment

解耦语义与失真:面向AI生成图像质量评估的多尺度双流视觉-语言对齐

Zijie Meng

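AI summary Proposes MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling, setting new state-of-the-art results on five benchmarks with average SRCC gains of 1.11% on quality and 2.35% on text-image correspondence.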

Comments 11 pages, 2 figures Accepted by ICME2026(spotlight)

详情
AI中文摘要

现有的基于视觉-语言模型(VLM)的AI生成图像质量评估(AIGIQA)方法存在根本性的语义-失真维度冲突:为语义区分优化的单一表示在本质上将组成性理解与低层感知敏感性纠缠在一起,使其对细粒度质量退化视而不见。我们提出MST-CLIPIQA,一种多尺度双流框架,通过显式表示解耦实现层次化视觉-语言对齐。我们的架构利用具有互补补丁粒度的双CLIP编码器:粗粒度流捕获全局语义连贯性,而细粒度流保留纹理特征和伪影模式。一种受信息瓶颈启发的门控融合机制执行自适应跨尺度蒸馏,当生成提示可用时,可选交叉注意力实现基于提示的对应评估。在五个基准上的广泛实验建立了新的最先进结果,在质量预测上实现平均SRCC提升1.11%,在文本-图像对应预测上提升2.35%,同时仅需0.8M可训练参数即可保持效率。我们的项目可在https://github.com/YMlinfeng/MST-CLIPIQA获取。

英文摘要

Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at https://github.com/YMlinfeng/MST-CLIPIQA.

2606.16794 2026-06-16 cs.CV 新提交

LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

基于LLM的视觉解释评估框架:用于评估面部皮肤病分类模型的可解释性

Gyuyeon Na

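AI summary Proposes an LLM-based visual explanation evaluation framework that uses progressive prompt engineering to assess the quality of Grad-CAM explanations in facial skin disease diagnosis models, focusing on lesion localization and trustworthiness.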

详情
AI中文摘要

本研究提出了一个特定领域的基于LLM的视觉解释评估框架,用于评估面部皮肤病诊断模型中Grad-CAM解释的质量。以往研究主要关注通过数据增强技术提升分类性能,而较少系统性地检验模型解释是否基于临床相关的病变区域。在本研究中,对基于EfficientNet-B0、MobileNetV3和ResNet18的面部皮肤病分类模型应用了几何增强、颜色增强和混合增强策略。采用Grad-CAM生成代表模型决策过程的视觉解释。此外,利用GPT-5.5、Gemini 3.5 Flash和Claude Sonnet 4.6设计了LLM-as-a-Judge评估框架,从病变定位和解释可信度两个角度评估Grad-CAM解释。为提高评估一致性和临床基础,引入了渐进式提示工程策略,包含评估准则、临床知识、惩罚规则和结构化输出格式。

英文摘要

This study proposes a domain-specific LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. While previous studies have primarily focused on improving classification performance through data augmentation techniques, relatively few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. In this study, geometric augmentation, color-based augmentation, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18. Grad-CAM was employed to generate visual explanations representing the models' decision-making processes. Furthermore, an LLM-as-a-Judge evaluation framework was designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess Grad-CAM explanations from the perspectives of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced, incorporating evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.

2606.16765 2026-06-16 cs.LG physics.flu-dyn 新提交

A Validated LBM Dataset and Pipeline for Surrogate Modeling of Turbulent 3D Obstructed Channel Flows

一个经过验证的LBM数据集和用于湍流三维阻塞通道流代理建模的流水线

Lukas Schröder, Shubham Kavane, Harald Köstler

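AI summary Presents a reproducible pipeline that generates training data for 3D channel flows at Reynolds numbers 1,000-10,000 using a lattice Boltzmann solver with cumulant collision operators, validated against experimental measurements and grid convergence studies, as a basis for standardized comparison of neural operators.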

Comments 4 pages + appendix, 9 figures, Accepted at the 1st Workshop on Differentiable Systems and Scientific Machine Learning (SysDiff) @ EurIPS 2025, OpenReview: https://openreview.net/forum?id=rdmHT72NQH

详情
AI中文摘要

评估三维湍流的神经算子需要经过验证的数据集和物理基准。我们提出了一个可复现的流水线,用于生成在雷诺数1000-10000范围内、围绕生成几何体的三维通道流的训练数据。我们的格子玻尔兹曼求解器采用累积碰撞算子,并通过实验测量(斯特劳哈尔数、阻力系数、湍流波动)进行了严格验证,在1024x512x512分辨率下进行了全面的网格收敛研究。基于已建立的框架,这个经过验证的流水线能够实现代理模型的标准化比较。我们概述了计划中的系统评估,包括傅里叶神经算子与U-Net变体在预测、超分辨率和误差校正任务上的表现,并使用物理信息度量来评估湍流能量级联的表示。未来的工作将比较数值求解器和神经代理之间的计算效率,探索实际应用。我们寻求社区对我们验证方法、计划中的基准方法论以及湍流中神经算子评估优先级的反馈。

英文摘要

Evaluating neural operators for 3D turbulent flow requires validated datasets with physical benchmarks. We present a reproducible pipeline generating training data for 3D channel flows around generated geometries at Re=1,000-10,000. Our lattice Boltzmann solver with cumulant collision operators is rigorously verified against experimental measurements (Strouhal number, drag coefficients, turbulent fluctuations) with comprehensive grid convergence studies at resolution 1024x512x512. Building upon an established framework, this validated pipeline enables standardized surrogate model comparison. We outline planned systematic evaluation of Fourier Neural Operator and U-Net variants on forecasting, super-resolution, and error correction tasks, using physics-informed metrics to assess turbulent energy cascade representation. Future work will compare computational efficiency between numerical solvers and neural surrogates, exploring practical application. We seek community feedback on our validation approach, planned benchmark methodology, and evaluation priorities for neural operators in turbulent flows.
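Two of the quantities named above have standard definitions: the Reynolds number Re = U L / nu and the Strouhal number St = f L / U, where f is the dominant shedding frequency. The sketch below estimates St from a time series by taking the peak of its FFT spectrum. It is not the authors' validation code; the synthetic lift signal, sampling step, and function names are assumptions for illustration.

```python
import numpy as np

def reynolds_number(velocity, length, kinematic_viscosity):
    """Re = U * L / nu."""
    return velocity * length / kinematic_viscosity

def strouhal_number(signal, dt, length, velocity):
    """St = f * L / U, with f taken as the dominant frequency of the signal."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=dt)
    # Skip the zero-frequency bin so a residual offset cannot win the argmax.
    f_peak = freqs[1:][np.argmax(spectrum[1:])]
    return float(f_peak * length / velocity)

# Synthetic lift-coefficient signal (hypothetical): shedding at 0.2 Hz plus a weaker harmonic.
dt = 0.05
t = np.arange(4000) * dt
lift = np.sin(2 * np.pi * 0.2 * t) + 0.1 * np.sin(2 * np.pi * 0.6 * t)

st = strouhal_number(lift, dt, length=1.0, velocity=1.0)
reynolds = reynolds_number(velocity=1.0, length=1.0, kinematic_viscosity=1e-3)
print(st, reynolds)
```

The frequency resolution of this estimate is 1 / (N * dt), here 0.005 Hz, so a real validation run needs a record long enough to resolve the shedding frequency to the accuracy being claimed.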

2606.16729 2026-06-16 cs.LG math.OC 新提交

Learning Policy from a Single Trajectory in Average-Reward Markov Decision Process

从平均奖励马尔可夫决策过程中的单条轨迹学习策略

Jongmin Lee, Ernest K. Ryu, Vaneet Aggarwal

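AI summary Establishes the first finite-sample complexity guarantees from a single trajectory for weakly communicating average-reward MDPs, proposing model-free methods whose value-based and policy-based variants achieve sample complexities of $\widetilde{O}(1/\varepsilon^2)$ and $\widetilde{O}(1/\varepsilon^4)$, respectively.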

详情
AI中文摘要

尽管已有大量工作刻画了折扣累积奖励MDP的样本复杂度,但平均奖励MDP的有限样本分析仍然有限,且大多数现有工作依赖于遍历性或生成模型访问等限制性假设。在这项工作中,我们首次为弱通信平均奖励MDP从单条轨迹建立了有限样本复杂度保证。为此,我们研究了弱通信MDP中单条轨迹的动力学,并基于此分析,开发了新颖的无模型方法。值得注意的是,我们的基于值函数和基于策略的方法在弱通信MDP中从单条轨迹分别提供了$\widetilde{O}(1/\varepsilon^2)$和$\widetilde{O}(1/\varepsilon^4)$的有限样本复杂度保证。此外,我们引入了第一个无需问题相关参数先验知识的通信MDP无模型方法。

英文摘要

While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assumptions such as ergodicity or access to a generative model. In this work, we establish the first finite sample complexity guarantees from a single trajectory for weakly communicating average-reward MDPs. To this end, we study the dynamics of a single trajectory in weakly communicating MDPs and based on this analysis, we develop novel model-free methods. Notably, our value-based and policy-based methods provide finite sample complexity guarantees of $\widetilde{O}(1/\varepsilon^2)$ and $\widetilde{O}(1/\varepsilon^4)$ from a single trajectory in weakly communicating MDPs, respectively. Furthermore, we introduce the first model-free method that requires no prior knowledge of problem-dependent quantities for communicating MDPs.

2606.16723 2026-06-16 cs.AI 新提交

AgentFairBench: Do LLM Agents Discriminate When They Act?

AgentFairBench: LLM智能体在行动时是否存在歧视?

Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh, Manmeet Singh Kapoor

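AI summary Proposes the AgentFairBench benchmark, which uses counterfactual matched sets and the Bias Conduction Framework to evaluate the fairness of LLM agent actions in hiring, lending, and medical triage; finds that mismatched statistic arity overstates disparity, and that against an arity-matched null Claude Haiku 4.5 shows no demographic effect above sampling noise.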

Comments Submitted to IEEE Access

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地采取行动(筛选申请人、推荐信贷、分诊患者),但LLM的公平性仍通过评分答案来衡量。我们引入AgentFairBench,一个廉价、可复现、多领域的基准,用于评估LLM智能体行动中的人口统计差异。基于配套框架——偏差传导框架(BCF,在此重述),它涵盖三个监管锚定的领域:招聘、贷款和医疗分诊。在四种递增代理能力的智能体框架(直接、思维链、多智能体协商、工具增强)下,使用合成的人口统计中性档案,在仅改变姓名编码的种族×性别信号的反事实匹配集中进行评估(遵循Bertrand Mullainathan传统)。一个仅依赖NumPy的测试工具计算反事实翻转率、平均绝对分数差异(MASD)、行动率差异和工具调用差异,并提供自助置信区间、配对检验和错误发现率控制,每个模型的成本仅为个位数美元。一个包含保留私有分割和污染金丝雀的实时排行榜接受外部模型提交。我们的试点研究(864个决策加上重测复现)带来了一个方法论教训:将六组分数分布与两次运行的噪声差异进行比较,仅通过统计量级就会将差异夸大约2.4倍。在匹配量级的噪声基底和综合组检验下,Claude Haiku 4.5未显示出高于采样噪声的人口统计效应(120个成对对比中0个和9个综合对比中0个通过校正);植入偏差测试证实该工具能检测到存在的差异。贡献在于一个健全、敏感、可采用的工具、量级匹配的零假设方法以及可扩展的开源工件。代码、数据和测试工具以开放许可证发布,并附有匿名评审工件。

英文摘要

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand-Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by about 2.4x through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
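Two of the metrics named in the abstract can be sketched in NumPy, together with a percentile bootstrap over matched sets. The definitions used here are assumptions, since the abstract does not spell them out: the flip rate is taken as the share of matched sets whose decisions are not unanimous across demographic variants, and MASD as the mean absolute pairwise score difference within a set. The released harness may define them differently.

```python
from itertools import combinations

import numpy as np

def flip_rate(decisions):
    """Share of matched sets in which at least one variant gets a different action.

    decisions: array of shape (n_sets, n_variants) holding 0/1 actions.
    """
    d = np.asarray(decisions)
    return float(np.mean(d.max(axis=1) != d.min(axis=1)))

def masd(scores):
    """Mean absolute score difference over all variant pairs and matched sets."""
    s = np.asarray(scores, dtype=float)
    pairs = combinations(range(s.shape[1]), 2)
    diffs = [np.abs(s[:, i] - s[:, j]) for i, j in pairs]
    return float(np.mean(diffs))

def bootstrap_ci(stat, data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole matched sets (rows)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    values = [stat(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    lo, hi = np.quantile(values, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Toy matched sets (hypothetical): 4 profiles, 3 name-coded variants each.
decisions = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [0, 0, 0],
                      [1, 1, 0]])
scores = np.array([[1.0, 2.0, 3.0],
                   [2.0, 2.0, 2.0]])

fr = flip_rate(decisions)
m = masd(scores)
ci_low, ci_high = bootstrap_ci(flip_rate, decisions)
print(fr, m, (ci_low, ci_high))
```

Resampling rows rather than individual decisions keeps each matched set intact, which matters because the variants within a set share the same underlying profile.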