arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.20958 2026-06-11 cs.RO cs.AI 版本更新

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

基于EKF的深度相机与深度学习融合用于搜救任务中无人机-人员距离估计与跟随

Luka Šiktar, Branimir Ćaran, Bojan Šekoranja, Marko Švaco

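Affiliations: University of Rijeka

AI summary: Proposes an EKF method that fuses depth-camera measurements with monocular camera-to-body distance estimates, using YOLO-pose for real-time fusion, to improve the accuracy and robustness of distance estimation for UAV following; errors fall by up to 15.3% across three test scenarios.
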
Comments
This work has been submitted to the IEEE for possible publication
AI中文摘要

基于视觉的无人机框架通过检测和识别特定个体,然后跟踪并跟随它们,同时保持安全距离,来辅助人类搜索任务。无人机跟随的一个关键安全要求是在现实条件下准确估计相机与目标物体之间的距离,这通过融合多种图像模态来实现。作为使用深度学习进行自动人员检测和面部识别系统的一部分,本文提出了融合深度相机测量和单目相机到人体距离估计的方法,以实现鲁棒的跟踪和跟随。使用YOLO-pose实现了基于深度学习的深度相机数据滤波和从单目相机估计相机到人体距离,从而利用扩展卡尔曼滤波算法实现深度信息的实时融合。所提出的子系统设计用于无人机,估计和测量深度相机与人体关键点之间的距离,以保持无人机与人类目标之间的安全距离。我们的系统提供了准确的距离估计,并已通过运动捕捉地面真值数据进行了验证。该系统已在室内实时测试,在三个测试场景中,距离估计的平均误差、均方根误差和标准差降低了高达15.3%。基于测试结果,基于EKF融合的方法通过减少深度相机最佳工作范围之外的误差,增加了深度检测范围。它还在具有挑战性的条件下(如反射和能见度差)显示出改进的鲁棒性和精度,使其适用于搜救任务。

英文摘要

Vision-based Unmanned Aerial Vehicle (UAV) frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. As part of the system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints, to maintain the safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average errors, RMSE and standard deviations of distance estimation by up to 15.3% in three tested scenarios. Based on the test results, the EKF fusion-based approach increases the depth detection range by reducing the errors outside the optimal depth camera working range. It also shows improved robustness and precision in challenging conditions, such as reflections and poor visibility, making it suitable for SAR.
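The fusion idea in this abstract can be illustrated with a minimal scalar EKF sketch. This is not the authors' implementation: the random-walk motion model, the pinhole relation between distance and keypoint pixel span, and the constants `FOCAL_PX` and `TORSO_M` are all assumptions chosen for illustration.

```python
# Hypothetical constants, not taken from the paper.
FOCAL_PX = 600.0   # assumed focal length in pixels
TORSO_M = 0.5      # assumed shoulder-to-hip length in metres


def ekf_step(d, P, z_depth, z_pixels, q=0.01, r_depth=0.04, r_pixels=25.0):
    """One predict/update cycle for a scalar distance state d (metres).

    z_depth  : depth-camera distance in metres, or None if unavailable
    z_pixels : keypoint span in pixels from the monocular image, or None
    """
    # Predict: random-walk motion model, so only the variance grows.
    P = P + q

    # Update 1: the depth camera observes the state directly, h(d) = d.
    if z_depth is not None:
        K = P / (P + r_depth)
        d = d + K * (z_depth - d)
        P = (1.0 - K) * P

    # Update 2: pinhole model h(d) = f * L / d is nonlinear in d,
    # so it is linearised around the current estimate (the EKF step).
    if z_pixels is not None:
        H = -FOCAL_PX * TORSO_M / d**2          # Jacobian dh/dd
        K = P * H / (H * P * H + r_pixels)
        d = d + K * (z_pixels - FOCAL_PX * TORSO_M / d)
        P = (1.0 - K * H) * P

    return d, P
```

Passing `None` for `z_depth` mimics frames where the depth camera is outside its working range, in which case the estimate is carried by the monocular measurement alone.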

2602.19718 2026-06-11 cs.SE cs.AI 版本更新

Carbon-Aware Governance Gates: An Architecture for Sustainable GenAI Development

碳感知治理门:可持续生成式AI开发的架构

Mateen A. Abbasi, Tommi J. Mikkonen, Petri J. Ihantola, Muhammad Waseem, Pekka Abrahamsson, Niko K. Mäkitalo

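Affiliations: University of Helsinki; Aalto University

AI summary: Addresses the added carbon footprint of generative AI in software development by proposing the Carbon-Aware Governance Gates architecture, which embeds carbon budgets, energy provenance, and sustainability-aware validation orchestration to reduce environmental impact.
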
Comments
5 pages, 1 figure. Preprint version under review
AI中文摘要

生成式AI在软件开发生命周期中的快速普及增加了计算需求,这可能提高开发活动的碳足迹。同时,组织越来越多地将治理机制嵌入到生成式AI辅助开发中,以支持信任、透明度和问责制。然而,这些治理机制引入了额外的计算负载,包括重复推理、再生循环和扩展的验证管道,增加了能源使用和生成式AI辅助开发的碳足迹。本文提出碳感知治理门(CAGG),一种架构扩展,将碳预算、能源溯源和可持续感知验证编排嵌入到人机治理层中。CAGG包含三个组件:(i)能源和碳溯源账本,(ii)碳预算管理器,以及(iii)绿色验证编排器,通过治理策略和可重用设计模式实现。

英文摘要

The rapid adoption of Generative AI (GenAI) in the software development life cycle (SDLC) increases computational demand, which can raise the carbon footprint of development activities. At the same time, organizations are increasingly embedding governance mechanisms into GenAI-assisted development to support trust, transparency, and accountability. However, these governance mechanisms introduce additional computational workloads, including repeated inference, regeneration cycles, and expanded validation pipelines, increasing energy use and the carbon footprint of GenAI-assisted development. This paper proposes Carbon-Aware Governance Gates (CAGG), an architectural extension that embeds carbon budgets, energy provenance, and sustainability-aware validation orchestration into human-AI governance layers. CAGG comprises three components: (i) an Energy and Carbon Provenance Ledger, (ii) a Carbon Budget Manager, and (iii) a Green Validation Orchestrator, operationalized through governance policies and reusable design patterns.

2602.19502 2026-06-11 cs.AI cs.LG 版本更新

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

人类引导的智能体AI用于多模态临床预测:来自AgentDS医疗基准的教训

Lalitha Pranathi Pulavarthy, Raajitha Muthyala, Aravind V Kuruvikkattil, Zhenan Yin, Rashmita Kudamala, Saptarshi Purkayastha

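Affiliations: University of California, Berkeley; University of Washington; Stanford University

AI summary: Achieves strong performance on multimodal clinical prediction tasks through human-guided agentic AI, and distills three generalizable lessons: domain-informed feature engineering, task-specific multimodal fusion, and clinically motivated model ensembling.
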
Comments
Presented at the Data Challenge track at the 14th IEEE International Conference on Healthcare Informatics (ICHI) 2026 on June 3, 2026
AI中文摘要

智能体AI系统越来越能够自主执行数据科学工作流程,但临床预测任务需要纯自动化方法难以提供的领域专业知识。我们研究了人类引导智能体AI如何改进多模态临床预测,展示了我们在所有三个AgentDS医疗基准挑战中的方法:30天再入院预测(Macro-F1 = 0.8986)、急诊科费用预测(MAE = $465.13)和出院准备评估(Macro-F1 = 0.7939)。在这些任务中,人类分析师在关键决策点指导智能体工作流程:来自临床笔记、扫描PDF账单收据和时间序列生命体征的多模态特征工程;任务适当的模型选择;以及临床信息验证策略。我们的方法在医疗领域总体排名第5,在出院准备任务中获得第3名。消融研究表明,人类引导决策在自动化基线之上累积增益达到+0.065 F1,其中多模态特征提取贡献了最大的单一改进(+0.041 F1)。我们提炼出三个可推广的经验:(1)每个流水线阶段的领域信息特征工程产生累积增益,优于广泛的自动搜索;(2)多模态数据集成需要任务特定的人类判断,没有单一提取策略能泛化到临床文本、PDF和时间序列;(3)具有临床动机模型配置的刻意集成多样性优于随机超参数搜索。这些发现为在需要可解释性、可重复性和临床有效性的医疗环境中部署智能体AI的团队提供了实用指导。

英文摘要

Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment, because no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.

2602.18291 2026-06-11 cs.AI 版本更新

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

扩散以协调:高效在线多智能体扩散策略

Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang

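Affiliations: University of Science and Technology of China

AI summary: Proposes OMAD, one of the first online off-policy multi-agent reinforcement learning frameworks to use diffusion policies; a relaxed policy objective maximizes scaled joint entropy for efficient exploration and coordination, improving sample efficiency by 2.5× to 5× on MPE and MAMuJoCo.
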
AI中文摘要

在线多智能体强化学习(MARL)是实现高效智能体协调的重要框架。关键在于增强策略表达能力以实现更优性能。基于扩散的生成模型在图像生成和离线设置中展现出卓越的表达能力和多模态表示,因此非常适合满足这一需求。然而,它们在在线MARL中的潜力尚未被充分探索。主要障碍是扩散模型的难以处理的似然性阻碍了基于熵的探索和协调。为应对这一挑战,我们提出了首批使用扩散策略的在线离线策略MARL框架之一(OMAD)来协调各智能体。我们的关键创新是采用松弛策略目标,最大化缩放联合熵,从而在无需可处理似然的情况下促进有效探索。此外,在集中训练与分散执行(CTDE)范式中,我们使用联合分布价值函数来优化分散扩散策略。它利用可处理的熵增强目标来指导扩散策略的同时更新,从而确保稳定协调。在MPE和MAMuJoCo上的广泛评估表明,我们的方法在10个不同任务上达到了新的最先进水平,样本效率显著提升了2.5至5倍。

英文摘要

Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose one of the first Online off-policy MARL frameworks using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.

2502.14894 2026-06-11 cs.CV cs.AI cs.CY cs.LG 版本更新

FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

聚焦污染:基于水文信息与噪声感知的地理空间PFAS测绘学习

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

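Affiliations: University of Michigan; Environmental Working Group; University of California, Davis

AI summary: Proposes the FOCUS framework, which combines sparse PFAS observations with environmental priors such as hydrological connectivity and trains robustly with a noise-aware loss, outperforming traditional methods for PFAS contamination mapping.
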
Comments
Best Paper Award at ICLR 2026 Machine Learning for Remote Sensing Workshop
AI中文摘要

全氟和多氟烷基物质(PFAS)是持久性环境污染物,对公共健康有显著影响,但由于现场采样的高成本和后勤挑战,大规模监测仍然严重受限。样本的缺乏导致难以用物理模型模拟其扩散,并且对PFAS在地表水中传输的科学理解有限。然而,描述土地覆盖、水文和工业活动的丰富地理空间和卫星衍生数据广泛可用。我们提出了FOCUS,一个用于PFAS污染测绘的地理空间深度学习框架,该框架将稀疏的PFAS观测与大规模环境背景(包括来自水文连通性、土地覆盖、污染源邻近性和采样距离的先验)相结合。这些先验被整合到一个原则性的、噪声感知的损失函数中,从而在稀疏标签下产生稳健的训练目标。通过广泛的消融实验、鲁棒性分析和实际验证,FOCUS始终优于包括稀疏分割、克里金法和污染物传输模拟在内的基线方法,同时在大区域上保持了空间一致性和可扩展性。我们的结果展示了AI如何通过提供筛查级风险图来支持环境科学,这些风险图可优先安排后续采样,并在缺乏完整物理模型的情况下帮助将潜在污染源与地表水污染模式联系起来。

英文摘要

Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scale monitoring remains severely limited due to the high cost and logistical challenges of field sampling. The lack of samples leads to difficulty simulating their spread with physical models and limited scientific understanding of PFAS transport in surface waters. Yet, rich geospatial and satellite-derived data describing land cover, hydrology, and industrial activity are widely available. We introduce FOCUS, a geospatial deep learning framework for PFAS contamination mapping that integrates sparse PFAS observations with large-scale environmental context, including priors derived from hydrological connectivity, land cover, source proximity, and sampling distance. These priors are integrated into a principled, noise-aware loss, yielding a robust training objective under sparse labels. Across extensive ablations, robustness analyses, and real-world validation, FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations, while preserving spatial coherence and scalability over large regions. Our results demonstrate how AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns in the absence of complete physical models.

2602.14913 2026-06-11 cs.LG eess.IV 版本更新

Coverage Guarantees for Pseudo-Calibrated Conformal Prediction under Distribution Shift

分布漂移下伪校准保形预测的覆盖保证

Farbod Siahkali, Ashwin Verma, Vijay Gupta

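Affiliations: Elmore Family School of Electrical and Computer Engineering, Purdue University

AI summary: Addresses the loss of conformal prediction coverage under distribution shift: using pseudo-calibration and domain adaptation tools, derives a lower bound on target coverage, and proposes inflating the conformal threshold by a slack parameter together with a source-tuned pseudo-calibration algorithm; experiments show that it mitigates coverage degradation.
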
Comments
Under review. 6 pages, 2 figures, 1 table
AI中文摘要

保形预测(CP)在可交换性假设下提供无分布边际覆盖保证,但当数据分布发生漂移时,这些保证可能失效。我们分析了在有限标签条件协变量漂移模型下,使用伪校准作为应对这种性能损失的工具。利用领域自适应的工具,我们根据分类器的源域损失和漂移的Wasserstein度量推导出目标覆盖的下界。利用这一结果,我们提供了一种设计伪校准集的方法,该方法通过松弛参数膨胀保形阈值,使目标覆盖保持在规定水平以上。最后,我们提出了一种源调优伪校准算法,该算法根据分类器的不确定性在硬伪标签和随机化标签之间进行插值。数值实验表明,我们的界限定性地跟踪了伪校准行为,并且源调优方案在分布漂移下缓解了覆盖退化,同时保持了非平凡的预测集大小。

英文摘要

Conformal prediction (CP) offers distribution-free marginal coverage guarantees under an exchangeability assumption, but these guarantees can fail if the data distribution shifts. We analyze the use of pseudo-calibration as a tool to counter this performance loss under a bounded label-conditional covariate shift model. Using tools from domain adaptation, we derive a lower bound on target coverage in terms of the source-domain loss of the classifier and a Wasserstein measure of the shift. Using this result, we provide a method to design pseudo-calibrated sets that inflate the conformal threshold by a slack parameter to keep target coverage above a prescribed level. Finally, we propose a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels as a function of classifier uncertainty. Numerical experiments show that our bounds qualitatively track pseudo-calibration behavior and that the source-tuned scheme mitigates coverage degradation under distribution shift while maintaining nontrivial prediction set sizes.
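The threshold inflation described above can be sketched for ordinary split conformal classification. The score `1 - p(class)` and the additive `slack` term are assumptions made for illustration; how the paper derives the slack from the source loss and the Wasserstein shift measure is not reproduced here.

```python
import numpy as np


def conformal_threshold(cal_scores, alpha, slack=0.0):
    """Split-conformal threshold, optionally inflated by a slack term.

    cal_scores : nonconformity scores on the calibration set
    alpha      : target miscoverage level (e.g. 0.1 for 90% coverage)
    slack      : hypothetical additive inflation to guard against shift
    """
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    # Standard finite-sample rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return float("inf")
    return float(scores[k - 1]) + slack


def prediction_set(probs, threshold):
    """Indices of classes whose score 1 - p(class) is within the threshold."""
    return np.flatnonzero(1.0 - np.asarray(probs) <= threshold)
```

A larger `slack` can only enlarge the prediction sets, which is the trade-off the abstract refers to when it mentions keeping set sizes nontrivial.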

2507.11688 2026-06-11 cs.LG 版本更新

Composing Linear Layers from Irreducibles

从不可约元组合线性层

Travis Pence, Daisuke Yamada, Vikas Singh

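Affiliations: University of Wisconsin-Madison

AI summary: Uses Clifford algebra to decompose linear layers into compositions of bivectors (geometric primitives) with only O(log^2 d) parameters, matching strong baselines on LLM attention projections.
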
Comments
35 Pages, 11 Tables, 6 Figures, Appearing in NeurIPS 2025
AI中文摘要

当代大型模型常表现出暗示存在低级基元的行为,这些基元组合成功能更丰富的模块,但这些基本构建块仍未被很好理解。我们通过询问:能否从最小几何基元集合中识别/合成线性变换?来研究线性层中的这种组合结构。利用Clifford代数,我们证明线性层可以表示为双向量(编码有向平面的几何对象)的组合,并引入一种可微算法将其分解为转子乘积。这种构造仅需O(log^2 d)个参数,而稠密矩阵需要O(d^2)。应用于LLM注意力层中的键、查询和值投影,我们的基于转子的层匹配了块Hadamard和低秩近似等强基线的性能。我们的发现为这些几何基元如何在深度模型中组合成更高层次功能提供了代数视角。

英文摘要

Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors -- geometric objects encoding oriented planes -- and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only O(log^2 d) parameters, versus O(d^2) required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.

2602.11995 2026-06-11 cs.LG 版本更新

Momentum LMS Theory beyond Stationarity: Stability, Tracking, and Regret

超越平稳性的动量LMS理论:稳定性、跟踪与遗憾

Yifei Jin, Xin Zheng, Lei Guo

Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; School of Mathematical Sciences, University of Chinese Academy of Sciences

AI summary: Studies the tracking performance and regret bounds of the momentum least mean squares algorithm in nonstationary, time-varying linear systems by analyzing second-order time-varying random vector difference equations, and demonstrates fast adaptation and robust tracking.

Comments
9 pages, 3 figures
AI中文摘要

在大规模数据处理场景中,数据通常以序列流的形式到达,这些序列由具有漂移分布和时变系统参数的复杂系统生成。这种非平稳性挑战了理论分析,因为它违反了i.i.d.(独立同分布)样本的经典假设,需要能够实时更新而无需昂贵重新训练的算法。一种有效的方法应在单次处理每个样本的同时,保持计算和内存复杂度与数据流长度无关。受这些挑战的启发,本文研究了动量最小均方(MLMS)算法作为自适应识别工具,利用其计算简单和在线处理能力。理论上,我们在各种实际条件下推导了MLMS在时变随机线性系统中的跟踪性能和遗憾界。与经典LMS不同,其稳定性可由一阶随机向量差分方程表征,而MLMS由于动量引入额外的动态状态,导致二阶时变随机向量差分方程,其稳定性分析依赖于更复杂的随机矩阵乘积,这构成了一个极具挑战性的问题。在合成和真实数据流上的实验表明,MLMS实现了快速适应和鲁棒跟踪,与我们的理论结果一致,尤其是在非平稳环境中,突显了其在现代流式和在线学习应用中的潜力。

英文摘要

In large-scale data processing scenarios, data often arrive in sequential streams generated by complex systems that exhibit drifting distributions and time-varying system parameters. This nonstationarity challenges theoretical analysis, as it violates classical assumptions of i.i.d. (independent and identically distributed) samples, necessitating algorithms capable of real-time updates without expensive retraining. An effective approach should process each sample in a single pass, while maintaining computational and memory complexities independent of the data stream length. Motivated by these challenges, this paper investigates the Momentum Least Mean Squares (MLMS) algorithm as an adaptive identification tool, leveraging its computational simplicity and online processing capabilities. Theoretically, we derive tracking performance and regret bounds for the MLMS in time-varying stochastic linear systems under various practical conditions. Unlike classical LMS, whose stability can be characterized by first-order random vector difference equations, MLMS introduces an additional dynamical state due to momentum, leading to second-order time-varying random vector difference equations whose stability analysis hinges on more complicated products of random matrices and is substantially more challenging. Experiments on synthetic and real-world data streams demonstrate that MLMS achieves rapid adaptation and robust tracking, in agreement with our theoretical results, especially in nonstationary settings, highlighting its promise for modern streaming and online learning applications.
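The second-order structure mentioned above shows up directly in a minimal momentum LMS sketch: each update depends on the two previous estimates. The update rule below is the common heavy-ball form of LMS; the step size and momentum values are illustrative assumptions, not the paper's settings.

```python
import numpy as np


def momentum_lms(phis, ys, step=0.05, momentum=0.5):
    """Single-pass momentum LMS for y_k = phi_k^T theta_k + noise.

    phis : (T, d) array of regressors, processed one row at a time
    ys   : (T,) array of observations
    Returns the (T, d) trajectory of parameter estimates.
    """
    dim = phis.shape[1]
    theta = np.zeros(dim)        # current estimate
    theta_prev = np.zeros(dim)   # previous estimate (the extra momentum state)
    history = []
    for phi, y in zip(phis, ys):
        error = y - phi @ theta
        # LMS correction plus a momentum term on the last parameter change.
        theta_next = theta + step * phi * error + momentum * (theta - theta_prev)
        theta_prev, theta = theta, theta_next
        history.append(theta.copy())
    return np.array(history)
```

Memory and per-sample cost depend only on the dimension `d`, not on the stream length, which matches the single-pass requirement stated in the abstract. Setting `momentum=0.0` recovers classical LMS.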

2602.11801 2026-06-11 cs.LG 版本更新

SpaTeoGL: Spatiotemporal Graph Learning for Interpretable Seizure Onset Zone Analysis from Intracranial EEG

SpaTeoGL: 用于颅内脑电图可解释癫痫发作起始区分析的时空图学习

Elham Rostami, Aref Einizade, Taous-Meriem Laleg-Kirati

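Affiliations: Inria Saclay, Palaiseau, France

AI summary: Proposes the SpaTeoGL framework, which jointly learns window-level spatial graphs and a temporal graph, solved by alternating optimization within a smooth graph signal processing framework, for interpretable seizure onset zone localization; on a multicenter iEEG dataset it is competitive with the baseline while improving non-SOZ identification.
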
Comments
5 pages, 4 figures
AI中文摘要

从颅内脑电图(iEEG)中准确定位癫痫发作起始区(SOZ)对癫痫手术至关重要,但受复杂时空发作动态的挑战。我们提出SpaTeoGL,一种用于可解释癫痫网络分析的时空图学习框架。SpaTeoGL联合学习捕捉iEEG电极间相互作用的窗口级空间图,以及基于空间结构相似性连接时间窗口的时间图。该方法在平滑图信号处理框架内制定,并通过具有收敛保证的交替块坐标下降算法求解。在具有成功手术结果的多中心iEEG数据集上的实验表明,SpaTeoGL与基于水平可见图与逻辑回归的基线方法相比具有竞争力,同时改善了非SOZ识别,并为癫痫发作起始和传播动态提供了可解释的见解。

英文摘要

Accurate localization of the seizure onset zone (SOZ) from intracranial EEG (iEEG) is essential for epilepsy surgery but is challenged by complex spatiotemporal seizure dynamics. We propose SpaTeoGL, a spatiotemporal graph learning framework for interpretable seizure network analysis. SpaTeoGL jointly learns window-level spatial graphs capturing interactions among iEEG electrodes and a temporal graph linking time windows based on similarity of their spatial structure. The method is formulated within a smooth graph signal processing framework and solved via an alternating block coordinate descent algorithm with convergence guarantees. Experiments on a multicenter iEEG dataset with successful surgical outcomes show that SpaTeoGL is competitive with a baseline based on horizontal visibility graphs and logistic regression, while improving non-SOZ identification and providing interpretable insights into seizure onset and propagation dynamics.

2602.10743 2026-06-11 cs.LG 版本更新

Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking

Kalman线性注意力:用于高效语言建模和状态跟踪的并行贝叶斯滤波

Vaisakh Shaj, Cameron Barker, Aidan Scannell, Andras Szecsenyi, Elliot J. Crowley, Amos Storkey

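Affiliations: University of Cambridge

AI summary: Proposes the Kalman Linear Attention layer, which recasts sequence mixing as exact Bayesian filtering in information form, enabling time-parallel inference; it is more expressive than GLA at the same computational cost and solves state-tracking tasks that linear SSMs and attention cannot.
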
Comments
Accepted at ICML 2026. An earlier version of this work was presented at the 1st Workshop on Epistemic Intelligence in Machine Learning (EIML) at EurIPS 2025
AI中文摘要

状态空间语言模型如Mamba和门控线性注意力(GLA)提供了线性复杂度、可并行的Transformer替代方案,但其线性状态更新限制了表达力和鲁棒的状态跟踪。我们从概率角度弥合这一差距,将序列混合视为精确贝叶斯滤波,以卡尔曼滤波为核心原语。经典卡尔曼滤波提供有原则的状态和不确定性估计,但被认为是固有顺序的;我们展示了将其重参数化为信息形式后,更新变为关联扫描——因此每个token的循环更新是非线性的(莫比乌斯/精度递归),但保持时间并行。由此产生的Kalman线性注意力(KLA)层是一个即插即用的序列混合器,执行时间并行概率推理,携带显式的信念状态不确定性,并且在相同计算成本下比GLA风格的线性更新具有严格更强的表达力。这种表达力直接转化为更强的状态跟踪:KLA解决了线性SSM和注意力无法解决的排列组合($A_5$)任务,同时保持扫描并行。作为即插即用原语,它在合成token操作和零样本常识基准测试中匹配或改进了现代SSM和GLA,并且是首批在十亿token规模下训练的堆叠贝叶斯滤波原语之一。

英文摘要

State-space language models such as Mamba and gated linear attention (GLA) offer linear-complexity, parallelisable alternatives to transformers, but their linear state updates limit expressivity and robust state tracking. We close this gap from a probabilistic angle, casting sequence mixing as exact Bayesian filtering with the Kalman filter as the core primitive. Classical Kalman filters give principled state and uncertainty estimates but are viewed as inherently sequential; we show that reparameterising them in information form turns their updates into an associative scan - so the per-token recurrent update is non-linear (a Möbius/precision recursion) yet remains temporally parallel. The resulting Kalman Linear Attention (KLA) layer is a drop-in sequence mixer that performs time-parallel probabilistic inference, carries an explicit belief-state uncertainty, and is strictly more expressive than GLA-style linear updates at the same computational cost. This expressivity translates directly into stronger state tracking: KLA solves permutation-composition ($A_5$) tasks that linear SSMs and attention cannot, while staying scan-parallel. As a drop-in primitive it also matches or improves on modern SSMs and GLAs across synthetic token-manipulation and zero-shot commonsense benchmarks, and is among the first stacked Bayesian-filtering primitives trained at the billion-token scale.
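Why a nonlinear Kalman recursion can still be scanned is easiest to see in the scalar case. In the sketch below, the posterior precision of a scalar Kalman filter is written as a Möbius map of the previous precision, and Möbius maps compose by 2×2 matrix multiplication, which is associative. This is an illustration of the principle only, under assumed scalar dynamics `a`, process noise `q`, observation gain `h`, and observation noise `r`; it is not the KLA layer.

```python
import numpy as np


def mobius_matrix(a, q, h, r):
    """2x2 matrix of the map lam -> 1 / (a^2 / lam + q) + h^2 / r."""
    s = h * h / r
    return np.array([[1.0 + q * s, a * a * s],
                     [q,           a * a]])


def sequential_precisions(lam0, params):
    """Textbook recursion: predict the variance, then add observation precision."""
    lam, out = lam0, []
    for a, q, h, r in params:
        prior_var = a * a / lam + q
        lam = 1.0 / prior_var + h * h / r
        out.append(lam)
    return np.array(out)


def scanned_precisions(lam0, params):
    """Same precisions from prefix products of Mobius matrices.

    The prefix products are accumulated in a loop here for clarity; because
    matrix multiplication is associative, they could equally be computed by
    a parallel prefix scan.
    """
    prefix, out = np.eye(2), []
    for p in params:
        prefix = mobius_matrix(*p) @ prefix
        num, den = prefix @ np.array([lam0, 1.0])
        out.append(num / den)
    return np.array(out)
```

Both functions return the same sequence up to floating-point error, even though the per-step update is nonlinear in the precision.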

2602.10392 2026-06-11 cs.LG 版本更新

Tensor Methods: A Unified and Interpretable Approach for Material Design

张量方法:一种统一且可解释的材料设计方法

Shaan Pakala, Aldair E. Gongora, Brian Giera, Evangelos E. Papalexakis

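Affiliations: University of California, Riverside; Dept. of Computer Science & Engineering; Lawrence Livermore National Laboratory; Materials Engineering Division; Data Science Institute

AI summary: Proposes tensor completion as a unified framework for material design that offers both interpretability and predictive performance; under non-uniform sampling it outperforms conventional machine learning, improving R² by up to 5% and halving error in some out-of-distribution regions.
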
Comments
Accepted to ACM SIGKDD 2026 AI for Sciences track
AI中文摘要

在设计新材料时,通常需要根据所需性能定制材料设计。随着设计参数数量的增长,搜索空间呈指数级增长,使得所有材料组合的实际合成和评估几乎不可能。即使使用有限元分析等传统计算方法,搜索设计空间也变得过于计算密集。近期方法使用机器学习(ML)代理模型来更高效地确定最优材料设计;不幸的是,这些方法通常(i)难以解释,且(ii)当训练数据来自设计空间的非均匀采样时性能不佳。我们建议使用张量补全方法作为可解释性和预测的统一方法。我们观察到经典张量方法在预测上能够与传统ML竞争,并且额外具有可解释的张量因子(作为预测的副产品完全免费获得)。在我们的实验中,我们能够通过张量因子重新发现物理现象,表明我们的预测与问题的真实底层物理一致。这也意味着,鉴于我们能够重新发现现有模式,实验人员可以利用这些张量因子识别潜在的新模式。我们还研究了当遇到来自设计空间非均匀采样的训练数据时,两种代理模型的效果。我们观察到更专门的张量方法在这些非均匀采样场景下能够提供更好的泛化能力。我们发现最佳的泛化来自一个张量模型,它在总体$R^2$上比基线ML方法提升高达5%,并在某些分布外区域将误差减半。

英文摘要

When designing new materials, it is often necessary to tailor the material design to have some desired properties. As the set of design parameters grows, the search space grows exponentially, making the actual synthesis and evaluation of all material combinations virtually impossible. Even traditional computational methods such as Finite Element Analysis become too computationally heavy for searching the design space. Recent methods use machine learning (ML) surrogate models to more efficiently determine optimal material designs; unfortunately, these methods often (i) are notoriously difficult to interpret and (ii) underperform when the training data comes from a non-uniform sampling of the design space. We suggest the use of tensor completion methods as an all-in-one approach for interpretability and predictions. We observe that classical tensor methods are able to compete with traditional ML in predictions, with the added benefit of their interpretable tensor factors (which are given completely for free, as a result of the prediction). In our experiments, we are able to rediscover physical phenomena via the tensor factors, indicating that our predictions are aligned with the true underlying physics of the problem. This also means these tensor factors could be used by experimentalists to identify potentially novel patterns, given we are able to rediscover existing ones. We also study the effects of both types of surrogate models when we encounter training data from a non-uniform sampling of the design space. We observe that more specialized tensor methods can give better generalization in these non-uniform sampling scenarios. We find the best generalization comes from a tensor model, which is able to improve upon the baseline ML methods by up to 5% on aggregate $R^2$, and halve the error in some out-of-distribution regions.

2602.09591 2026-06-11 cs.CL cs.AI cs.LG 版本更新

On the Optimal Reasoning Length for RL-Trained Language Models

关于RL训练的语言模型的最优推理长度

Daisuke Nohara, Taishi Nakamura, Rio Yokota

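Affiliations: University of Tokyo

AI summary: Studies the non-monotonic relationship between reasoning length and accuracy in RL-trained language models, finds an optimal intermediate length, and uses mode-accuracy analysis to explain its cause.
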
Comments
18 pages, 12 figures
AI中文摘要

强化学习显著提高了大型语言模型的推理能力,但也倾向于延长思维链输出并增加计算成本。尽管已经提出了长度控制方法,但它们所引发的长度-准确率关系仍不清楚。我们在受控设置下,在多个基础模型上使用几种长度控制方法训练策略,发现在数学推理和代码生成中,准确率随输出长度呈非单调变化,在中间值达到峰值。然而,即使在样本准确率趋于平稳或下降的情况下,模式准确率仍随长度持续提高,这表明非单调的长度-准确率关系是由围绕越来越正确的中心的分散性驱动的。

英文摘要

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.
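The contrast between sample accuracy and mode accuracy can be made concrete with a small sketch. The definitions below are assumptions based on the usual meaning of the terms (mode accuracy as the correctness of the most frequent sampled answer per question); the paper's exact definitions may differ.

```python
from collections import Counter


def sample_accuracy(samples, gold):
    """Fraction of all sampled answers that are correct.

    samples : list of lists, the sampled answers for each question
    gold    : list of reference answers, one per question
    """
    total = sum(len(answers) for answers in samples)
    correct = sum(ans == g for answers, g in zip(samples, gold) for ans in answers)
    return correct / total


def mode_accuracy(samples, gold):
    """Fraction of questions whose most frequent sampled answer is correct."""
    hits = 0
    for answers, g in zip(samples, gold):
        mode_answer, _ = Counter(answers).most_common(1)[0]
        hits += mode_answer == g
    return hits / len(gold)
```

For example, with `samples = [["4", "4", "5", "7"], ["9", "9", "9", "1"]]` and `gold = ["4", "9"]`, mode accuracy is 1.0 while sample accuracy is only 0.625: the center of each answer distribution is correct, and the gap comes from dispersion around it.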

2602.09533 2026-06-11 cs.AI 版本更新

Autoregressive Direct Preference Optimization

自回归直接偏好优化

Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

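Affiliations: University of Tokyo

AI summary: Proposes Autoregressive Direct Preference Optimization (ADPO), which introduces the autoregressive assumption explicitly before applying the Bradley-Terry model; the resulting loss moves the summation in the DPO objective outside the log-sigmoid function, and the analysis distinguishes two length measures, token length μ and feedback length μ'.
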
Comments
ICML 2026
AI中文摘要

直接偏好优化(DPO)已成为将大型语言模型(LLMs)与人类偏好对齐的一种有前景的方法。然而,对响应级Bradley-Terry(BT)模型的广泛依赖可能限制了其全部潜力,因为参考模型和可学习模型仅在推导目标函数后才被假定为自回归。受此限制的启发,我们重新审视DPO的理论基础,并提出一种新的公式,在应用BT模型之前显式引入自回归假设。通过重新表述和扩展DPO,我们推导出一种新的变体,称为自回归DPO(ADPO),它将自回归建模显式整合到偏好优化框架中。在不违反理论基础的情况下,推导出的损失采用了一种优雅的形式:它将DPO目标中的求和操作移至log-sigmoid函数外部。此外,通过对ADPO的理论分析,我们表明在设计基于DPO的算法时需要考虑两种长度度量:token长度μ和反馈长度μ'。据我们所知,我们是第一个明确区分这两种度量并分析它们对LLMs中偏好优化影响的工作。

英文摘要

Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length $\mu$ and the feedback length $\mu'$. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.
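The statement that the summation moves outside the log-sigmoid can be written out schematically. The first equation is the standard DPO loss with the autoregressive factorization made explicit; the second is only a schematic reading of the abstract, in which the per-token log-ratios $r_t^{w}, r_t^{l}$, their pairing, and the index range of the sum are assumptions, since the abstract does not give them (the paper ties them to $\mu$ and $\mu'$).

```latex
% Standard DPO: the token sums sit inside the log-sigmoid.
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}\left[
      \log \sigma\!\left(
        \beta \sum_{t} r_t^{w} \;-\; \beta \sum_{t} r_t^{l}
      \right)
    \right],
\qquad
r_t^{y} = \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}.

% Schematic ADPO-style form (assumed): the sum moves outside the log-sigmoid.
\mathcal{L}_{\mathrm{ADPO}}
  \approx -\,\mathbb{E}\left[
      \sum_{t} \log \sigma\!\left( \beta\, r_t^{w} - \beta\, r_t^{l} \right)
    \right].
```

Here $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the reference model, and $\beta$ is the usual DPO temperature.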

2602.08735 2026-06-11 cs.CV 版本更新

From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

从对应到动作:多模态大语言模型中类人多图像空间推理

Masanari Oi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue, Naoaki Okazaki

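Affiliations: University of Tokyo

AI summary: Proposes the HATCH framework, whose two objectives, patch-level spatial alignment and action-then-answer reasoning, improve multi-image spatial reasoning in multimodal large language models, outperforming comparably sized baselines on three benchmarks.
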
Comments
ICML 2026
AI中文摘要

尽管多模态大语言模型(MLLMs)在单图像空间推理方面取得了实质性进展,但多图像空间推理(需要整合来自多个视角的信息)仍然具有挑战性。认知研究表明,人类通过两种机制处理此类任务:跨视图对应(识别不同视图中对应于相同物理位置的区域)和逐步视角变换(顺序组合相对视角变化)。然而,现有研究仅部分且通常隐式地整合这些机制,没有对两者进行显式监督。我们提出了用于跨视图对应和视角变化的类人感知训练(HATCH),这是一个具有两个互补目标的训练框架:(1)补丁级空间对齐,鼓励补丁表示在空间对应区域跨视图对齐;(2)动作-答案推理,要求模型在预测最终答案之前生成显式的视角转换动作。在三个基准上的实验表明,HATCH以明显优势持续优于同规模基线,并与更大的模型相比取得了有竞争力的结果,同时保持了单图像推理能力。

英文摘要

While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.

2602.07840 2026-06-11 cs.IR cs.AI 版本更新

SAGE: Scalable AI Governance & Evaluation

SAGE: 可扩展的人工智能治理与评估

Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

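Affiliations: LinkedIn Corporation

AI summary: Proposes the SAGE framework, which uses a bidirectional calibration loop to turn high-quality human product judgment into a scalable evaluation signal, addressing the governance gap in relevance evaluation for large-scale search systems; distilled judges cost 92× less and support model iteration and policy oversight.
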
AI中文摘要

在大规模搜索系统中评估相关性本质上受到人类监督与生产系统高吞吐要求之间的治理差距的限制。传统方法依赖于参与代理或稀疏手动审查,但这些方法往往无法捕捉高影响的相关性失败的全部范围。我们提出了SAGE(可扩展的人工智能治理与评估)框架,该框架将高质量的人类产品判断作为可扩展的评估信号。SAGE的核心是一个双向校准循环,其中自然语言政策、精心编写的先例和一个LLM替代法官共同进化。SAGE系统性地解决语义模糊和不一致,将主观的相关性判断转化为可执行的多维标准,具有接近人类水平的一致性。为了弥合前沿模型推理与工业级推理之间的差距,我们应用教师-学生蒸馏技术,将高保真判断转移到紧凑的学生替代体,成本降低92倍。SAGE部署在LinkedIn搜索生态系统中,通过模拟驱动开发指导模型迭代,蒸馏出符合政策的模型用于在线服务,并实现快速的离线评估。在生产环境中,它推动了政策监督,测量了升级的模型变体并检测到无法被参与指标检测到的回归。集体上,这些措施推动了LinkedIn每日活跃用户的0.25%提升。

英文摘要

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.

2602.08986 2026-06-11 cs.LG cs.AI 版本更新

Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

改进分层多标签学习中稀有节点的检测

Isaac Xu, Martin Gillis, Ayushi Sharma, Benjamin Misiuk, Craig J. Brown, Thomas Trappenberg

发表机构 * Faculty of Computer Science(计算机科学学院) Dalhousie University(达尔豪斯大学) Department of Geography(地理系) Memorial University of Newfoundland(纽芬兰纪念大学) Department of Oceanography(海洋学系)

AI总结 针对分层多标签分类中稀有节点检测困难的问题,提出结合节点不平衡加权和焦点加权的损失函数,利用集成不确定性量化,在基准数据集上将召回率提升至五倍,并显著提高F1分数。

详情
Comments
Accepted for publication in Transactions on Machine Learning Research (TMLR), 2026
AI中文摘要

在分层多标签分类中,一个持续的挑战是使模型预测能够达到层次结构的更深层次,以实现更详细或更细粒度的分类。这一困难部分源于某些类别(或层次节点)的自然稀有性,以及确保子节点几乎总是比其父节点频率更低的分层约束。为了解决这个问题,我们为神经网络提出了一种加权损失目标,该目标结合了节点不平衡加权和焦点加权组件,后者利用了集成不确定性的现代量化。通过强调稀有节点而非稀有观测(数据点),并在训练过程中关注每个模型输出分布中的不确定节点,我们观察到在基准数据集上召回率提高了高达五倍,并且$F_{1}$分数有统计显著的提升。我们还展示了我们的方法有助于卷积网络处理具有挑战性的任务,例如在编码器次优或数据有限的情况下。

英文摘要

In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.
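
The combined objective described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' loss: the helper names `node_weights` and `weighted_focal_bce` are hypothetical, inverse-frequency weighting is assumed for the node-wise imbalance term, and the standard focal factor (1 - p_t)^gamma stands in for the ensemble-uncertainty weighting that the paper uses.

```python
import math

def node_weights(label_matrix):
    """Inverse-frequency weight per hierarchy node (column), normalised to mean 1.

    Assumption: weights depend on how rare a node is across the dataset,
    not on how rare an individual observation is.
    """
    n_rows = len(label_matrix)
    n_nodes = len(label_matrix[0])
    freqs = [sum(row[j] for row in label_matrix) / n_rows for j in range(n_nodes)]
    raw = [1.0 / max(f, 1e-6) for f in freqs]
    mean_raw = sum(raw) / n_nodes
    return [w / mean_raw for w in raw]

def weighted_focal_bce(probs, targets, weights, gamma=2.0):
    """Mean over nodes of w_j * (1 - p_t)^gamma * -log(p_t).

    The focal factor down-weights nodes the model already predicts confidently.
    """
    total = 0.0
    for p, y, w in zip(probs, targets, weights):
        p_t = p if y == 1 else 1.0 - p          # probability assigned to the true label
        p_t = min(max(p_t, 1e-7), 1.0)          # avoid log(0)
        total += w * (1.0 - p_t) ** gamma * -math.log(p_t)
    return total / len(probs)
```

In this sketch a rare child node receives a larger weight than its always-present parent, and an uncertain prediction contributes more to the loss than a confident correct one.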

2509.23248 2026-06-11 cs.AI cs.NI 版本更新

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

面向移动边缘通用智能的资源感知LLM推理

Mingyi Luo, Ruichen Zhang, Xiangwang Hou, Jun Du, Chunxiao Jiang, Yong Ren, Shiwen Mao

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen(清华大学深圳国际研究生院,清华大学,深圳) College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院,新加坡) Department of Electronic Engineering, Tsinghua University, Beijing(清华大学电子工程系,北京) State Key Laboratory of Space Network and Communications, Tsinghua University, Beijing(空间网络与通信国家重点实验室,清华大学,北京) Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing(北京信息科学与技术国家研究中心,清华大学,北京) Department of Electrical and Computer Engineering, Auburn University, Auburn, USA(奥本大学电气与计算机工程系,奥本,美国)

AI总结 提出联合优化框架,通过自适应CoT提示和分布式MoE架构协同优化推理深度、专家激活和传输功率,在资源受限的移动边缘环境中实现LLM高效推理,推理质量与资源效率平衡,额外推理时间小于1秒时准确率和延迟满足率均达90%。

详情
AI中文摘要

大型语言模型(LLM)的快速发展催生了具有强大推理和自主决策能力的智能体人工智能(AI)。与边缘计算的集成推动了移动边缘通用智能(MEGI)的发展,将实时、隐私保护的推理带到网络边缘。然而,在MEGI环境中部署基于LLM的智能体AI推理面临重大挑战,原因是推理的高计算需求与边缘设备的有限资源。为应对这些挑战,我们提出了一种在MEGI中高效部署LLM推理的联合优化框架。首先,我们系统回顾增强方法,识别适合边缘适配的机制。随后,我们提出一个分布式框架,通过自适应思维链(CoT)提示协同推理增强,并通过分布式专家混合(MoE)架构实现可扩展部署。该方法的一个重要创新是将推理深度建模为动态网络资源变量,并与专家激活和传输功率联合优化。该机制使系统能够根据任务需求和设备能力动态调节专家网络和推理复杂度。在移动边缘环境中的实验评估表明,所提框架有效平衡了推理质量和资源效率。结果显示,在额外推理时间小于1秒的情况下,准确率和延迟满足率均可达到90%,验证了在资源受限的MEGI系统中部署复杂LLM推理的实际可行性。

英文摘要

The rapid advancement of large language models (LLMs) has enabled the emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. Its integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we systematically review enhancement methods to identify mechanisms suitable for edge adaptation. Subsequently, we present a distributed framework that synergizes reasoning enhancement via adaptive chain-of-thought (CoT) prompting with scalable deployment through a distributed mixture-of-experts (MoE) architecture. An important innovation of this approach is modeling reasoning depth as a dynamic network resource variable, which is optimized jointly with expert activation and transmission power. This mechanism allows the system to dynamically regulate expert networks and reasoning complexity according to task requirements and device capabilities. Experimental evaluations in mobile edge environments demonstrate that the proposed framework effectively balances reasoning quality and resource efficiency. The results show that with less than one second of additional inference time, both accuracy and latency satisfaction rate can reach 90%, validating the practical viability of deploying sophisticated LLM reasoning in resource-constrained MEGI systems.

2601.00181 2026-06-11 cs.CL cs.AI 版本更新

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

对话中的因果情绪识别:上下文饱和与话语标记证据

Cheonkam Jeong, Adeline Nyamathi

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 通过系统消融实验发现对话上下文对情绪识别性能起主导作用但快速饱和,并揭示悲伤情绪与左边缘话语标记使用减少及更高上下文依赖性的关联。

详情
AI中文摘要

我们解决了对话情绪识别中两个长期存在的空白:哪些建模选择实质性地影响性能,以及识别结果如何与可解释的话语层面模式相关联。我们通过在IEMOCAP上进行系统研究并在MELD上进行跨数据集验证来研究这两个问题。对于识别,我们使用10个随机种子进行受控消融实验,并进行多重比较校正的配对显著性检验,得到三个发现。首先,对话上下文是主导因素,但性能快速饱和:大约90%的性能提升来自最近的前10-30轮对话,具体取决于标签集。其次,层级句子表示仅在仅话语设置中帮助最大,并在MELD上显示出明显优势,但一旦轮次级别的上下文可用,其益处消失,表明对话历史吸收了大量话语内部结构。第三,整合外部情感词典不会改善结果,这与预训练编码器已经捕获ERC所需的大部分情感信号一致。在严格因果设置下,我们的简单模型实现了强性能(4-way 82.69%;6-way加权F1 67.07%),表明无需未来轮次即可达到竞争性准确率。对于语言分析,我们检查了5,286个话语标记出现,发现情绪与标记位置之间存在可靠关联(p <.0001)。悲伤话语的左边缘标记使用率(21.9%)低于其他情绪(28-32%),这与左边缘标记与主动话语管理相关的观点一致。这与我们的识别结果一致,其中悲伤从对话上下文中获益最多(+22个百分点),表明悲伤可能比具有更强局部语用线索的情绪更依赖于上下文。

英文摘要

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p <.0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.

2602.06868 2026-06-11 cs.RO 版本更新

Consensus-based optimization (CBO): Towards Global Optimality in Robotics

基于共识的优化(CBO):迈向机器人学的全局最优性

Xudong Sun, Armand Jordana, Massimo Fornasier, Jalal Etesami, Majid Khadiv

发表机构 * Munich Center for Machine Learning (MCML), Munich, Germany(慕尼黑机器学习中心(MCML),德国慕尼黑)

AI总结 提出将共识优化(CBO)引入机器人学,在温和假设下保证收敛到全局最优,并在三个挑战性轨迹优化场景中优于现有方法。

详情
AI中文摘要

零阶优化最近在机器人系统的最优轨迹和策略设计中受到显著关注。然而,大多数现有方法(如MPPI、CEM和CMA-ES)本质上是局部的,因为它们依赖于梯度估计。在本文中,我们将基于共识的优化(CBO)引入机器人学,该方法在温和假设下保证收敛到全局最优。我们提供了理论分析和说明性示例,以直观理解CBO与现有方法之间的根本差异。为了展示CBO在机器人问题上的可扩展性,我们考虑了三个具有挑战性的轨迹优化场景:(1)一个简单系统的长时域问题,(2)一个高度欠驱动系统的动态平衡问题,以及(3)一个仅具有终端成本的高维问题。我们的结果表明,在所有三个具有挑战性的设置中,CBO相对于现有方法能够实现更低的成本。这为研究机器人学中的全局轨迹优化开辟了一个新框架。

英文摘要

Zero-order optimization has recently received significant attention for designing optimal trajectories and policies for robotic systems. However, most existing methods (e.g., MPPI, CEM, and CMA-ES) are local in nature, as they rely on gradient estimation. In this paper, we introduce consensus-based optimization (CBO) to robotics, which is guaranteed to converge to a global optimum under mild assumptions. We provide theoretical analysis and illustrative examples that give intuition into the fundamental differences between CBO and existing methods. To demonstrate the scalability of CBO for robotics problems, we consider three challenging trajectory optimization scenarios: (1) a long-horizon problem for a simple system, (2) a dynamic balance problem for a highly underactuated system, and (3) a high-dimensional problem with only a terminal cost. Our results show that CBO achieves lower costs than existing methods in all three challenging settings. This opens a new framework for studying global trajectory optimization in robotics.
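
To make the idea of consensus-based optimization concrete, here is a small generic sketch of the basic CBO iteration: particles drift toward a Gibbs-weighted consensus point and receive noise scaled by their distance to it. It is not the paper's implementation. The function names, the anisotropic (per-coordinate) noise, and all parameter values (`alpha`, `lam`, `sigma`, `dt`, the sampling box) are assumptions chosen for illustration.

```python
import math
import random

def consensus_point(particles, f, alpha):
    """Gibbs-weighted average of the particles; low-cost particles dominate."""
    costs = [f(x) for x in particles]
    c_min = min(costs)  # shift costs so the exponentials stay numerically stable
    weights = [math.exp(-alpha * (c - c_min)) for c in costs]
    total = sum(weights)
    dim = len(particles[0])
    return [sum(w * x[k] for w, x in zip(weights, particles)) / total
            for k in range(dim)]

def cbo_minimize(f, dim, n_particles=200, steps=200, alpha=30.0,
                 lam=1.0, sigma=0.7, dt=0.1, box=3.0, seed=0):
    """Run a basic CBO loop and return the final consensus point."""
    rng = random.Random(seed)
    particles = [[rng.uniform(-box, box) for _ in range(dim)]
                 for _ in range(n_particles)]
    for _ in range(steps):
        v = consensus_point(particles, f, alpha)
        for x in particles:
            for k in range(dim):
                diff = x[k] - v[k]
                # Drift toward the consensus point plus noise proportional
                # to the current distance from it (vanishes at consensus).
                x[k] += (-lam * diff * dt
                         + sigma * diff * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return consensus_point(particles, f, alpha)

if __name__ == "__main__":
    quadratic = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2
    print(cbo_minimize(quadratic, dim=2))
```

Only function evaluations are used, so the loop needs no gradients; the weighting parameter `alpha` controls how strongly the consensus point concentrates on the best particles.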

2602.03282 2026-06-11 cs.CV cs.AI 版本更新

Global Geometry Is Not Enough for Vision Representations

全局几何不足以用于视觉表示

Jiwan Chung, Seon Joo Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过实验发现全局嵌入几何与组合绑定能力几乎无关,而输入-输出雅可比矩阵衡量的功能敏感性可靠地追踪该能力,并分析指出这是由于现有损失函数显式约束嵌入几何但未约束局部输入-输出映射所致。

详情
AI中文摘要

表示学习中的一个常见假设是,全局分布良好的嵌入支持鲁棒且可泛化的表示。这一关注点塑造了训练目标和评估协议,隐含地将全局几何视为表示能力的代理。虽然全局几何有效地编码了哪些元素存在,但它通常对元素如何组合不敏感。我们通过测试几何度量预测跨多种视觉编码器的组合绑定的能力来研究这一局限性。我们发现,基于标准几何的统计量与组合绑定几乎无相关性。相比之下,由输入-输出雅可比矩阵衡量的功能敏感性可靠地追踪这一能力。我们进一步提供了分析性解释,表明这种差异源于目标设计,因为现有损失显式约束嵌入几何,但未约束局部输入-输出映射。这些结果表明,全局嵌入几何仅捕捉了表示能力的部分视图,并将功能敏感性确立为建模复合结构的关键补充轴。

英文摘要

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across a diverse suite of vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

2602.02726 2026-06-11 cs.LG cs.CL 版本更新

Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

向量量化潜在概念:聚类式概念发现的可扩展替代方案

Xuemin Yu, Ankur Garg, Samira Ebrahimi Kahou, Hassan Sajjad

发表机构 * Dalhousie University, Canada(加拿大达尔豪斯大学) University of Calgary, Canada(加拿大卡尔加里大学)

AI总结 提出VQLC框架,通过向量量化学习离散潜在概念,在保持可解释性的同时,实现与K-Means相当的计算效率,并优于层次聚类在大规模数据上的扩展性。

详情
AI中文摘要

大型语言模型(LLMs)在其隐藏状态中编码了丰富的语义信息,但理解这些内部表示捕获了哪些信息仍然困难。从隐藏状态中提取的潜在概念为解释LLMs提供了有希望的方向,但现有的基于聚类的方法面临权衡:层次聚类产生连贯的概念,但由于其二次内存成本而仅限于小数据集,而K-Means高效扩展但可能产生语义连贯性较差的概念。我们提出向量量化潜在概念(VQLC),一种离散概念学习框架,在冻结的隐藏状态上学习潜在概念的码本。在12个数据集-模型设置中,VQLC在计算成本上接近K-Means,扩展性优于层次聚类,并在忠实度上保持竞争力,在仅解码器模型上增益最明显。基于LLMs的评估、定性分析和稀疏自编码器(SAE)比较表明,学习到的概念是可解释且任务相关的。

英文摘要

Large language models (LLMs) encode rich semantic information in their hidden states, yet it remains difficult to understand what information these internal representations capture. Latent concepts extracted from hidden states offer a promising direction for interpreting LLMs, but existing clustering-based methods face a trade-off: hierarchical clustering produces coherent concepts but is limited to small datasets due to its quadratic memory cost, while K-Means scales efficiently but may yield less semantically coherent concepts. We propose Vector Quantized Latent Concepts (VQLC), a discrete concept learning framework that learns a codebook of latent concepts on frozen hidden states. Across 12 dataset-model settings, VQLC stays close to K-Means in computational cost, scales better than hierarchical clustering, and remains competitive in faithfulness, with the clearest gains on decoder-only models. LLM-based evaluation, qualitative analysis, and a Sparse Autoencoder (SAE) comparison demonstrate that the learned concepts are interpretable and task-relevant.
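
The core operation behind a codebook of latent concepts is nearest-code assignment, which can be sketched as below. This is a generic vector-quantization illustration, not VQLC itself: the function names are hypothetical, and the exponential-moving-average codebook update is one common VQ choice that the abstract does not confirm.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(vector, codebook[j]))

def ema_update(codebook, vectors, decay=0.9):
    """One EMA step: move each used code toward the mean of its assigned vectors."""
    assigned = {}
    for v in vectors:
        assigned.setdefault(nearest_code(v, codebook), []).append(v)
    for j, members in assigned.items():
        mean = [sum(col) / len(members) for col in zip(*members)]
        codebook[j] = [decay * c + (1 - decay) * m
                       for c, m in zip(codebook[j], mean)]
    return codebook
```

Each assignment is a single pass over the codebook, so memory stays linear in the number of hidden states, in contrast to the quadratic pairwise-distance matrix that hierarchical clustering needs.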

2512.16415 2026-06-11 cs.CV 版本更新

CountZES: Counting via Zero-Shot Exemplar Selection

CountZES: 通过零样本示例选择进行计数

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对零样本计数中示例质量差导致计数不准的问题,提出CountZES方法,通过检测锚定、密度引导和特征共识三阶段协同选择多样化示例,提升计数准确性。

详情
AI中文摘要

在零样本(ZS)设置下,复杂场景中的目标计数尤其具有挑战性,其中仅使用类别名称对未见类别的实例进行计数。现有的ZS计数方法通常依赖现成的开放词汇检测器(OVD)从文本推断示例,但在密集场景中,这些方法会受到语义噪声、外观变异和多实例提议的影响。或者,采用随机图像块采样,但无法准确描绘目标实例。由于计数对示例质量敏感,此类选择策略通常产生代表性差的示例,导致计数估计不准确。为解决这些问题,我们提出CountZES,一种通过零样本示例选择进行目标计数的纯推理方法。CountZES通过三个协同阶段发现多样化的示例:检测锚定示例(DAE)、密度引导示例(DGE)和特征共识示例(FCE)。DAE细化OVD检测以分离出精确的单实例示例。DGE引入密度驱动的自监督范式,识别统计一致且语义紧凑的示例,而FCE通过特征空间聚类增强视觉一致性。这些阶段共同产生互补的示例集,平衡了文本基础、计数一致性和特征代表性。在多个数据集上的实验表明,CountZES在零样本计数方法中表现出优越性能,同时有效跨领域泛化。

英文摘要

Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. Since counting is sensitive to exemplar quality, such selection strategies often yield poorly representative exemplars, leading to inaccurate count estimation. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate that CountZES achieves superior performance among zero-shot counting methods while generalizing effectively across domains.

2602.02465 2026-06-11 cs.AI cs.CV cs.LG 版本更新

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

MentisOculi: 揭示心智图像推理的局限性

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息研究所)

AI总结 提出MentisOculi基准,通过多步推理问题测试前沿模型利用视觉表示辅助推理的能力,发现视觉策略普遍无法提升性能,且统一多模态模型存在生成错误累积和无法利用真实可视化的问题。

详情
Comments
9 pages, 8 figures, Accepted at ICML 2026
AI中文摘要

前沿模型正从仅摄入视觉信息的多模态大语言模型(MLLMs)过渡到能够原生交错生成的统一多模态模型(UMMs)。这一转变激发了将中间可视化作为推理辅助的兴趣,类似于人类的心智图像。这一想法的核心是能够以目标导向的方式形成、维护和操作视觉表示。为了评估和探究这一能力,我们开发了MentisOculi,这是一个程序化的、分层的多步推理问题套件,适用于视觉解决方案,旨在挑战前沿模型。评估从潜在令牌到显式生成图像的视觉策略,我们发现它们通常无法提升性能。对UMMs的分析特别揭示了一个关键限制:虽然它们拥有解决任务的文本推理能力,并且有时能生成正确的视觉内容,但它们遭受复合生成错误,并且无法利用甚至真实的可视化。我们的发现表明,尽管视觉思维具有内在吸引力,但尚未有益于模型推理。MentisOculi为分析和弥合不同模型家族之间的这一差距建立了必要的基础。

英文摘要

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

2602.02285 2026-06-11 cs.LG cs.CL math.ST 版本更新

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

AI4SLT: 基于 Lean 4 的形式化统计学习理论实证过程

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文首次在 Lean 4 中完整形式化统计学习理论,基于实证过程理论,通过人机协作工作流构建了可验证的定理证明工具箱,并揭示了教材中的隐含假设。

详情
Comments
Accepted by ICML 2026
AI中文摘要

我们提出了首个基于实证过程理论的统计学习理论(SLT)在 Lean 4 中的全面形式化。我们的端到端形式化基础设施填补了最新 Lean 库中缺失的内容,包括高斯 Lipschitz 集中的完整推导、次高斯过程的 Dudley 熵积分定理,以及具有尖锐速率的(稀疏)最小二乘回归应用。该项目采用人机协作工作流,其中人类设计证明策略,AI 代理执行战术性证明构建,从而产生了经过人工验证的 SLT 的 Lean 4 工具箱。除了实现之外,形式化过程暴露并解决了标准 SLT 教材中的隐含假设和缺失细节,强制对理论进行逐行细粒度理解。这项工作建立了一个可重用的形式化基础,并为机器学习理论的未来发展打开了大门。代码可从该 https URL 获取。

英文摘要

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements content missing from the latest Lean library, including a complete development of Gaussian Lipschitz concentration, Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, yielding a human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at this https URL.
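
For orientation, one common textbook form of Dudley's entropy integral bound is given below. The exact statement, hypotheses, and constants formalized in the project may differ; the symbols $K$, $C$, and $N(T, d, \varepsilon)$ are the usual textbook notation and are not taken from the abstract.

```latex
% Assumption: (X_t)_{t \in T} is a centered, separable process on a metric space (T, d)
% with sub-Gaussian increments: \|X_t - X_s\|_{\psi_2} \le K \, d(t, s) for all s, t \in T.
\[
  \mathbb{E} \sup_{t \in T} X_t
  \;\le\;
  C K \int_{0}^{\infty} \sqrt{\log N(T, d, \varepsilon)} \, d\varepsilon ,
\]
% where N(T, d, \varepsilon) is the \varepsilon-covering number of T
% and C is an absolute constant.
```

Statements of this kind are where a formalization has to settle details such as separability or measurability of the supremum, which informal treatments often leave implicit.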

2602.02229 2026-06-11 cs.LG eess.SP 版本更新

Prediction-Powered Risk Monitoring of Deployed Models for Detecting Harmful Distribution Shifts

预测驱动的已部署模型风险监控:检测有害分布漂移

Guangyi Zhang, Yunlong Cai, Guanding Yu, Osvaldo Simeone

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出预测驱动风险监控(PPRM),一种基于预测驱动推断的半监督方法,通过结合合成标签与少量真实标签构建运行风险的随时有效下界,实现对有害漂移的检测,并在图像分类、大语言模型和电信监控任务中验证有效性。

详情
Comments
Accepted by ICML 2026
AI中文摘要

我们研究了在动态环境中模型性能监控的问题,其中标记数据有限。为此,我们提出了预测驱动风险监控(PPRM),一种基于预测驱动推断(PPI)的半监督风险监控方法。PPRM通过结合合成标签与少量真实标签,构建运行风险的随时有效下界。通过基于阈值的比较与名义风险的上界,检测有害漂移,满足无假设的有限样本I型误差保证。我们通过在图像分类、大语言模型(LLM)和电信监控任务上的大量实验,证明了PPRM的有效性。

英文摘要

We study the problem of monitoring model performance in dynamic environments where labeled data are limited. To this end, we propose prediction-powered risk monitoring (PPRM), a semi-supervised risk-monitoring approach based on prediction-powered inference (PPI). PPRM constructs anytime-valid lower bounds on the running risk by combining synthetic labels with a small set of true labels. Harmful shifts are detected via a threshold-based comparison with an upper bound on the nominal risk, satisfying assumption-free finite-sample guarantees on the type-I error. We demonstrate the effectiveness of PPRM through extensive experiments on image classification, large language model (LLM), and telecommunications monitoring tasks.
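
The way synthetic labels and a few true labels are combined can be illustrated with the basic prediction-powered point estimate of a mean risk. This is a hedged sketch only: the function name and argument layout are hypothetical, and the anytime-valid lower bound and the threshold test that PPRM builds on top of such an estimate are not reproduced here.

```python
def prediction_powered_risk(unlabeled_synth_losses,
                            labeled_true_losses,
                            labeled_synth_losses):
    """Prediction-powered estimate of the mean risk.

    unlabeled_synth_losses: losses computed against synthetic labels on unlabeled data
    labeled_true_losses:    losses computed against true labels on the small labeled set
    labeled_synth_losses:   losses computed against synthetic labels on the same labeled set
    """
    synthetic_mean = sum(unlabeled_synth_losses) / len(unlabeled_synth_losses)
    # The rectifier measures, on labeled data, how far synthetic-label losses
    # are from true-label losses, and corrects the synthetic mean accordingly.
    rectifier = sum(t - s for t, s in
                    zip(labeled_true_losses, labeled_synth_losses)) / len(labeled_true_losses)
    return synthetic_mean + rectifier
```

When the synthetic labels are accurate, the rectifier is close to zero and the estimate relies mostly on the large unlabeled pool; when they are biased, the labeled set removes that bias.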

2601.12164 2026-06-11 cs.CY cs.CL 版本更新

The Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political Documents

提问的语言:语言条件对LLM分析争议性政治文件时的意识形态分歧的影响

Oleg Smirnov

发表机构 * Microsoft(微软)

AI总结 研究通过俄语和乌克兰语语义等价提示,发现ChatGPT和Claude Opus在分析同一乌克兰公民社会文件时,输出出现系统性意识形态分歧,且分歧程度因模型而异。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为跨多语言语境的分析工具,但其输出可能带有由提示语言条件引起的系统性偏差。本研究对LLM生成的乌克兰公民社会文件政治分析进行了实验比较,使用俄语和乌克兰语的语义等价提示,分别对来自不同开发者的两个前沿模型——ChatGPT 5.2和Claude Opus 4.5进行测试。尽管源材料相同且查询结构平行,两个模型沿同一轴线出现分歧:俄语输出倾向于去合法化框架,将公民社会行为者描述为限制民主授权的外部资助精英,而乌克兰语输出则将同一行为者视为民主竞争中的合法利益相关者。然而,这种分歧的程度因模型而异。ChatGPT的俄语输出再现了俄罗斯国家话语的特征词汇;Claude Opus的输出则保持在主流批评语境内,并在两种语言中对其判断进行限定。这些发现表明,仅提示语言就能系统性地改变分析相同内容的同一模型的意识形态取向。这种转变是多语言LLM的一个普遍属性,其严重程度及其与宣传叙事的对齐程度因系统而异。这些影响涉及AI在极化信息环境中的部署、跨语言研究以及多语言社会中的AI治理。

英文摘要

Large language models (LLMs) are increasingly deployed as analytical tools across multilingual contexts, yet their outputs may carry systematic biases conditioned by the language of the prompt. This study presents an experimental comparison of LLM-generated political analyses of a Ukrainian civil society document, using semantically equivalent prompts in Russian and Ukrainian administered to two frontier models from different developers, ChatGPT 5.2 and Claude Opus 4.5. Despite identical source material and parallel query structures, both models diverged along the same axis: Russian-language outputs leaned toward delegitimizing framings, characterizing civil society actors as externally funded elites constraining a democratic mandate, while Ukrainian-language outputs treated the same actors as legitimate stakeholders in democratic contestation. The magnitude of this divergence, however, was model-dependent. ChatGPT's Russian output reproduced vocabulary characteristic of Russian state discourse; Claude Opus's stayed in a mainstream critical idiom and hedged its judgments in both languages. These findings demonstrate that prompt language alone can systematically shift the ideological orientation of an unchanged model analyzing identical content. The shift is a general property of multilingual LLMs whose severity, and whose alignment with propaganda narratives, varies across systems. The implications reach AI deployment in polarized information environments, cross-lingual research, and AI governance in multilingual societies.

2602.00945 2026-06-11 cs.CL cs.AI 版本更新

Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Neural FOXP2——面向大型语言模型目标语言改进的语言特定神经元引导

Anusa Saha, Tanmay Joshi, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Meta, USA(Meta, 美国) Apple, USA(Apple, 美国) Pragya Lab, BITS Pilani Goa, India(Pragya实验室,BITS Pilani Goa,印度)

AI总结 提出Neural FOXP2方法,通过定位语言神经元、计算引导方向和施加稀疏激活偏移,将模型默认语言从英语切换为印地语或西班牙语,实现可控的语言主导性。

详情
AI中文摘要

LLMs通过训练成为多语言模型,但其通用语言通常是英语,反映了英语在预训练中的主导地位。其他语言保留在参数记忆中,但被系统性抑制。我们认为语言默认性由稀疏、低秩的控制电路(语言神经元)支配,可以机械地隔离并安全引导。我们引入Neural FOXP2,通过引导语言特定神经元,使模型以选定语言(印地语或西班牙语)为主。Neural FOXP2分三个阶段进行:(i) 定位:我们训练每层的SAE,使每个激活分解为一小组活跃特征组件。对于每个特征,我们量化英语与印地语/西班牙语的选择性,基于整体logit质量向目标语言令牌集的提升。将排名靠前的特征追溯回其最强贡献单元,得到紧凑的语言神经元集。(ii) 引导方向:我们通过谱低秩分析定位可控的语言转换几何。对于每层,我们构建英语到目标激活差异矩阵,并执行逐层SVD以提取主导语言变化的奇异方向。特征间隙和有效秩谱识别出紧凑的引导子空间和经验选择的干预窗口(这些方向最强且最稳定)。(iii) 引导:我们对语言神经元应用有符号的稀疏激活偏移。具体地,在低到中层,我们沿目标语言主导方向添加正向引导,并对英语神经元在零空间施加补偿性负偏移,实现可控的目标语言默认性。

英文摘要

LLMs are multilingual by training, yet their lingua franca is often English, reflecting the dominance of English in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, which makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so that each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity by the overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English-to-target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.

2602.00560 2026-06-11 cs.SD eess.AS 版本更新

Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

编辑内容,保留声学:基于自一致性奖励的不可感知文本语音编辑

Yong Ren, Jiangyan Yi, Jianhua Tao, Tao Wang, Le Xu, Zhengqi Wen

发表机构 * The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Department of Automation, Tsinghua University(清华大学自动化系) BNRist, Tsinghua University(清华大学BNRist)

AI总结 提出一种在稳定语义空间中编辑内容、通过流匹配解码器保持声学连续性的框架,并利用自一致性奖励组相对策略优化实现不可感知的文本语音编辑。

详情
Comments
Accepted by Interspeech 2026
AI中文摘要

不可感知的基于文本的语音编辑通过转录操作修改口语内容,同时保持声学连续性。先前的声学空间方法存在内容-风格纠缠,导致生成不稳定和边界伪影。我们引入了一个以“编辑内容,保留声学”原则为指导的框架。编辑在稳定的语义空间中进行,而声学实现由流匹配解码器处理。为了确保感知一致性,我们提出了自一致性奖励组相对策略优化,该优化利用预训练的文本到语音模型作为隐式评判器,并结合可理解性和持续时间约束。实验表明,在可理解性、鲁棒性和感知质量方面,该方法持续优于最先进的自回归和非自回归基线。

英文摘要

Imperceptible text-based speech editing modifies spoken content through transcript manipulation while preserving acoustic continuity. Prior acoustic-space approaches suffer from content-style entanglement, causing unstable generation and boundary artifacts. We introduce a framework guided by the principle of "Edit Content, Preserve Acoustics". Editing is conducted in a stable semantic space, while acoustic realization is handled by a Flow Matching decoder. To ensure perceptual consistency, we propose Self-Consistency Rewards Group Relative Policy Optimization, which leverages a pre-trained Text-to-Speech model as an implicit critic, together with intelligibility and duration constraints. Experiments demonstrate consistent improvements over state-of-the-art autoregressive and non-autoregressive baselines in intelligibility, robustness, and perceptual quality.
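
The "group relative" part of the optimization named in this abstract can be sketched as follows: several candidate outputs are scored for the same input, and each reward is normalised against its own group. This is a rough illustration under assumptions, not the authors' training code. The function names, the linear combination of reward terms, and the weights `w_sc`, `w_int`, `w_dur` are all hypothetical.

```python
import math

def combined_reward(self_consistency, intelligibility, duration_penalty,
                    w_sc=1.0, w_int=1.0, w_dur=1.0):
    """Hypothetical scalar reward: higher consistency and intelligibility are
    rewarded, deviation from the expected duration is penalised."""
    return w_sc * self_consistency + w_int * intelligibility - w_dur * duration_penalty

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are computed relative to the group, no separate learned value function is needed; a candidate is reinforced only if it scores better than its siblings.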

2602.00424 2026-06-11 cs.LG cond-mat.mtrl-sci 版本更新

Open Materials Generation with Inference-Time Reinforcement Learning

基于推理时间强化学习的开放材料生成

Philipp Hoellmer, Stefano Martiniani

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出OMatG-IRL框架,通过策略梯度强化学习直接作用于学习的速度场,无需显式计算得分,实现晶体结构预测中的能量目标强化,采样效率提升一个数量级。

详情
Comments
25 pages, 12 figures, 6 tables
AI中文摘要

晶体材料的连续时间生成模型通过学习预测稳定晶体结构实现逆向材料设计,但将显式目标属性纳入生成过程仍然具有挑战性。策略梯度强化学习(RL)为生成模型与下游目标对齐提供了原则性机制,但通常需要访问得分,这阻碍了其应用于仅学习速度场的基于流的模型。我们提出了一种推理时间强化学习的开放材料生成(OMatG-IRL)框架,这是一种直接作用于学习的速度场的策略梯度RL框架,无需显式计算得分。OMatG-IRL利用底层生成动力学的随机扰动,保持预训练生成模型的基线性能,同时在推理时实现探索和策略梯度估计。通过OMatG-IRL,我们首次将RL应用于晶体结构预测(CSP)。我们的方法能够有效强化基于能量的目标,同时通过成分条件保持多样性,并且取得了与基于得分的RL方法竞争的性能。最后,我们展示了OMatG-IRL可以学习时间相关的速度退火调度,实现精确的CSP,采样效率提高一个数量级,相应地生成时间减少。OMatG-IRL代码包含在开放材料生成(OMatG)框架的新版本中,可从该https URL获取。

英文摘要

Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics, preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and a corresponding reduction in generation time. The OMatG-IRL code is included in a new release of the Open Materials Generation (OMatG) framework available at this https URL.

2601.23278 2026-06-11 cs.LG cs.AR cs.CL 版本更新

FOCUS: DLLMs Know How to Tame Their Compute Bound

FOCUS: DLLMs 知道如何驯服它们的计算瓶颈

Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Toronto(多伦多大学)

AI总结 针对扩散大语言模型解码中大部分计算浪费在不可解码令牌上的问题,提出 FOCUS 推理系统,通过动态聚焦可解码令牌并驱逐不可解码令牌,提升有效批大小,实现高达 3.52 倍的吞吐量提升。

详情
Comments
ICML 2026 camera-ready version
AI中文摘要

扩散大语言模型(DLLMs)为自回归模型提供了一种引人注目的替代方案,但其部署受到高解码成本的制约。在这项工作中,我们识别出 DLLM 解码中的一个关键低效问题:虽然计算在令牌块上并行化,但每个扩散步骤中只有一小部分令牌是可解码的,导致大部分计算浪费在不可解码的令牌上。我们进一步观察到注意力导出的令牌重要性与逐令牌解码概率之间存在强相关性。基于这一洞察,我们提出了 FOCUS,一个专为 DLLMs 设计的推理系统。通过动态地将计算聚焦于可解码令牌并实时驱逐不可解码令牌,FOCUS 增加了有效批大小,缓解了计算限制并实现了可扩展的吞吐量。实验评估表明,在大批量设置下,FOCUS 相比生产级引擎 LMDeploy 实现了高达 3.52 倍的吞吐量提升,同时在多个基准测试中保持或提升了生成质量。

英文摘要

Diffusion Large Language Models (DLLMs) offer a compelling alternative to autoregressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52× throughput improvement over the production-grade engine LMDeploy in large-batch settings, while preserving or improving generation quality across multiple benchmarks.
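
The idea of spending compute only on tokens that are likely to be decodable can be pictured with a toy selection step. This is a loose sketch, not the FOCUS system: the function names, the fixed threshold, and the use of one score per position as a stand-in for attention-derived importance are assumptions made for illustration.

```python
def select_decodable(scores, threshold=0.9):
    """Positions whose score clears the threshold; fall back to the single
    best position so every step makes progress."""
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    if not keep:
        keep = [max(range(len(scores)), key=lambda i: scores[i])]
    return keep

def pack_step(requests, threshold=0.9):
    """Flatten the kept (request_id, position) pairs of all requests into one
    compute batch; everything else is evicted for this step."""
    batch = []
    for request_id, scores in requests.items():
        batch.extend((request_id, pos)
                     for pos in select_decodable(scores, threshold))
    return batch
```

Since each request contributes only a few positions per step, more requests fit into the same compute budget, which is the sense in which the effective batch size grows.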