arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24515 2026-05-26 cs.LG

Lake Detection and Water Quality Estimation in Sentinel-2 Data

Iulia Pleşu, Alexandra Băicoianu, Ioana Cristina Plajer

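AI summary: This paper compares three machine learning architectures for water body identification and monitoring, and proposes meaningful color schemes for water quality indices to improve interpretability and decision support.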

Abstract

With climate change and increasing human pressure on natural landscapes, inland water resources are becoming progressively scarcer, more vulnerable, and more difficult to manage sustainably. Reliable and automated methods for detecting, monitoring, and assessing surface water bodies are therefore of growing scientific and practical importance. In this paper, we investigate and compare three distinct machine learning architectures for water body identification and monitoring. Their performance is evaluated through quantitative metrics and real-world examples. Furthermore, a direct comparison with classical NDWI thresholding is conducted on a representative test image to highlight differences between data-driven and index-based approaches. This analysis allows us to identify the best-performing model in terms of accuracy, robustness, and practical applicability. Beyond detection, a major challenge for meaningful water quality assessment lies in the consistent and interpretable visualization of spectral water indices. Standard color mapping techniques are often inadequate or potentially misleading for environmental applications. To address this gap, we propose a suite of meaningful color schemes adapted for water quality indices, facilitating clearer interpretation, comparison, and decision-making for human users.
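
For readers unfamiliar with the index-based baseline named in the abstract, the following is a minimal sketch of NDWI thresholding, not the authors' pipeline. It assumes the McFeeters definition NDWI = (Green - NIR) / (Green + NIR), which on Sentinel-2 is commonly computed from bands B3 and B8, and a threshold of 0, which the abstract does not specify. The function name and the toy reflectance values are hypothetical.

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.0):
    """Return (ndwi, mask) for two reflectance arrays of equal shape."""
    green = np.asarray(green, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    denom = green + nir
    # Leave NDWI at 0 where both bands are zero (e.g. no-data pixels).
    ndwi = np.divide(green - nir, denom, out=np.zeros_like(denom), where=denom != 0)
    # Assumed convention: pixels above the threshold are labelled water.
    return ndwi, ndwi > threshold

# Toy 2x2 scene: left column water-like (green > NIR), right column land-like.
green = np.array([[0.08, 0.05], [0.10, 0.04]])
nir = np.array([[0.02, 0.30], [0.03, 0.25]])
ndwi, mask = ndwi_water_mask(green, nir)
```

A fixed threshold like this is what a data-driven model is compared against; in practice the threshold usually needs tuning per scene.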

2605.24513 2026-05-26 cs.LG

Zeroth-Order Nonconvex Nonsmooth Optimization with Heavy-Tailed Noise

Zhuanghua Liu, Luo Luo

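AI summary: For nonconvex nonsmooth problems with a Lipschitz continuous objective, the paper proposes a stochastic zeroth-order algorithm that refines the online-to-nonconvex conversion framework by clipping the two-point gradient estimator. Under heavy-tailed noise it finds a $(δ, ε)$-Goldstein stationary point with zeroth-order complexity ${\mathcal O}(d^{\frac{p}{2(p-1)}}δ^{-1}ε^{-\frac{2p-1}{p-1}})$, matching the best-known results.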

Abstract

This paper considers the nonconvex nonsmooth problem in which the objective function is Lipschitz continuous. We focus on the stochastic setting where the algorithm can access stochastic function value evaluations with heavy-tailed noise, which is prevalent in many popular machine learning applications. We propose a stochastic zeroth-order algorithm that refines the framework of online-to-nonconvex conversion by clipping the two-point gradient estimator. The theoretical analysis shows that our algorithm can find a $(δ, ε)$-Goldstein stationary point with zeroth-order oracle complexity of ${\mathcal O}(d^{\frac{p}{2(p-1)}}δ^{-1}ε^{-\frac{2p-1}{p-1}})$, where $d$ is the problem dimension and $p\in(1,2]$ is the order of bounded moments. Note that our dependence on dimension $d$ matches the best-known results of stochastic zeroth-order optimization for finding the sub-optimal solution of a stochastic convex nonsmooth problem. In addition, our dependence on accuracy parameters $δ$ and $ε$ is consistent with that of the best-known stochastic first-order algorithms for stochastic nonconvex nonsmooth problems. Finally, we conduct numerical experiments to demonstrate the effectiveness of the proposed method.
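
The clipped two-point gradient estimator mentioned above can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's algorithm: it uses the standard two-point estimator with a direction drawn uniformly from the unit sphere, followed by norm clipping. The function names, the clipping level, and the Student-t noise model are hypothetical choices, and each function call here draws independent noise.

```python
import numpy as np

def clipped_two_point_grad(f, x, delta, clip, rng):
    """Two-point zeroth-order gradient estimate with norm clipping."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform random direction on the unit sphere
    # Finite difference along u, scaled by d / (2 * delta).
    g = (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
    norm = np.linalg.norm(g)
    if norm > clip:
        g *= clip / norm  # shrink heavy-tailed outliers onto the clipping ball
    return g

rng = np.random.default_rng(0)

def noisy_l1(x):
    # Nonsmooth Lipschitz objective plus Student-t noise; with 1.5 degrees
    # of freedom the noise has infinite variance (heavy tails).
    return np.abs(x).sum() + rng.standard_t(1.5)

x0 = np.ones(5)
g = clipped_two_point_grad(noisy_l1, x0, delta=0.1, clip=10.0, rng=rng)
```

Clipping bounds the norm of every estimate, which is what keeps a single extreme noise draw from dominating an update.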

2605.24509 2026-05-26 cs.CV cs.AI cs.GR cs.LG

Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation

Ofir Abramovich, Nadav Z. Cohen, Adi Rosenthal, Ariel Shamir

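AI summary: Proposes a training-free method for motion-conditioned video generation that injects low-frequency phase information from a reference video into the diffusion noise latents, without modifying the model architecture or inference pipeline.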

Comments Under Review; 26 pages, 21 figures

Abstract

Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Using several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.
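
One way to picture "injecting low-frequency phase into noise latents" is the Fourier-domain sketch below. It is only a guess at the general idea, not the authors' procedure: it assumes a 3D FFT over (frames, height, width), a radial cutoff in cycles per sample, and that the noise magnitude spectrum is kept unchanged. The function name, the cutoff value, and the random stand-in for the reference latents are all hypothetical.

```python
import numpy as np

def inject_low_freq_phase(noise, reference, cutoff=0.1):
    """Replace the low-frequency phase of `noise` with that of `reference`."""
    assert noise.shape == reference.shape
    noise_f = np.fft.fftn(noise)
    ref_f = np.fft.fftn(reference)
    # Radial frequency (cycles per sample) of every FFT bin.
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in noise.shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    low = radius <= cutoff
    # Keep the noise magnitude everywhere; use the reference phase at low frequencies.
    phase = np.where(low, np.angle(ref_f), np.angle(noise_f))
    mixed = np.abs(noise_f) * np.exp(1j * phase)
    # The mask is symmetric in frequency, so the result is real up to rounding.
    return np.fft.ifftn(mixed).real

rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 16, 16))      # hypothetical (frames, height, width) latent
reference = rng.standard_normal((8, 16, 16))  # stand-in for encoded reference video
conditioned = inject_low_freq_phase(noise, reference, cutoff=0.1)
```

Because only phases change, the per-frequency power of the noise is preserved, which is one plausible reason such an edit could leave the sampler's input statistics largely intact.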

2605.24508 2026-05-26 cs.CV

FDDet: Achieving Data-Efficient Food Defect Detection Under Real-World Scenarios

Ruihao Xu, Yong Liu, Yansong Tang

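AI summary: To address data scarcity and the lack of a unified benchmark in food defect detection, the paper introduces the FDD-48 dataset with 48 defect categories and designs FDDet, a semi-supervised framework that uses BBoxMixUp data augmentation and CGPC pseudo-label calibration to significantly outperform mainstream detectors in data-limited scenarios.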

Abstract

Food defect detection is critical for automated quality control, yet existing studies lack unified benchmarks and suffer from data scarcity. We introduce FDD-48, a comprehensive dataset with fine-grained annotations across 13 food types and 48 defect categories under diverse real-world conditions. To improve detection with limited labeled data, we propose FDDet, a semi-supervised framework featuring two key components: (1) BBoxMixUp, a data augmentation technique that mixes same-category defect regions to reduce spurious feature associations, and (2) CGPC (Consistency-Guided Pseudo-Label Calibration), which filters pseudo-labels based on intra-sample consistency. Experiments show FDDet significantly outperforms mainstream detectors on FDD-48, demonstrating its effectiveness for food defect detection under data-limited scenarios.

2605.24503 2026-05-26 cs.CV cs.AI

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

Ruihao Xu, Xingming Shui, Jingxuan Niu, Yiqin Wang, Jilin Yu, Haoji Zhang, Yansong Tang

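AI summary: To address the lack of rule-driven explainability in existing video anomaly detection, the paper proposes the FoodMonitor benchmark, with dual-channel violation annotations and a two-stage matching evaluation protocol, and shows that current multimodal large language models are bottlenecked by spatial localization and fine-grained rule understanding.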

Abstract

As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verifiable evidence and traceable accountability signals is essential. However, existing video anomaly detection datasets focus on event-level binary classification, lacking the rule-driven, explainable analysis required for real-world compliance scenarios. We introduce FoodMonitor, a benchmark for explainable compliance analysis in commercial kitchen surveillance. FoodMonitor comprises 477 video clips with 3,307 violation annotations across a dual-channel design covering both person-level and environment-level violations. Each annotation specifies which rule was violated, what non-compliant behavior occurred, and who committed it with frame-level bounding boxes. We establish a unified evaluation protocol with a two-stage matching mechanism that separately assesses spatial localization and semantic understanding, along with a composite metric ($C_{\text{score}}$) that balances environment and person detection performance. Systematic evaluation of several state-of-the-art multimodal large language models reveals that the best-performing model achieves only 0.360 $C_{\text{score}}$, with spatial localization and fine-grained rule understanding emerging as the primary bottlenecks. Our analysis identifies two distinct failure modes: localization-dominated errors and semantics-dominated errors, providing diagnostic insights for future model development.

2605.24497 2026-05-26 cs.AI

Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs

Jianan Li, Simeng Qin, Xiaojun Jia, Lionel Z. Wang, Tianhang Zheng, Xiaoshuang Jia, Yang Liu, Xiaochun Cao

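AI summary: Proposes AE-CoT, an adaptive evolutionary chain-of-thought jailbreak framework that rewrites harmful goals through teacher role-play, decomposes them into reasoning fragments, and runs a multi-generation evolutionary search with adaptive mutation-rate control to generate highly destructive jailbreak prompts, outperforming existing methods across multiple models and datasets.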

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in real-world applications. However, their explicit chain-of-thought (CoT) mechanism introduces new security risks, making them particularly vulnerable to jailbreak attacks. Existing approaches often rely on static CoT templates to elicit harmful outputs, but such fixed designs suffer from limited diversity, adaptability, and effectiveness. To overcome these limitations, we propose an adaptive evolutionary CoT jailbreak framework, called AE-CoT. Specifically, the method first rewrites harmful goals into mild prompts with teacher role-play and decomposes them into semantically coherent reasoning fragments to construct a pool of CoT jailbreak candidates. Then, within a structured representation space, we perform multi-generation evolutionary search, where candidate diversity is expanded through fragment-level crossover and a mutation strategy with an adaptive mutation-rate control mechanism. An independent scoring model provides graded harmfulness evaluations, and high-scoring candidates are further enhanced with a harmful CoT template to induce more destructive generations. Extensive experiments across multiple models and datasets demonstrate the effectiveness of the proposed AE-CoT, consistently outperforming state-of-the-art jailbreak methods.

2605.24495 2026-05-26 cs.RO

Elevator-LIO: Robust LiDAR-Inertial Odometry for Multi-Floor Navigation under Elevator-Induced Non-Inertial Motion

Yifan Zhang, Yudong Huang, Yuchong Zhang, Changze Li, Haoran Liu, Ming Yang, Tong Qin

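AI summary: Proposes the Elevator-LIO framework, which achieves continuous localization inside elevators through a decoupled state-estimation model and a mode-dependent iterated error-state Kalman filter, and combines adaptive voxel downsampling with event-triggered updates that suppress vertical drift.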

Comments 16 pages, 10 figures, 5 tables

Abstract

This paper presents Elevator-LIO, a LiDAR-inertial odometry framework designed to achieve continuous robot localization during elevator travel, thereby supporting cross-floor robotic tasks. To address the state-estimation problem in non-inertial frames, Elevator-LIO establishes a decoupled state-estimation model that separately models the robot motion relative to the elevator and the elevator motion itself, and embeds it into a mode-dependent iterated error-state Kalman filter framework. This framework degenerates to conventional LIO estimation in ordinary indoor environments, while enabling the propagation and constrained update of elevator-related states in elevator non-inertial environments, thereby achieving continuous and stable localization. An elevator mode manager detects elevator entry and exit events using LiDAR ranging statistics and estimated states, and introduces event-triggered zero-velocity and zero-acceleration updates when the elevator stops to suppress accumulated vertical drift. In addition, this paper adopts an adaptive voxel downsampling strategy to maintain a stable number of effective points under significant environmental scale changes. We conduct extensive experiments on 20 real-world sequences containing 79 elevator rides, including practical challenges such as large-scale spaces, long vertical travel, dynamic pedestrian interference, and mirror reflections. The results show that Elevator-LIO maintains continuous localization accuracy in all sequences, with terminal height error below 1 cm in 17 sequences. In contrast, existing representative localization systems perform poorly on these elevator sequences. Tests on the Hilti 2022/2023 datasets further show that the proposed method remains competitive in standard indoor scenarios. The project page is available at https://xiaofan4122.github.io/Elevator_LIO_Page/.
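
The event-triggered zero-velocity update described above can be illustrated with a textbook Kalman measurement update. The sketch below is a deliberately reduced example, not the paper's filter: it assumes a linear two-component state (height and vertical velocity) instead of the full iterated error-state formulation, and the function name, state layout, and noise variance are hypothetical.

```python
import numpy as np

def zero_velocity_update(x, P, meas_var=1e-4):
    """Kalman update with the pseudo-measurement 'vertical velocity = 0'.

    x: state [height, vertical_velocity]; P: 2x2 covariance.
    """
    H = np.array([[0.0, 1.0]])            # observe vertical velocity only
    R = np.array([[meas_var]])            # assumed pseudo-measurement noise
    innovation = np.array([0.0]) - H @ x  # measured zero minus predicted velocity
    S = H @ P @ H.T + R                   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain, shape (2, 1)
    x_new = x + K @ innovation
    P_new = (np.eye(2) - K @ H) @ P
    return x_new, P_new

# Elevator has stopped, but the estimate still carries a small residual velocity.
x = np.array([12.0, 0.05])
P = np.array([[0.04, 0.01],
              [0.01, 0.02]])
x_new, P_new = zero_velocity_update(x, P)
```

Because height and velocity are correlated in `P`, pinning the velocity to zero also corrects the height estimate, which is how such updates limit accumulated vertical drift.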

2605.24492 2026-05-26 cs.CV

Med-R2: An Adversarial Benchmark for Evidence-Grounded Reasoning in Medical VLMs

Wen Ma, Fucheng Niu, Zhiting Fan, Zikai Xiao, Jiaxiang Liu, Zuozhu Liu

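AI summary: Proposes Med-R2 Bench, a hierarchical adversarial benchmark that uses stepwise QA tasks and adversarial perturbations to evaluate how robustly medical VLMs ground their reasoning in visual evidence across the clinical workflow.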

Abstract

Vision-language models have demonstrated impressive capabilities in general medical visual question answering, yet due to limited interpretability, it remains unclear whether their predictions reflect evidence-grounded clinical reasoning or reliance on spurious priors. We introduce Med-R2 Bench, a hierarchical benchmark aligned with the clinical workflow to evaluate adversarial robustness with visual grounding. We design stepwise QA tasks to assess whether reasoning chains are strictly grounded in visual evidence across the four clinical stages, and employ adversarial perturbations to test robustness against misleading cues. Med-R2 comprises 42,432 images, 31 task categories, and 110,406 QA pairs. Evaluation across 14 VLMs reveals a sequential performance degradation along the four-stage clinical workflow. Adversarial experiments show that models rely heavily on correct prompts to guess answers. Even when provided with explicit visual cues, the models struggle to accurately align textual descriptions. Finally, we demonstrate stepwise fine-tuning using our hierarchical data significantly improves reasoning robustness, highlighting its potential to drive future improvements in evidence-based medical AI.

2605.24490 2026-05-26 cs.AI cs.LG q-fin.PM

Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

Yunhua Pei, Zerui Ge, Jin Zheng, John Cartlidge

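AI summary: Proposes Market Regime Council (MRC), a multi-agent decision system that uses Shapley values for online agent weighting, together with a Bayesian adaptive mixture and regime-dependent multipliers, and achieves a high Sharpe ratio and cumulative return in cryptocurrency investment.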

Comments 35 pages, 13 figures, preprint

Abstract

Multi-agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vulnerable to cold-start dominance under regime shifts, and offer limited transparency into how final allocations are formed. We propose Market Regime Council (MRC), a cooperative multi-agent decision system that computes exact Shapley credits across all single, pairwise, and Grand-coalition outputs for online agent weighting. Instantiated with N=3 specialist agents, at each trading period, MRC recomputes coalition-based Shapley weights from exponentially weighted performance histories, uses a Bayesian adaptive mixture to stabilize early periods, applies regime-dependent multipliers to adjust agent authority, and records each rebalance through a five-layer causal trace. Over 1,037 trading days across 13 crypto assets and five seeds, MRC achieves a Sharpe ratio of 1.51 and a cumulative return of 440.1%, ranking first on CR, SR, and IR among active baselines and attaining the lowest MDD among active methods. Ablation results show that the gains come from Shapley-weighted integration across coalition outputs rather than from any single stage in isolation. Code and demo data are included in the supplementary material.
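
With N=3 agents there are only eight coalitions, so exact Shapley values can be enumerated directly. The sketch below shows the standard Shapley formula on made-up coalition scores; the agent names, the score table, and the final normalization into weights are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, value):
    """Exact Shapley values; `value` maps frozenset coalitions to a score."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Probability that exactly the coalition s precedes p in a random ordering.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value[s | {p}] - value[s])
    return phi

players = ["agent_a", "agent_b", "agent_c"]  # hypothetical specialist agents
value = {
    frozenset(): 0.0,
    frozenset({"agent_a"}): 1.0,
    frozenset({"agent_b"}): 2.0,
    frozenset({"agent_c"}): 3.0,
    frozenset({"agent_a", "agent_b"}): 4.0,
    frozenset({"agent_a", "agent_c"}): 5.0,
    frozenset({"agent_b", "agent_c"}): 6.0,
    frozenset(players): 9.0,
}
phi = exact_shapley(players, value)
total = sum(phi.values())
weights = {p: v / total for p, v in phi.items()}  # one possible normalization
```

The credits sum to the grand-coalition score (the efficiency property), which makes them a natural starting point for per-agent weights.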

2605.24489 2026-05-26 cs.AI q-bio.BM

TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

Yuhang Zhang, Keyan Ding, Peilin Chen, Han Liu, Can Lin, Ruixi Chen, Shiqi Wang, Qi Song

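AI summary: Proposes the TIGER framework, which uses protein-to-text generation models to extract textual semantic knowledge, fuses it with sequence features through a dynamic gating network, and supports bidirectional enzyme-reaction retrieval, significantly improving cross-task generalization and robustness.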

Comments Accepted to ACL2026

Abstract

Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme-to-reaction and reaction-to-enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text-Informed Generalized Enzyme-Reaction Retrieval framework that leverages protein-to-text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text-derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure-Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state-of-the-art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.
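
A gating network that "adaptively fuses" two feature vectors is often implemented as an element-wise sigmoid gate. The sketch below shows that generic pattern under stated assumptions; it is not the paper's Dynamic Gating Network, and the function names, the single linear layer, and the feature dimension are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(seq_feat, text_feat, W, b):
    """Element-wise gate deciding how much text-derived signal to admit.

    seq_feat, text_feat: arrays of shape (d,); W: (2d, d); b: (d,).
    """
    gate = sigmoid(np.concatenate([seq_feat, text_feat], axis=-1) @ W + b)
    # gate near 1 trusts the text feature, gate near 0 falls back to the sequence feature.
    fused = gate * text_feat + (1.0 - gate) * seq_feat
    return fused, gate

d = 4
rng = np.random.default_rng(0)
seq_feat = rng.standard_normal(d)
text_feat = rng.standard_normal(d)
# Untrained parameters: zero weights give a gate of 0.5 in every dimension.
fused, gate = gated_fusion(seq_feat, text_feat, np.zeros((2 * d, d)), np.zeros(d))
```

In a trained model `W` and `b` would be learned, so unreliable generated text could be down-weighted per dimension instead of being averaged in.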

2605.24486 2026-05-26 cs.AI cs.CL

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao, Xiaoxi Li, Zheng Liu, Zhicheng Dou

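AI summary: Proposes the AgentFugue framework, in which a shared reasoning hub lets multiple peer agents explore in parallel and share information selectively, without explicit role specialization or workflow orchestration, improving performance on long-horizon tasks.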

Abstract

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

2605.24484 2026-05-26 cs.AI cs.LG

SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver

Rongsheng Chen, Changliang Zhou, Canhong Yu, Yuanyao Chen, Yu Zhou, Zhuo Chen, Zhenkun Wang

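AI summary: To address the inconsistent performance of existing neural solvers across symmetric and asymmetric vehicle routing problems, the paper proposes SPACE, a spatial pivot-aligned coordinate-free embedding framework that unifies node representation and solution generation through a bidirectional Fréchet representation and a weight-decomposed adaptive decoding mechanism, achieving promising zero-shot generalization on 110 variants.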

Abstract

Generalist neural routing solvers have shown great potential in solving diverse vehicle routing problems (VRPs) with a unified model. However, existing solvers are typically limited to symmetric settings or degrade in performance when switching to asymmetric settings due to input inconsistencies or inherent structural differences, substantially limiting their practicality in real-world scenarios that encompass both settings. To address this limitation, we define the spatial position of each node based on the relative distances to a specific set of pivots and further propose a Spatial Pivot-Aligned Coordinate-free Embedding (SPACE) framework that unifies node representation and solution generation across symmetric and asymmetric VRPs. Specifically, we construct a bidirectional Fréchet representation using a novel furthest pivot sampling strategy to enable invariant node representations across distinct problem settings. Furthermore, we introduce a weight-decomposed adaptive decoding mechanism that decouples geometric perception from problem representations, mitigating the overfitting of constraint decisions to a specific geometry setting. Extensive experiments on 110 VRP variants, comprising 55 symmetric problems and their asymmetric counterparts, demonstrate that SPACE achieves promising zero-shot generalization in both symmetric and asymmetric VRPs.
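
The idea of describing each node by its distances to a set of pivots, rather than by coordinates, can be sketched as follows. This is an illustration in the spirit of a Fréchet-style embedding, not the SPACE implementation: the greedy farthest-point rule, the symmetrization by `max`, and all function names are assumptions made for the example.

```python
import numpy as np

def furthest_pivot_sampling(dist, num_pivots, start=0):
    """Greedy farthest-point selection of pivot indices from a distance matrix."""
    sym = np.maximum(dist, dist.T)  # assumed way to handle asymmetric costs
    pivots = [start]
    min_d = sym[start].copy()       # distance from each node to its nearest pivot
    for _ in range(num_pivots - 1):
        nxt = int(np.argmax(min_d))
        pivots.append(nxt)
        min_d = np.minimum(min_d, sym[nxt])
    return pivots

def bidirectional_pivot_embedding(dist, pivots):
    """Embed node i as [cost(i -> pivots), cost(pivots -> i)]."""
    to_pivots = dist[:, pivots]
    from_pivots = dist[pivots, :].T
    return np.concatenate([to_pivots, from_pivots], axis=1)

# Four nodes on a line; then make one arc asymmetric (0 -> 3 costs more than 3 -> 0).
coords = np.array([0.0, 1.0, 2.0, 10.0])
dist = np.abs(coords[:, None] - coords[None, :])
asym = dist.copy()
asym[0, 3] = 12.0

pivots = furthest_pivot_sampling(asym, num_pivots=3)
emb = bidirectional_pivot_embedding(asym, pivots)
```

For a symmetric matrix the two halves of each embedding coincide, while an asymmetric matrix makes them differ, so the same input format covers both problem types.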

2605.24477 2026-05-26 cs.LG cs.IT math.IT math.ST stat.TH

The Normalized Maximum Likelihood for Regular Non-Smooth Models: Measure-Theoretic Foundations and Geometric Sampling

Trenton Lau, Gary P. T. Choi

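AI summary: To compute the normalized maximum likelihood (NML) codelength for the non-smooth estimators common in modern machine learning (such as Lasso and sparse SVMs), the paper builds a rigorous framework on geometric measure theory and conservative Jacobians, and proposes a geometric MCMC sampler (PDL-PPMH) to compute the stochastic complexity of non-smooth models exactly.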

Abstract

The Normalized Maximum Likelihood (NML) codelength, or stochastic complexity, represents a principled criterion for universal coding. While recent coarea-based formulations provided a calculation method for smooth models, this framework collapses for the non-smooth estimators ubiquitous in modern machine learning (e.g., Lasso, Sparse SVMs). In this work, we provide a rigorous framework for computing the NML for regular path-differentiable Lipschitz (PDL) estimators. By applying classical geometric measure theory and bridging the coarea formula with conservative Jacobians, we prove that the stochastic complexity for non-smooth models is well-posed and theoretically consistent with the outputs of modern Automatic Differentiation. To compute this quantity exactly, we introduce the Propose-and-Project Metropolis-Hastings (PDL-PPMH) sampler, a geometric MCMC algorithm capable of traversing the non-differentiable level sets of the maximum likelihood estimator. We theoretically justify its components, including a stochastic tangent space proposal and a provably convergent non-smooth projection solver. We demonstrate the method's robustness by sampling from a high-dimensional Lasso posterior ($P=2000$), while simultaneously quantifying the computational scaling that governs the trade-off between exactness and mixing time. Crucially, we empirically demonstrate that our exact NML criterion provides a highly data-efficient alternative to cross-validation, achieving statistically indistinguishable predictive optima without requiring data splitting. Altogether, our work paves the way for the theoretical analysis of the NML codelength for regular non-smooth models.
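
For reference, the textbook definition of the NML distribution and the associated stochastic complexity is given below. This is the standard formulation for continuous data, written with generic symbols ($x$ for the observed data, $\hat\theta(\cdot)$ for the maximum likelihood estimator) that are assumptions of this note; the paper's own notation and its treatment of the normalizing integral for non-smooth estimators may differ.

```latex
p_{\mathrm{NML}}(x) = \frac{p\bigl(x \mid \hat\theta(x)\bigr)}{\int p\bigl(y \mid \hat\theta(y)\bigr)\,dy},
\qquad
\mathrm{SC}(x) = -\log p\bigl(x \mid \hat\theta(x)\bigr) + \log \int p\bigl(y \mid \hat\theta(y)\bigr)\,dy .
```

The second term is the model's parametric complexity; evaluating that integral is the part that becomes difficult when $\hat\theta$ is not smooth.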

2605.24475 2026-05-26 cs.CV cs.AI cs.MM

Robust Fuzzy Multi-view Learning under View Conflict

Siyuan Duan, Yuan Sun, Dezhong Peng, Yingke Chen, Xi Peng, Peng Hu

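AI summary: To address view conflict in multi-view classification, the paper proposes Robust Fuzzy Multi-View Learning (R-FUML), a framework based on fuzzy set theory that quantifies category credibility with fuzzy memberships, fuses views using entropy, and penalizes conflicting samples, improving robustness and uncertainty estimation.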

Abstract

Trusted multi-view classification (TMVC) aims to deliver reliable fusion for accurate predictions and has recently attracted substantial attention in both academia and industry. However, existing TMVC methods typically assume strict alignment across different views during both training and testing phases, which is often impractical in real-world scenarios. This limitation motivates us to revisit TMVC and extend it to a more challenging setting: how to mitigate the impact of view conflict (VC) during both training and inference. In this setting, existing TMVC methods suffer from three critical limitations: underestimated uncertainty, misleading decisions, and overfitting to VC. To address these issues, this paper proposes a novel Robust Fuzzy Multi-View Learning (R-FUML) framework grounded in Fuzzy Set Theory. Specifically, R-FUML models network outputs as fuzzy memberships to quantify category credibility and uses an entropy-based method for reliable multi-view fusion. To this end, we present a Robust Multi-view Fusion (RMF) strategy that accounts for both view-specific uncertainty and inter-view conflicts, thereby alleviating the adverse impacts of VC on decision-making. To identify and conquer VC during training, we further design a Robust Learning Against VC (RLVC) framework. RLVC isolates conflicting samples by leveraging neural networks' memory effects and then retrains the model by applying a penalty to these conflicting views. Extensive experiments across eight public datasets demonstrate that R-FUML consistently outperforms 15 state-of-the-art baselines in robustness and uncertainty estimation. The code will be released upon acceptance.

2605.24468 2026-05-26 cs.AI

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Ziliang Zhao, Jiejun Tan, Zheng Liu, Zhicheng Dou

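AI summary: Proposes State-Adaptive Memory (SAM), a framework that uses compact memory cues and raw trajectory pages for intent-driven reconstruction of information, without retraining the base model, and outperforms strong baselines on multiple benchmarks.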

Abstract

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that information needed for the current decision may be scattered across distant steps and only become relevant later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines over diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.

2605.24458 2026-05-26 cs.LG cs.AI

Balancing Fairness, Privacy, and Accuracy: A Multitask Adversarial Framework for Centralized Data-Driven Systems

Imesh Ekanayake, Elham Naghizade, Jeffrey Chan

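AI summary: Proposes a multitask adversarial model that treats fairness and privacy as core objectives and balances them dynamically against accuracy through an optimized cost function, achieving strong fairness and privacy protection with minimal performance loss.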

Comments 13 Pages, 6 figures, IEEE TKDE

详情
AI中文摘要

在集中式数据驱动应用中,公平性和隐私的整合至关重要,尤其是当这些系统日益影响具有重大社会影响的领域时。当前方法很少同时考虑隐私、公平性和准确性,这可能会损害伦理标准和隐私法规。然而,平衡这三个目标相当具有挑战性,因为每个目标通常对模型的设计和训练提出相互冲突的要求,使得优化一个目标而不损害其他目标变得困难。本文提出了一种新颖的多任务对抗模型,将公平性和隐私视为整体目标而非事后考虑,并学习一个隐藏敏感属性同时保留任务相关信息的潜在表示。我们的方法通过优化的代价函数动态平衡公平性与准确性及隐私,即使在严格条件下也能实现最小的性能损失。在多种数据集上的广泛测试表明,我们的模型能够在不大幅牺牲准确性的情况下实现高标准的公平性和隐私。与最先进的隐私和公平标准进行基准测试表明,我们的方法增强了隐私、公平性和准确性优化的鲁棒性,证明了其在不同数据集上的适应性。

英文摘要

The integration of fairness and privacy in centralized data-driven applications is critical, especially as these systems increasingly influence sectors with significant societal impact. Current methods rarely address privacy, fairness, and accuracy together, which can compromise ethical standards and privacy regulations. However, balancing these three objectives is challenging because each objective often imposes conflicting requirements on the design and training of models, making it difficult to optimize one without compromising the others. This paper introduces a novel multitask adversarial model that treats fairness and privacy as integral objectives rather than afterthoughts, and learns a latent representation that hides sensitive attributes while preserving essential task-related information. Our approach dynamically balances fairness with accuracy and privacy through an optimized cost function, with minimal performance loss even under strict conditions. Extensive testing on diverse datasets shows the ability of our model to achieve high standards of fairness and privacy without significant sacrifice to accuracy. Benchmarking against state-of-the-art privacy and fairness standards shows that our method enhances the robustness of privacy, fairness, and accuracy optimization, proving its adaptability across various datasets.

2605.24454 2026-05-26 cs.CL

Decompose-and-Refine: Structured Legal Question Answering with Parametric Retrieval

分解与精炼:基于参数化检索的结构化法律问答

Jihyung Lee, Hyounghun Kim, Gary Lee

AI总结 提出Decompose-and-Refine (DaR)框架,通过逐步分解复杂法律问题为原子子问题并生成与法规对齐的参数化查询,以解决多跳法律问答中的检索准确性和幻觉问题。

详情
AI中文摘要

大型语言模型(LLMs)在法律领域表现出强大性能,在法律问答(LQA)中展现出显著潜力。然而,与通用问答不同,LQA要求答案不仅准确,而且严格基于明确的法律权威。在成文法LQA中,许多问题需要跨多个法律问题的多跳推理,大大增加了幻觉风险,因此准确检索支持性法规条款成为关键前提。尽管多跳问答近期取得进展,现有方法通常依赖自然语言推理或无需显式查询重构的检索,导致用户问题与法规文本之间的词汇差距未得到充分解决。为应对这一挑战,我们提出Decompose-and-Refine(DaR),一种基于法规的LQA框架,它将逐步的问题分解与基于参数化知识的查询精炼紧密结合。DaR逐步将复杂法律问题分解为原子子问题,并为每个子问题生成与法规对齐的参数化查询,从而能够为每个法律问题选择最核心的单一法规条款。我们在基于成文法的韩语多跳LQA基准KoBLEX上,使用Qwen3-32B和Gemma3-27B评估DaR。实验结果表明,DaR在检索准确性和最终答案质量上均持续优于现有方法。此外,通过显式分离子问题及其对应法规条款,DaR促进了复杂法律推理过程的透明、逐问题验证。

英文摘要

Large language models (LLMs) have shown strong performance in the legal domain, demonstrating notable potential in Legal Question Answering (LQA). However, unlike general QA, LQA requires answers that are not only accurate but also rigorously grounded in explicit legal authority. In statutory LQA, many questions require multi-hop reasoning across multiple legal issues, substantially increasing the risk of hallucination, thereby making accurate retrieval of supporting statutory provisions a critical prerequisite. Despite recent progress in multi-hop QA, existing approaches often rely on reasoning in natural language or retrieval without explicit query reformulation, leaving the vocabulary gap between user questions and statutory text largely unaddressed. To address this challenge, we propose Decompose-and-Refine (DaR), a statute-grounded LQA framework that tightly integrates step-wise question decomposition with parametric knowledge-based query refinement. DaR progressively decomposes a complex legal question into atomic sub-questions and generates statute-aligned parametric queries for each sub-question, enabling the selection of a single most central statutory provision corresponding to each legal issue. We evaluate DaR on KoBLEX, a Korean multi-hop LQA benchmark grounded in statutory law, using Qwen3-32B and Gemma3-27B. Experimental results demonstrate that DaR consistently improves both retrieval accuracy and final answer quality over existing approaches. Moreover, by explicitly separating sub-questions and their corresponding statutory provisions, DaR facilitates transparent, issue-level verification of complex legal reasoning processes.

2605.24452 2026-05-26 cs.CL cs.AI cs.LG

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

法律判决预测中的时间概念漂移:跨越乌克兰法院判决三个时期的神经基线

Volodymyr Ovcharov

AI总结 通过微调四种Transformer编码器在乌克兰法院三个时期(战前、混合战争、全面入侵)的判决上,研究法律语言的时间漂移,发现前向性能严重下降(最多27.2个百分点),法律领域预训练不能提升绝对性能但能减轻漂移,时序持续学习可消除灾难性遗忘。

Comments 17 pages, 6 tables, 5 figures. Dataset: https://huggingface.co/datasets/overthelex/ukrainian-court-decisions

详情
AI中文摘要

法律NLP基准测试在随机分割的数据上评估模型,隐含假设法律语言是平稳的。我们通过微调四种Transformer编码器——XLM-RoBERTa(base和large)及其法律领域变体——在地缘政治事件定义的三个时间时期的乌克兰法院判决上测试这一假设:战前(2008-2013)、混合战争(2014-2021)和全面入侵(2022-2026)。每个模型在一个时期上训练,并在所有三个时期上评估,产生一个3x3的跨时间泛化矩阵。四个发现出现。(1)前向退化严重:在战前数据上训练的模型应用于全面入侵时期判决时,宏F1最多下降27.2个百分点。(2)退化不对称:后向迁移(全面入侵到战前)比前向迁移稳健得多,与法律语言是加性的假设一致。(3)法律领域预训练(Legal-XLM-R)不提升绝对性能,但减少前向退化的幅度和不对称性。(4)时序持续学习消除了通用XLM-R的灾难性遗忘:战前知识完全保留(+1.8至+6.2个百分点),而全面入侵性能提升+16.5至+19.0个百分点;逆时序训练导致严重遗忘。跨司法管辖区在瑞士判决预测数据上的预训练提升绝对性能,但不减少时间退化幅度,确认时间漂移是法律语言演化的内在属性。数据集(三个时期共428K判决)作为LEXTREME贡献公开可用。

英文摘要

Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption by fine-tuning four transformer encoders -- XLM-RoBERTa (base and large) and their legal-domain variants -- on Ukrainian court decisions from three temporal epochs defined by geopolitical disruptions: pre-war (2008-2013), hybrid war (2014-2021), and full-scale invasion (2022-2026). Each model is trained on one epoch and evaluated on all three, producing a 3x3 cross-temporal generalization matrix. Four findings emerge. (1) Forward degradation is severe: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion era decisions. (2) The degradation is asymmetric: backward transfer (full-scale to pre-war) is substantially more robust than forward transfer, consistent with the hypothesis that legal language is additive. (3) Legal-domain pretraining (Legal-XLM-R) does not improve absolute performance but reduces forward degradation magnitude and asymmetry. (4) Chronological continual learning eliminates catastrophic forgetting for general XLM-R: pre-war knowledge is fully retained (+1.8 to +6.2 pp) while full-scale performance gains +16.5 to +19.0 pp; reverse-chronological training causes severe forgetting. Cross-jurisdictional pretraining on Swiss Judgment Prediction data improves absolute performance but does not reduce temporal degradation magnitude, confirming that temporal drift is an intrinsic property of legal language evolution. The dataset (428K decisions across three epochs) is publicly available as a LEXTREME contribution.
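
The 3x3 cross-temporal matrix described above can be read as a table of train-epoch rows against evaluation-epoch columns. The following is a minimal sketch of how forward and backward gaps might be derived from such a matrix. The function name `transfer_gaps`, the gap definition (in-epoch score minus cross-epoch score), and the numbers in the usage example are all assumptions made for illustration; they are placeholders, not results or definitions from the paper.

```python
def transfer_gaps(matrix):
    """Split cross-epoch score drops into forward and backward transfer.

    matrix[i][j] is the score of a model trained on epoch i and evaluated
    on epoch j, with epochs in chronological order. The gap for (i, j) is
    the in-epoch score matrix[i][i] minus the cross-epoch score
    matrix[i][j] (an assumed definition, used here only for illustration).
    """
    forward, backward = {}, {}
    n = len(matrix)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the diagonal is the in-epoch reference score
            gap = matrix[i][i] - matrix[i][j]
            if j > i:
                forward[(i, j)] = gap   # trained on the past, tested on the future
            else:
                backward[(i, j)] = gap  # trained on the future, tested on the past
    return forward, backward


# Placeholder macro-F1 values (hypothetical, NOT taken from the paper).
# Epoch order: 0 = pre-war, 1 = hybrid war, 2 = full-scale invasion.
scores = [
    [80, 70, 55],
    [72, 82, 75],
    [78, 80, 85],
]
forward, backward = transfer_gaps(scores)
```

With these placeholder values, `forward[(0, 2)]` is larger than `backward[(2, 0)]`, which is the kind of asymmetry the abstract reports; the real magnitudes have to come from the paper's own tables.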

2605.24451 2026-05-26 cs.CL

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

越南语音中方言变体的语音建模

Quan Ngoc Hoang, Long Hoang Huu Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

AI总结 提出一种方言感知的语音框架,通过结构化语音成分和方言特定IPA映射,在词汇和解码层面显式建模越南语方言变体,在UIT-ViMD数据集上以更少参数和无外部预训练达到与最强预训练模型相当的性能。

详情
AI中文摘要

越南语在北部、中部和南部地区表现出显著的方言语音变体,其中相同的词汇项可能以明显不同的发音实现。这种变体给自动语音识别(ASR)带来了挑战,并且由于越南语正字法与音系之间的复杂关系,在计算上仍然难以建模。现有方法通常在词汇层面处理方言变异性,假设拼写与发音之间的方言不变映射,这限制了它们捕捉系统性语音差异的能力。我们提出了一种方言感知的语音框架,在词汇和解码层面显式建模越南语音系结构和方言变体。该框架引入了一个语音词汇表,将每个音节分解为结构化的语音成分,并将它们映射到方言特定的IPA表示,同时结合一个语音结构解码器联合预测这些成分。在UIT-ViMD(越南语中唯一可用的多方言数据集)上的实验表明,所提出的方法优于各种预训练基线,尤其在使用更少参数且无需外部预训练的情况下,跨方言匹配了最强预训练模型wav2vec2-base-vi-250h的性能。为便于实验复现,代码将在论文被接收后公开。

英文摘要

Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for automatic speech recognition (ASR) and remains difficult to model computationally due to the complex relationship between Vietnamese orthography and phonology. Existing approaches typically address dialect variability at the word level, assuming dialect-invariant mappings between spelling and pronunciation, which limits their ability to capture systematic phonetic differences. We propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. The framework introduces a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations, together with a phonetic-structure decoder that jointly predicts these components. Experiments on UIT-ViMD, the only available multi-dialect dataset for Vietnamese, show that the proposed approach outperforms various pre-trained baselines and, notably, matches the performance of the strongest pretrained model, wav2vec2-base-vi-250h, across dialects while using substantially fewer parameters and no external pretraining. Code for reproducing the experiments will be made publicly available upon acceptance of this paper.

2605.24449 2026-05-26 cs.RO cs.LG

Vision-Guided Outdoor Flight and Obstacle Evasion via Reinforcement Learning

基于强化学习的视觉引导户外飞行与避障

Shiladitya Dutta, Aayush Gupta, Varun Saran, Avideh Zakhor

AI总结 提出一种基于立体视觉深度和视觉惯性里程计的传感器运动策略,通过强化学习和特权学习在仿真中训练,实现零样本迁移到未知户外环境和无人机平台进行自主避障导航。

Comments Published in IEEE Robotics and Automation Letters, vol 11, no 2. Presented at the IEEE International Conference on Robotics and Automation 2026

详情
AI中文摘要

尽管四旋翼飞行器凭借其全向机动性拥有令人印象深刻的穿越能力,但在复杂环境中需要持续的人工操控限制了其在GNSS和遥测信号缺失场景中的应用。为此,我们提出了一种新颖的传感器运动策略,该策略使用立体视觉深度和视觉惯性里程计(VIO)在未知环境中自主穿越障碍物以到达目标点。该策略由一个预训练的自编码器作为感知前端,后接一个规划与控制LSTM网络,输出速度指令,可由现成的商用无人机执行。我们利用强化学习和特权学习范式,通过两阶段过程在仿真中训练该策略:1)以全局运动规划器生成的优化轨迹作为监督骨干进行初始训练;2)在课程环境中进一步微调。为弥合仿真到现实的差距,我们采用领域随机化和奖励塑造来创建对噪声和领域偏移具有鲁棒性的策略。在户外实验中,我们的方法成功实现了对训练中从未遇到的障碍环境和无人机平台的零样本迁移。

英文摘要

Although quadcopters boast impressive traversal capabilities enabled by their omnidirectional maneuverability, the need for continuous pilot control in complex environments impedes their application in GNSS- and telemetry-denied scenarios. To this end, we propose a novel sensorimotor policy that uses stereo-vision depth and visual-inertial odometry (VIO) to autonomously navigate through obstacles in an unknown environment to reach a goal point. The policy comprises a pre-trained autoencoder as the perception head, followed by a planning and control LSTM network that outputs velocity commands which can be followed by an off-the-shelf commercial drone. We leverage reinforcement and privileged learning paradigms to train the policy in simulation through a two-stage process: 1) initial training with optimal trajectories generated by a global motion planner acting as a supervisory backbone, 2) further fine-tuning in a curriculum environment. To bridge the sim-to-real gap, we employ domain randomization and reward shaping to create a policy that is robust to both noise and domain shift. In outdoor experiments, our approach achieves successful zero-shot transfer to both obstacle environments and a drone platform that were never encountered during training.

2605.24448 2026-05-26 cs.CV

SILSM: A Sustainable Interactive Level Set Method for Progressive Refinement

SILSM:一种可持续交互式水平集方法用于渐进式细化

Jiachen Song, Dazhi Zhang, Fanghui Song, Zhichang Guo, Shengzhu Shi

AI总结 提出一种可持续交互式水平集方法(SILSM),通过解耦用户引导为独立交互项并采用高阶正则化,实现稳定、渐进细化的交互式分割。

详情
AI中文摘要

交互式分割旨在利用稀疏的用户引导精确分离目标对象。然而,传统方法通常面临交互负担重和参数敏感的问题,而深度学习方法则受限于数据依赖和迭代不稳定性。受这些限制的启发,我们提出了可持续交互式水平集方法(SILSM)。所提出的水平集演化方程包含交互项、正则化项和分割项。具体来说,采用高阶正则化以保持数值稳定性,并且与传统方法不同,我们将用户引导解耦为一个独立的交互项,从而能够直接手动控制零水平集的演化。此外,我们开发了一种适用于多次交互的数值算法,通过基于顺序用户输入有效更新分割结果,促进动态细化。我们从理论上证明,高阶项比传统长度项提供更强的正则化约束,而交互项确保分割严格在用户选择的区域内。实验结果进一步表明,所提出的方法对交互输入具有鲁棒性,在首次交互时即达到有竞争力的性能,并支持稳定的多轮交互,分割质量逐步提高。

英文摘要

Interactive segmentation aims to precisely isolate target objects using sparse user guidance. However, traditional methods often suffer from heavy interaction burdens and parameter sensitivity, while deep learning approaches struggle with data dependency and iterative instability. Motivated by these limitations, we propose the Sustainable Interactive Level Set Method (SILSM). The proposed level set evolution equation incorporates interaction, regularization, and segmentation terms. Specifically, high-order regularization is employed to maintain numerical stability, and unlike traditional methods, we decouple user guidance into an independent interaction term to enable direct manual control over the zero-level set evolution. Furthermore, we develop a numerical algorithm tailored for multiple interactions, which facilitates dynamic refinement by effectively updating the segmentation results based on sequential user inputs. We theoretically demonstrate that the high-order term provides stronger regularization constraints than the conventional length term, while the interaction term ensures segmentation strictly within the user-selected region. Experimental results further demonstrate that the proposed method is robust to interactive inputs, achieves competitive performance at the first interaction, and supports stable multi-round interactions with progressively improved segmentation quality.

2605.24442 2026-05-26 cs.CV

Benchmarking Composed Image Retrieval for Applied Earth Observation

面向应用地球观测的组合图像检索基准测试

Bill Psomas, Dionysis Christopoulos, Thanasis Petropoulos, Nikos Efthymiadis, Ioannis Kakogeorgiou, Ondřej Chum, Yannis Avrithis, Giorgos Tolias, Konstantinos Karantzalos

AI总结 针对遥感组合图像检索(RSCIR),本文通过统一基准测试和面向应用的研究,系统评估了现代组合方法在地球观测图像上的可迁移性,并引入面向灾害监测的变化中心数据集xView2-CIR,揭示了无训练组合方法的优势及变化中心检索的独特挑战。

详情
AI中文摘要

遥感组合图像检索(RSCIR)能够使用结合参考图像和文本修饰符的组合查询在大型卫星图像档案中进行搜索。尽管RSCIR为表达目标检索意图提供了灵活的接口,但现代组合方法在地球观测(EO)图像上的可迁移性及其与操作化EO工作流的相关性仍未得到充分探索。我们通过统一的基准测试和面向应用的研究来填补这一空白。首先,我们在PatternCom上使用标准化协议,系统地调整并评估了具有六个视觉-语言骨干网络的代表性组合图像检索方法,分析了它们在不同骨干网络、组合策略和查询类型上的行为。其次,我们引入了xView2-CIR,这是一个面向灾害和损害监测的变化中心数据集,其中检索以场景身份和目标灾后状态为条件。我们的结果表明,无训练组合方法为EO检索提供了强大且可扩展的基线,而变化中心检索则呈现出与基于属性的检索不同的挑战,特别是由于需要保持场景身份。总体而言,本研究为RSCIR建立了一个实用的基准测试,并将组合检索定位为遥感图像检索、档案探索和变化分析的补充工具。数据集和代码可在https://github.com/billpsomas/rscir获取。

英文摘要

Remote sensing composed image retrieval (RSCIR) enables search in large satellite image archives using composed queries that combine a reference image with a textual modifier. Although RSCIR offers a flexible interface for expressing targeted retrieval intent, the transferability of modern composition methods to Earth observation (EO) imagery and their relevance to operational EO workflows remain underexplored. We address this gap through a unified benchmark and an application-oriented study. First, we systematically adapt and evaluate representative composed image retrieval methods with six vision-language backbones on PatternCom under a standardized protocol, analyzing their behavior across backbones, composition strategies, and query types. Second, we introduce xView2-CIR, a change-centric dataset for disaster and damage monitoring, where retrieval is conditioned on scene identity and a target post-event state. Our results show that training-free composition methods provide strong and scalable baselines for EO retrieval, while change-centric retrieval presents different challenges from attribute-based retrieval, particularly due to the need to preserve scene identity. Overall, this study establishes a practical benchmark for RSCIR and positions composed retrieval as a complementary tool for remote sensing image retrieval, archive exploration, and change analysis. The dataset and code are available at https://github.com/billpsomas/rscir.

2605.24437 2026-05-26 cs.LG

CAffNet: Hard Constraint-Affine Neural Networks

CAffNet: 硬约束仿射神经网络

Yang Zhao, Jungeun Lee, Jeong hwan Jeon, Sze Zheng Yong

AI总结 提出一种将任意基数的输入相关仿射约束硬嵌入前馈神经网络和Transformer的框架,通过可训练的约束仿射层实现联合优化并保持通用逼近性质。

详情
AI中文摘要

我们提出了一种新颖的框架,用于将硬约束满足嵌入神经网络(NN)架构中,特别是前馈神经网络和Transformer,约束为任意基数的输入相关仿射约束。传统的约束执行方法要么依赖于基于惩罚的软约束,无法保证满足性,要么依赖于训练后执行约束的后处理方法,可能导致次优性。我们在神经网络中引入了一个可训练的约束仿射(CAffine)层,得到CAffNet,它超越了通过固定正交或平行投影执行仿射约束的方式,并实现了与网络参数的联合优化。此外,我们对约束空间维度没有施加任何限制,并证明了我们的构造保持了神经网络的通用逼近性质,同时为所有输入提供了约束遵守的可证明保证。实验验证表明,在需要保证约束满足的各个领域中,性能稳健。

英文摘要

We present a novel framework for embedding hard constraint satisfaction into neural network (NN) architectures, specifically feedforward neural networks and transformers, with input-dependent affine constraints of arbitrary cardinality. Traditional constraint enforcement approaches either rely on penalty-based soft constraints, which offer no guarantee of satisfaction, or on post-processing methods that enforce constraints after the NN is trained, which may lead to suboptimality. We introduce a trainable constraint-affine (CAffine) layer into NNs, yielding CAffNet, which goes beyond enforcing affine constraints via fixed orthogonal or parallel projections and enables joint optimization with network parameters. Moreover, we impose no restrictions on the constraint space dimensions and establish that our construction preserves the universal approximation properties of NNs, while providing provable guarantees on constraint adherence for all inputs. Experimental validation demonstrates robust performance across diverse domains requiring guaranteed constraint satisfaction.
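
For context, the fixed orthogonal projection that the abstract contrasts CAffNet with can be sketched for a single affine constraint. This is an illustrative baseline only, not the CAffine layer itself, whose construction the abstract does not specify. The function name `project_onto_affine` and the single-constraint form a · y = b are assumptions; in an input-dependent setting, `a` and `b` would be computed from the network input.

```python
def project_onto_affine(y, a, b):
    """Orthogonally project y onto the hyperplane {z : a . z = b}.

    y and a are equal-length lists of floats, b is a float. The result is
    the closest point to y (in Euclidean distance) that satisfies the
    constraint exactly.
    """
    a_sq = sum(ai * ai for ai in a)
    if a_sq == 0.0:
        raise ValueError("constraint normal a must be non-zero")
    # Signed violation of the constraint at y.
    residual = sum(ai * yi for ai, yi in zip(a, y)) - b
    # Move y along the normal direction a until the violation vanishes.
    return [yi - ai * residual / a_sq for ai, yi in zip(a, y)]


# Example: enforce y0 + y1 = 2 on a raw network output [3.0, 1.0].
projected = project_onto_affine([3.0, 1.0], [1.0, 1.0], 2.0)
```

A projection like this guarantees the constraint for every input but is fixed rather than learned, which is the limitation the trainable layer in the abstract is said to address.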

2605.24433 2026-05-26 cs.RO cs.LG

Smoother Action Chunking Flow Policy via Prior-Corrected Orthogonal Trust-Region Guidance

基于先验校正的正交信任区域引导的平滑动作块流策略

Kai Fang, Hailong Pei, Xuemin Chi

AI总结 提出POTR方法,通过先验校正权重和正交信任区域约束,改善流匹配机器人策略中动作块推理的边界不连续性和横向扰动,提升成功率和运动平滑性。

详情
AI中文摘要

流匹配机器人策略通常使用动作块推理进行高效的闭环控制,但块边界可能引入不连续的动作转换。现有的RTC引导通过在去噪过程中注入校正信号来改善连续性,但其权重调度在中间时间步较弱,且无约束的校正方向可能引入横向扰动。我们提出POTR,一种先验校正的正交信任区域引导方法。首先,我们将数据先验尺度$σ_d$纳入RTC引导权重,产生更强的中间时间校正。其次,我们将引导向量分解为与去噪速度平行和垂直的分量,并将垂直分量约束在信任区域内。在LIBERO上使用$π_{0.5}$,与RTC相比,POTR提高了成功率,并持续减少了块边界不连续性、加速度和加加速度。消融实验表明,先验校正权重提供了主要的校正增益,而正交信任区域进一步提高了稳定性。

英文摘要

Flow-matching robot policies commonly use action-chunking inference for efficient closed-loop control, but chunk boundaries can introduce discontinuous action transitions. Existing RTC guidance improves continuity by injecting correction signals during denoising, yet its weight schedule is weak at intermediate timesteps and its unconstrained correction direction may introduce transverse perturbations. We propose POTR, a **p**rior-corrected **o**rthogonal **t**rust-**r**egion guidance method. First, we incorporate a data-prior scale $σ_d$ into the RTC guidance weight, yielding stronger intermediate-time correction. Second, we decompose the guidance vector into components parallel and perpendicular to the denoising velocity, and constrain the perpendicular component within a trust region. On LIBERO with $π_{0.5}$, POTR improves success rate and consistently reduces chunk-boundary discontinuity, acceleration, and jerk compared with RTC. Ablations show that the prior-corrected weight provides the main correction gain, while the orthogonal trust region further improves stability.
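
The second step described above, splitting the guidance vector relative to the denoising velocity and bounding the perpendicular part, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `constrain_guidance`, the Euclidean-norm trust region, and the `radius` parameter are hypothetical, and the prior-corrected weight from the first step is not modeled here.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def constrain_guidance(guidance, velocity, radius):
    """Bound the part of `guidance` that is perpendicular to `velocity`.

    The guidance vector is split into a component parallel to the
    denoising velocity and a perpendicular remainder. The parallel part is
    kept unchanged; the perpendicular part is rescaled so that its norm
    does not exceed `radius` (an assumed Euclidean trust region).
    """
    v_sq = dot(velocity, velocity)
    if v_sq == 0.0:
        # No direction to project onto: treat all guidance as perpendicular.
        parallel = [0.0] * len(guidance)
    else:
        scale = dot(guidance, velocity) / v_sq
        parallel = [scale * v for v in velocity]
    perp = [g - p for g, p in zip(guidance, parallel)]
    perp_norm = math.sqrt(dot(perp, perp))
    if perp_norm > radius:
        perp = [x * radius / perp_norm for x in perp]
    return [p + q for p, q in zip(parallel, perp)]


# Velocity along the first axis; the transverse part of the guidance
# (second axis) is shrunk from 4.0 to the trust-region radius 1.0.
clipped = constrain_guidance([3.0, 4.0], [1.0, 0.0], radius=1.0)
```

Keeping the parallel component intact preserves the correction along the denoising direction, while the clamp limits the transverse perturbations the abstract attributes to unconstrained guidance.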

2605.24432 2026-05-26 cs.CL

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

在对话中发现:LLMs 自我学习弥合多轮差距

Tianlang Chen, Shirley Wu, Jure Leskovec

AI总结 提出 Found in Conversation (FiC) 框架,通过视图非对称自蒸馏方法,让模型从单轮视角向多轮视角迁移能力,显著缩小多轮对话与单轮性能的差距。

Comments 17 pages, 3 figures, 6 tables

详情
AI中文摘要

大型语言模型(LLM)的交互通常是不明确的,用户需要在多个对话轮次中澄清所有必要的细节。然而,最近的研究表明,LLM 在这种多轮设置中的表现远不如在单轮中同时获得相同信息时的表现,这一现象被称为“Lost-in-Conversation”。然而,有效弥合这一差距仍然是一个开放问题。在这里,我们引入了 Found in Conversation (FiC),一个训练框架,其中模型自我学习在给定不明确的多轮提示时找到并恢复其单轮能力。我们开发了视图非对称自蒸馏方法,该方法在同一任务信息的两个视图之间进行蒸馏——教师视角为单轮视图,学生视角为多轮视图——将强大的单轮行为转移到弱的多轮行为中。这不需要更强的外部教师,因为即使是前沿的 LLM 也存在这种差距。在多个模型家族(Llama、Qwen、Phi 和 OLMo)和规模(3B-14B)上,FiC 恢复了至少 92% 的单轮性能,并在两个 Llama 骨干上达到了 100%,从而在保持单轮能力的同时实现了更高效、更有帮助的多轮对话。

英文摘要

Large Language Model (LLM) interactions are typically underspecified, with users clarifying all necessary details across multiple conversational turns. Yet recent work shows that LLMs perform far worse in this multi-turn setting than in a single turn with the same information available at once, a phenomenon termed "Lost-in-Conversation." However, bridging this gap effectively remains an open problem. Here we introduce Found in Conversation (FiC), a training framework where a model teaches itself to find and recover its single-turn competence given underspecified multi-turn prompts. We develop View-Asymmetric Self-Distillation, which distills across two views of the same task information (a single-turn view for the teacher, a multi-turn view for the student), transferring strong single-turn behavior into weak multi-turn behavior. This requires no stronger external teacher, which is unavailable as even frontier LLMs exhibit this gap. Across model families (Llama, Qwen, Phi, and OLMo) and sizes (3B-14B), FiC recovers at least 92% of single-turn performance and reaches 100% on two Llama backbones, yielding more efficient and helpful multi-turn conversations with single-turn capabilities intact.
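
One common way to distill between two views of the same model is to penalize the divergence between their next-token distributions. The sketch below assumes a forward KL divergence between a teacher distribution (computed from the single-turn view) and a student distribution (computed from the multi-turn view). The abstract does not state the actual objective, so the loss form and the names `softmax` and `distill_loss` are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits):
    """Forward KL divergence KL(teacher || student) for one token position.

    teacher_logits: logits from the model given the single-turn view.
    student_logits: logits from the same model given the multi-turn view.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# The loss is zero when both views already agree and positive otherwise.
same = distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = distill_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

In a training loop, the teacher logits would typically be treated as constants so that gradients only update the multi-turn behavior.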

2605.24428 2026-05-26 cs.LG

Representation-Guided Discrete Molecular Graph Retrosynthesis

表示引导的离散分子图逆合成

Jiahai Huang, Anjie Qiao, Zhen Wang, Defu Lian, Yutong Lu

AI总结 提出表示引导的分子图逆合成方法GRG,通过将预训练编码器的化学语义注入扩散模型,在USPTO-50k上达到58.6/77.2/83.4/87.1的top-1/3/5/10准确率,多样性提升至15.5,并加速收敛35%的epoch和30%的时间。

详情
AI中文摘要

基于随机过程的分子图生成器已成为无模板单步逆合成的最先进方法。然而,这些模型通常仅在产物-反应物对上训练,从而以间接和隐式的方式获取化学相关表示。与此同时,计算机视觉的最新进展表明,向生成器提供表示引导可以有效地将预训练编码器的语义提取到DiTs中,显著改善收敛性和生成质量。类似的增益是否适用于逆合成任务,以及哪些图特定的设计选择可以使其工作,仍然是一个开放问题。为了解决这些问题,我们在一个统一的设计空间上进行了系统的实证研究,该空间涵盖教师分子表示、端点和粒度选择、去噪器中的注入深度、对应策略和引导方案。在这些考虑的指导下,我们开发了图导向的表示引导(GRG),在USPTO-50k上实现了58.6/77.2/83.4/87.1的top-1/3/5/10准确率,同时将多样性提高到15.5,两者均大幅优于所采用的基础生成器。值得注意的是,GRG在分布外设置中一致地改进了所有top-k指标,表明表示引导有助于获取内在的化学语义。同时,引入的表示引导将达到可比性能所需的epoch数减少了35%,挂钟时间减少了30%。此外,我们引入了一种简单而有效的基于表示相似性的重排序机制,该机制无需训练额外的验证器即可进一步改善排序列表的顶部。

英文摘要

Stochastic process-based molecular graph generators have become the state of the art for template-free single-step retrosynthesis. However, these models are typically trained only on product-reactant pairs, thereby acquiring chemistry-relevant representations in an indirect and implicit manner. Meanwhile, recent advances in computer vision demonstrate that offering representation guidance to a generator can effectively distill semantics from pretrained encoders into DiTs, substantially improving both convergence and generation quality. Whether similar gains extend to the retrosynthesis task, and what graph-specific design choices can make them work, remain open questions. To address these questions, we conduct a systematic empirical study over a unified design space spanning teacher molecular representations, endpoint and granularity choices, injection depths in the denoiser, correspondence strategies, and guidance schemes. Guided by these considerations, we develop Graph-oriented Representation Guidance (GRG), which achieves 58.6 / 77.2 / 83.4 / 87.1 top-1 / 3 / 5 / 10 accuracy on USPTO-50k, while increasing diversity to 15.5, both substantially outperforming the adopted base generator. Notably, GRG consistently improves all top-k metrics in out-of-distribution settings, suggesting that representation guidance facilitates the acquisition of intrinsic chemical semantics. Meanwhile, the introduced representation guidance reduces the number of epochs by 35% and the wall-clock time by 30% to reach comparable performance. In addition, we introduce a simple yet effective representation-similarity-based reranking mechanism, which further improves the top of the ranked list without training an additional verifier.
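
The top-k accuracy figures quoted above follow the usual pattern for ranked predictions: a test reaction counts as a hit if the reference reactant set appears among the first k candidates. The sketch below assumes candidates and references are already canonicalized strings (for example canonical SMILES) so that exact string comparison is meaningful. The function name `top_k_accuracy` and the toy strings are hypothetical.

```python
def top_k_accuracy(ranked_predictions, references, k):
    """Fraction of examples whose reference is among the first k predictions.

    ranked_predictions: list of candidate lists, best candidate first.
    references: list of ground-truth strings, one per example.
    """
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must align")
    hits = sum(
        1
        for candidates, reference in zip(ranked_predictions, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


# Toy example with placeholder strings standing in for canonical SMILES.
predictions = [["A", "B", "C"], ["X", "Y", "Z"]]
truth = ["B", "Q"]
top1 = top_k_accuracy(predictions, truth, k=1)
top3 = top_k_accuracy(predictions, truth, k=3)
```

Because the metric only looks at the first k positions, a reranking step such as the one mentioned in the abstract can raise top-1 without changing the candidate set itself.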

2605.24426 2026-05-26 cs.CL

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

SEAL: 智能体与学习环境的协同共演化

Yihao Hu, Zhihao Wen, Xiujin Liu, Pan Wang, Xin Zhang, Wei Wu

AI总结 提出SEAL框架,通过协同演化智能体策略与训练环境,解决智能体与环境错配问题,在低资源多轮工具使用任务中提升性能并实现正向分布外迁移。

详情
AI中文摘要

大型语言模型(LLM)智能体通过交互不断改进,然而大多数自我演化方法孤立地调整策略或学习环境。我们识别出这一结构性问题为“智能体-环境错配”:智能体的能力边界在训练过程中变化,而提供监督的环境保持静态或仅与智能体暴露的失败弱耦合。我们提出SEAL,一个用于交互式工具使用智能体的闭环共演化框架。SEAL在可执行验证下收集在线策略轨迹,将失败轨迹诊断成回合级失败标签,并将这些诊断作为环境侧适应和模型侧策略优化的共享信号。环境通过暴露更清晰的工具能力线索、约束信息和面向恢复的反馈来演化其训练时的学习接口,而策略则通过诊断引导的优势加权进行更新。在分布内和分布外多轮工具使用评估中的大量实验表明,SEAL改进了低资源智能体学习:仅使用400个训练样本,它在三个骨干网络上取得了+8.25到+26.25的平均分提升,并表现出正向的分布外迁移。这些结果证明了联合调整学习器及其训练时学习基质对于鲁棒的自我改进LLM智能体的价值。

英文摘要

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.

2605.24425 2026-05-26 cs.LG cs.AI cs.CL

Momentum Streams for Optimizer-Inspired Transformers

动量流:优化器启发的Transformer

Jingchu Gai, Nai-Chieh Huang, Jiayun Wu

AI总结 提出一类优化器启发的Transformer(如三重动量TMMFormer),通过将残差更新解释为优化器步骤,发现动量是性能提升的关键,能收敛到更平坦的极小值,减少遗忘并改善泛化。

详情
AI中文摘要

预归一化Transformer层的残差更新可以被解释为对代理token能量执行一阶优化器的一步,其中注意力和MLP子层充当梯度预言。基于这一观察,我们构建了一族优化器启发的Transformer(三重动量、Adam/AdamW、Muon、SOAP),并在匹配计算量下进行比较。在我们的主要预训练实验中,三重动量TMMFormer取得了最低的验证损失,优于普通Transformer和先前的架构变体。受控消融实验和支持理论表明,动量(而非预条件)是增益的主要来源。我们进一步证明,TMMFormer和其他基于动量的设计比普通Transformer收敛到更平坦的极小值,这导致更少的遗忘和更好的泛化。

英文摘要

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.
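
The optimizer reading of the residual stream can be illustrated with a toy comparison: a plain residual stack behaves like repeated gradient-style steps, and adding a momentum buffer that persists across layers turns it into a heavy-ball style update. The sketch below uses classical heavy-ball momentum as a stand-in; it is not the triple-momentum update of TMMFormer, which the abstract does not spell out. The names `plain_residual`, `momentum_residual`, and `beta` are assumptions.

```python
def plain_residual(x, sublayers):
    """Standard residual stack: x <- x + f(x) for each sublayer f."""
    for f in sublayers:
        x = [xi + ui for xi, ui in zip(x, f(x))]
    return x


def momentum_residual(x, sublayers, beta=0.9):
    """Residual stack with a heavy-ball momentum stream.

    Each sublayer output is treated as an update direction. A momentum
    buffer m accumulates these directions across layers, and the hidden
    state moves along the buffer instead of the raw sublayer output.
    """
    m = [0.0] * len(x)
    for f in sublayers:
        update = f(x)
        m = [beta * mi + ui for mi, ui in zip(m, update)]
        x = [xi + mi for xi, mi in zip(x, m)]
    return x


# Two toy "sublayers" that always propose the same unit update.
layers = [lambda x: [1.0 for _ in x]] * 2
plain_out = plain_residual([0.0], layers)
momentum_out = momentum_residual([0.0], layers, beta=0.5)
```

With `beta=0.0` the momentum variant reduces to the plain residual stack, which makes the momentum buffer the only difference between the two in this toy comparison.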

2605.24423 2026-05-26 cs.AI

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

临时团队协作中上下文强化学习的极限基准测试

Yuheng Jing, Kai Li, Ziwen Zhang, Jiajun Zhang, Zeyao Ma, Jiaxi Yang, Lei Zhang, Zhe Wu, Jinmin He, Junliang Xing, Jian Cheng

AI总结 提出ICRL4AHT基准,基于Overcooked-V2评估上下文强化学习在临时团队协作中的表现,发现算法在未见队友和布局下常不如随机基线,凸显多智能体环境下的适应挑战。

Comments 41 pages, 14 figures

详情
AI中文摘要

上下文强化学习(ICRL)使基础智能体能够即时适应新任务,但其在需要与未知伙伴协调的临时团队协作(AHT)中的有效性尚未被探索。为严格评估这一点,我们引入了一个大规模基准ICRL4AHT,基于高吞吐量JAX实现的Overcooked-V2构建。我们的基准包括一个大型、多样化的队友套件,涵盖RL和启发式策略,支持可控的训练-测试转移,并提供了一个可复现的端到端流水线,用于队友生成、学习历史收集、数据集构建和在线多回合评估。我们评估了代表性的历史条件ICRL算法,包括算法蒸馏(AD)和决策预训练Transformer(DPT),跨越数百万次转移。结果揭示了显著的局限性:与它们在单智能体领域的成功相反,这些基线在多智能体设置中未能展现出稳健的测试时适应。具体来说,这些方法在未见队友和未见布局轨迹上经常表现不如随机基线,并且在长时间跨度内没有明显的上下文改进。这些发现凸显了在OvercookedV2 AHT协议下部分可观测性中战略推理的挑战,将我们的基准确立为下一代协调算法的关键测试平台。

英文摘要

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark, ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the Overcooked-V2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.

2605.24420 2026-05-26 cs.LG cs.AI

Batch Normalization Amplifies Memorization and Privacy Risks

批归一化加剧记忆化和隐私风险

Ngoc Phu Doan, Chongyan Gu, Ihsen Alouani

AI总结 本文通过实证和理论分析,发现批归一化层会显著增加模型对异常样本的记忆化,从而加剧隐私泄露风险。

详情
AI中文摘要

批归一化(BN)被广泛采用以加速深度神经网络的收敛并实现更稳定的训练。然而,其对隐私和记忆化的影响在很大程度上尚未被探索。在这项工作中,我们研究了BN层对非典型或异常样本记忆化的影响及其对隐私泄露的启示。我们使用三种互补方法进行了广泛的实证研究:(i)对分布外训练样本的无意记忆化,(ii)通过梯度范数测量的每个样本影响,以及(iii)对成员推断攻击(MIA)的敏感性。跨多个数据集和架构,我们一致观察到,与没有BN的模型相比,BN显著增加了对异常值的记忆化。关键的是,这种放大的记忆化直接转化为隐私漏洞:具有BN的模型对MIA表现出显著更高的敏感性。我们通过理论分析补充了实证结果,表明BN在训练过程中放大了异常样本的每步影响,为这一现象提供了机制性见解。我们的结果突显了与BN相关的被低估的隐私风险,并为归一化层如何放大罕见或敏感训练样本的影响提供了实践和理论见解。

英文摘要

Batch Normalization (BN) is widely adopted to enable faster convergence and more stable training of deep neural networks. However, its impact on privacy and memorization has remained largely unexplored. In this work, we investigate the effect of BN layers on the memorization of atypical or outlier samples and its implications for privacy leakage. We conduct an extensive empirical study using three complementary approaches: (i) unintended memorization of out-of-distribution training samples, (ii) per-sample influence measured via gradient norms, and (iii) susceptibility to membership inference attacks (MIA). Across multiple datasets and architectures, we consistently observe that BN substantially increases the memorization of outliers compared to models without BN. Critically, this amplified memorization translates directly into privacy vulnerabilities: models with BN exhibit significantly higher susceptibility to MIAs. We complement our empirical findings with a theoretical analysis showing that BN amplifies the per-step influence of outlier samples during training, providing mechanistic insight into this phenomenon. Our results highlight an underappreciated privacy risk associated with BN and provide both practical and theoretical insights into how normalization layers can amplify the influence of rare or sensitive training examples.