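arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.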

2606.10682 2026-06-10 cs.LG New submission

PL-KKT-hPINN: Enforcing Nonlinear Equality Constraints on Neural Networks via Piecewise-Linear Projection

PL-KKT-hPINN：通过分段线性投影在神经网络上强制非线性等式约束

Fateme Mohammad Mohammadi, Hector Budman, Joshua L. Pulsipher

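Affiliations: Department of Chemical Engineering, University of Waterloo

AI summary: Proposes the PL-KKT-hPINN framework, which strictly enforces nonlinear equality constraints through piecewise-linear projection; in a CSTR case study it maintains predictive accuracy while substantially reducing constraint violations, and it improves robustness when training samples are scarce.

AI Chinese abstract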

尽管物理信息神经网络（PINN）在过程建模中显示出强大潜力，但物理方程仅在训练期间作为软约束强制执行，因此无法保证推理时的约束满足。我们提出一个称为分段线性Karush--Kuhn--Tucker硬约束PINN（PL-KKT-hPINN）的框架，通过分段线性投影严格强制非线性等式约束。这扩展了KKT-hPINN框架，后者通过Karush--Kuhn--Tucker（KKT）条件精确强制线性等式，该条件与将神经网络输出正交投影到约束可行域相关。该方法在连续搅拌釜反应器（CSTR）案例研究中进行了单输入和双输入情况的演示。结果表明，PL-KKT-hPINN保持了与标准神经网络相当的预测精度，同时实现了显著更低的约束违反。此外，所提出的模型在低数据情况下显示出改进的鲁棒性，在有限的训练样本量下，其RMSE低于无约束神经网络。这些结果表明，PL-KKT-hPINN为非线性化学工程系统的代理建模提供了一种计算高效且物理一致的框架。

English abstract

While physics-informed neural networks (PINNs) have shown strong potential for process modeling, physical equations are only enforced as soft constraints during training, and thus, they do not guarantee constraint satisfaction at inference. We propose a framework, called piecewise-linear Karush--Kuhn--Tucker hard-constrained PINNs (PL-KKT-hPINNs), that strictly enforces nonlinear equality constraints through piecewise-linear projection. This extends the KKT-hPINN framework, which exactly enforces linear equalities through the Karush--Kuhn--Tucker (KKT) conditions associated with orthogonally projecting neural network outputs onto the constraint feasible region. The method is demonstrated on a continuous stirred-tank reactor (CSTR) case study for both the one-input and two-input cases. Results show that PL-KKT-hPINN preserves predictive accuracy comparable to that of a standard neural network while achieving substantially lower constraint violations. In addition, the proposed model shows improved robustness in low-data regimes, yielding lower RMSE than the unconstrained neural network for limited training sample sizes. These results demonstrate that PL-KKT-hPINN provides a computationally efficient and physically consistent framework for surrogate modeling of nonlinear chemical engineering systems.
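The orthogonal projection mentioned in the abstract can be illustrated for the simplest case, a single linear equality a·x = b. Minimizing ||x - y||² subject to that constraint and solving the KKT conditions gives the closed form x = y - a (a·y - b) / (a·a). The sketch below covers only this linear case; how PL-KKT-hPINN constructs its piecewise-linear pieces is not described in the abstract and is not attempted here. The function name, the example constraint, and the sample values are illustrative assumptions.

```python
def project_onto_hyperplane(y, a, b):
    """Orthogonally project y onto the hyperplane {x : a . x = b}.

    Closed form from the KKT conditions of
        minimize ||x - y||^2  subject to  a . x = b.
    """
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint vector a must be nonzero")
    residual = sum(ai * yi for ai, yi in zip(a, y)) - b
    return [yi - ai * residual / norm_sq for ai, yi in zip(a, y)]


# Hypothetical example: a raw network output that should satisfy x0 + x1 = 1
# (for instance, two fractions that must sum to one).
raw_output = [0.7, 0.5]
a, b = [1.0, 1.0], 1.0

projected = project_onto_hyperplane(raw_output, a, b)
violation = abs(sum(ai * pi for ai, pi in zip(a, projected)) - b)
print(projected, violation)
```

The projected point is the closest feasible point to the raw output in the Euclidean sense, so predictions are changed as little as possible while the constraint holds to floating-point precision.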

2606.10678 2026-06-10 cs.LG New submission

One Step Closer to Ground Truth: A Multi-Scale Residual-Aware Representation Learning Pipeline for Predicting Time Series Data

更接近真实：一种多尺度残差感知表示学习管道用于时间序列预测

Amrijit Biswas, Mustafa Kamal, Robin Krambroeckers, M. M. Lutfe Elahi, Sifat Momen, Nabeel Mohammed, Shafin Rahman

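Affiliations: RobotBulls Labs; North South University

AI summary: Proposes a two-stage, model-agnostic framework that explicitly decouples forecasting from residual learning and uses a meta-corrector to dynamically model structured error patterns, improving the forecasting accuracy of Transformers.

Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

AI Chinese abstract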

近年来，基于Transformer的模型已成为时间序列预测的主要范式，利用自注意力机制捕获长程依赖关系。尽管取得了成功，但这些单阶段预测架构由于结构差异、未建模的随机成分或多尺度时间表示不足，表现出持续的系统性残差偏差。当残差被视为不可约噪声时，这一局限性依然存在，阻碍了对结构化误差模式的自适应校正。为解决这一问题，我们引入了一个两阶段、模型无关的框架，将预测和残差学习显式解耦为不同的表示学习阶段。基础Transformer首先生成初始预测。随后，专用的元校正器动态建模跨多元通道的结构化误差模式，保留跨变量依赖关系，并迭代修正基础Transformer的残差偏差。通过将该管道形式化为假设空间扩展，我们的框架解决了单阶段架构固有的近似局限性，消除了对限制性假设的依赖，并实现了复杂误差动态的端到端学习。在八个流行的基准数据集上使用既定协议进行评估，我们的方法达到了最先进的性能，在标准指标（MSE、MAE）上有显著改进。结果表明，该框架能够减轻系统性偏差，增强对复杂时间动态的鲁棒性，推进了基于Transformer的预测模型的实际应用。

English abstract

Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework's ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.
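The two-stage idea, forecast first and then learn the structured residual, can be sketched in a few lines. This is a toy illustration under strong assumptions: the base model is a persistence forecaster rather than a transformer, the corrector is a one-variable least-squares line rather than the paper's meta-corrector, and the data are synthetic. All names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def base_forecast(last_value):
    # Stage 1 stand-in: predict that the next value equals the last one.
    return last_value


def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Synthetic data with a systematic bias the base forecaster cannot capture.
history = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = [1.1 * x + 0.5 for x in history]

# Stage 1: base predictions and their residuals.
base_preds = [base_forecast(x) for x in history]
residuals = [t - p for t, p in zip(targets, base_preds)]

# Stage 2: a separate corrector is fitted on the residuals only.
slope, intercept = fit_line(history, residuals)
corrected = [p + slope * x + intercept for p, x in zip(base_preds, history)]

base_mse = mse(base_preds, targets)
corrected_mse = mse(corrected, targets)
print(base_mse, corrected_mse)
```

Both errors are measured on the fitting data, so the comparison only shows that a structured residual is learnable, not how well such a corrector generalizes.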

2606.10677 2026-06-10 cs.AI cs.CL New submission

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Infini Memory：用于长期LLM智能体记忆的可维护主题文档

Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

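Affiliations: Infinigence AI; Tsinghua University; Shanghai Jiaotong University

AI summary: Proposes the Infini Memory architecture, which organizes agent memory into topic documents and achieves maintainable long-term memory through buffered consolidation and iterative retrieval, reaching an overall score of 64.7% on MemoryAgentBench.

AI Chinese abstract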

长期LLM智能体需要持久记忆，以跟踪变化的事实并在会话间提供相关证据。现有的记忆系统通常将观察存储为孤立的记录、摘要或索引片段，这使得证据聚合、事实修正和记忆维护变得困难。我们提出Infini Memory，一种可维护的基于文本的持久记忆架构，将智能体记忆视为主题结构化文档。每个主题文档作为一个语义单元，用于收集相关证据、保留元数据并随时间修正事实。新观察首先被暂存在缓冲区中，然后定期合并为连贯的文本上下文。在推理时，一种智能体检索过程允许LLM通过迭代工具调用读取记忆，而不是单次检索步骤。在MemoryAgentBench上，Infini Memory取得了64.7%的总体得分。消融实验表明，主题结构化维护和迭代证据检查改善了长期记忆使用的互补方面。

English abstract

Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves an overall score of 64.7%. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.
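A minimal sketch of the buffer-then-consolidate pattern described above, assuming a deterministic "latest timestamp wins" rule for fact revision. In the actual system consolidation is presumably performed by an LLM; the class, its methods, and the tool-like read functions below are hypothetical stand-ins.

```python
from collections import defaultdict


class TopicMemory:
    """Toy topic-structured memory with a staging buffer."""

    def __init__(self, buffer_limit=3):
        self.buffer = []                     # staged (topic, key, value, timestamp)
        self.documents = defaultdict(dict)   # topic -> {key: (value, timestamp)}
        self.buffer_limit = buffer_limit

    def observe(self, topic, key, value, timestamp):
        # New observations are staged first, not written to documents directly.
        self.buffer.append((topic, key, value, timestamp))
        if len(self.buffer) >= self.buffer_limit:
            self.consolidate()

    def consolidate(self):
        # Merge staged observations into topic documents; newer facts revise older ones.
        for topic, key, value, timestamp in self.buffer:
            current = self.documents[topic].get(key)
            if current is None or timestamp >= current[1]:
                self.documents[topic][key] = (value, timestamp)
        self.buffer.clear()

    # The two methods below play the role of tools an agent could call iteratively.
    def list_topics(self):
        return sorted(self.documents)

    def read_topic(self, topic):
        facts = self.documents.get(topic, {})
        return "\n".join(
            f"{key}: {value} (as of {timestamp})"
            for key, (value, timestamp) in sorted(facts.items())
        )


memory = TopicMemory(buffer_limit=2)
memory.observe("user_profile", "city", "Paris", 1)
memory.observe("user_profile", "city", "Berlin", 2)   # triggers consolidation
memory.observe("project", "deadline", "June", 3)
memory.consolidate()                                   # flush the remaining buffer

topics = memory.list_topics()
profile_text = memory.read_topic("user_profile")
print(topics)
print(profile_text)
```

An agent would typically call `list_topics` first and then `read_topic` on the topics that look relevant, which mirrors reading memory through several tool calls instead of one retrieval step.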

2606.10675 2026-06-10 cs.CL eess.AS New submission

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

基于自监督表示和学习动态规划的多语言词级强制对齐

Roy Weber, Meidan Zehavi, Rotem Rousso, Joseph Keshet

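Affiliations: Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology

AI summary: Proposes a multilingual word-level forced alignment method that combines self-supervised representations with learned dynamic programming; by fusing MMS and UnSupSeg features and learning word-boundary probabilities, it outperforms existing methods on multiple languages.

Comments: Interspeech 2026

AI Chinese abstract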

我们提出了一种准确的多语言词级强制对齐方法，包括一个对齐编码器和一个学习对齐解码器。编码器整合两种表示：一种来自大规模多语言语音（MMS）模型，另一种来自自监督音素边界检测器（UnSupSeg）。它学习融合这些表示，并在长时间上下文中估计词边界概率。对齐解码器是一种学习动态规划，它将编码器输出与基于MMS和UnSupSeg表示的段特征相结合，以推断最终词边界。在TIMIT和Buckeye上迭代训练后，所提方法在两个数据集上均优于Montreal Forced Aligner（MFA）和基于MMS的对齐方法。在未见语言（荷兰语、德语和希伯来语）上，所提模型的性能始终优于或与现有对齐方法相当，表明其有潜力在不进行进一步训练的情况下扩展到MMS支持的1100多种语言。

English abstract

We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic-programming procedure that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.

2606.10671 2026-06-10 cs.CV New submission

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

FadeMem: 面向自回归视频扩散的距离感知记忆巩固

Yu Lu, Junjie Yang, Piotr Koniusz, YuXin Song, Yi Yang

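Affiliations: Zhejiang University; University of New South Wales (UNSW); Data61/CSIRO; Baidu Inc

AI summary: Proposes FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget and uses a power-law allocation to build a dense-near, sparse-far memory, improving subject consistency and temporal coherence in long video generation.

Comments: 11 pages, 4 figures

AI Chinese abstract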

自回归视频生成器通过生成连续的时间片段来合成长时间视频，但其历史KV缓存随视频长度增长。现有的有界缓存方法通过局部窗口、汇合令牌或压缩记忆状态来降低这一成本，但它们通常为历史的不同部分分配固定角色。我们提出FadeMem，一种距离感知的KV记忆巩固机制，在固定缓存预算下将历史KV块组织成时间层次。这一设计受频率依赖的时间衰减启发：细节快速去相关，而粗略场景结构和身份在更长时间内保持有用。在生成过程中，新历史作为细粒度条目插入，而较旧的相邻条目在幂律时间分配调度下逐步合并，从而在单个缓存中产生近密远疏的记忆。无需架构更改，FadeMem保留近期上下文以处理短期动态，并保留紧凑的长程锚点以保持身份和场景连贯性。实验表明，与现有有界缓存策略相比，主体一致性、背景稳定性和时间连贯性均得到提升。

English abstract

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
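The dense-near, sparse-far behavior can be sketched with a fixed-budget list of cache entries in which older neighbors are merged first. The merge rule below is an assumption made for illustration: it merges the adjacent pair whose combined time span is smallest relative to its age raised to a power `alpha`, and it averages scalar values as a stand-in for merging KV blocks. The abstract does not specify FadeMem's actual schedule or merge operator.

```python
def insert_and_consolidate(cache, value, now, budget, alpha=1.0):
    """Insert a new fine-grained entry, then merge until the budget holds.

    cache: list of (start, end, value) covering [start, end), oldest first.
    """
    cache = cache + [(now, now + 1, value)]
    current_time = now + 1

    def cost(i):
        # Span of the would-be merged entry, discounted by a power of its age.
        start = cache[i][0]
        end = cache[i + 1][1]
        return (end - start) / ((current_time - start) ** alpha)

    while len(cache) > budget:
        i = min(range(len(cache) - 1), key=cost)
        (s1, e1, v1), (s2, e2, v2) = cache[i], cache[i + 1]
        w1, w2 = e1 - s1, e2 - s2
        merged = (s1, e2, (w1 * v1 + w2 * v2) / (w1 + w2))
        cache = cache[:i] + [merged] + cache[i + 2:]
    return cache


cache = []
for t in range(8):
    cache = insert_and_consolidate(cache, float(t), now=t, budget=4)

spans = [end - start for start, end, _ in cache]
print(cache)
print(spans)
```

With these settings the cache never exceeds four entries, recent frames keep one entry each, and older frames are summarized by progressively wider entries.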

2606.10666 2026-06-10 cs.CV cs.DB New submission

Analyzing Training-Free Corruption Detection for Object Detection Datasets

分析目标检测数据集的无训练腐败检测

Christian Sieberichs, Simon Geerkens, Thomas Waschulzik, Viswanathan Ramesh, Alexander Braun

Affiliations: University of Applied Sciences Düsseldorf; Siemens Mobility GmbH; Goethe University Frankfurt

AI summary: Studies the use of training-free feature-space methods for detecting annotation errors in object detection datasets, finding that they reliably expose semantic errors while positional errors remain hard to detect.

Comments: Accepted at DataCV Workshop, Conference on Computer Vision and Pattern Recognition (CVPR) 2026

AI Chinese abstract

注释错误在计算机视觉数据集中普遍存在，并且会显著降低在其上训练的系统的性能，特别是在目标检测等复杂任务中。存在多种识别注释错误的方法，包括无训练的特征空间方法，这些方法提供了一种快速且可解释的方式来分析注释。然而，对于包含语义和空间信息的目标检测注释，其行为在很大程度上仍未探索。在这项工作中，我们分析了基于特征空间的方法在检测目标检测数据集中的注释错误时的适用性。通过调整现有的特征空间方法，我们表明此类方法可靠地暴露语义错误，而位置错误仍然难以检测。我们使用VOC2012和KITTI，在多个预训练嵌入模型、合成噪声类型（对称、非对称和位置）以及真实世界注释错误上评估了这种行为。所有代码和真实世界腐败数据均可在以下仓库公开获取：https://github.com/ChristianSieberichs/BoundingBox_corruption_detection

English abstract

Annotation errors are widespread in computer vision datasets and can significantly degrade the performance of systems trained on them, particularly in complex tasks such as object detection. Several approaches exist to identify annotation errors, including training-free feature-space methods which provide a fast and interpretable way to analyze annotations. However, the behavior on object detection annotations, which include semantic and spatial information, remains largely unexplored. In this work we analyze the applicability of feature-space-based approaches for detecting annotation errors in object detection datasets. By adapting an existing feature-space method, we show that such approaches reliably expose semantic mislabels, while positional errors remain difficult to detect. We evaluate this behavior across multiple pretrained embedding models, synthetic noise types (symmetric, asymmetric, and positional), and real-world annotation errors using VOC2012 and KITTI. All code and real-world corruptions are publicly available at the following repository: https://github.com/ChristianSieberichs/BoundingBox_corruption_detection

2606.10657 2026-06-10 cs.CL New submission

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

我们是在评估知识还是措辞？使用ParaEval减轻MCQA敏感性

João Maria Janeiro, Mathurin Videau, Andrea Caciolai, Benjamin Piwowarski, Patrick Gallinari, Loic Barrault

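Affiliations: FAIR at Meta; Sorbonne Université, CNRS, ISIR, F-75005 Paris, France; Criteo AI Lab, Paris, France

AI summary: Addresses the sensitivity of multiple-choice benchmarks to answer phrasing by proposing the ParaEval framework, which queries several paraphrases of each option and scores the most favorable one, reducing the spurious performance gap from over 2 points to below 1 point so that evaluation reflects a model's true capability.

AI Chinese abstract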

多选题（MCQA）基准测试是评估预训练大语言模型的标准方法，但其依赖于对数似然评分使得结果不可靠。具体而言，标准评分对答案的确切措辞（表面形式）高度敏感，将模型对特定短语的熟悉程度与其实际能力混为一谈。我们使用一个受控测试床（1B-8B模型，基于相同知识训练）证明了这一缺陷。尽管拥有相同的知识，标准指标错误地报告了超过2分的性能差距。为了解决这个问题，我们提出了ParaEval，一个评估框架，它对每个答案选项使用多个释义来查询模型。通过根据每个模型最有利的措辞进行评分，ParaEval成功地将虚假性能差距降低到1分以下。我们确认这些评估伪影以及ParaEval的改进在前沿的70B和120B开源模型中仍然存在。最终，ParaEval提供了一种稳健且高效的方式来评估真正的底层能力，而不是表面形式的熟悉度。

English abstract

Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model's familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.
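The scoring rule described above can be sketched as follows: each answer option is represented by several paraphrases, the option's score is the best log-likelihood among them, and the predicted answer is the option with the highest such score. The log-likelihood table is made-up toy data standing in for a real model, and the function names are hypothetical; the paper's exact normalization and paraphrase generation are not given in the abstract.

```python
# Toy log-likelihoods a model might assign to each answer text (made-up values).
TOY_LOG_LIKELIHOODS = {
    "Paris": -4.0,
    "The city of Paris": -1.0,
    "Lyon": -2.0,
    "The city of Lyon": -3.0,
}


def toy_score(question, answer_text):
    # Stand-in for a model's log-likelihood of answer_text given the question.
    return TOY_LOG_LIKELIHOODS[answer_text]


def standard_choice(score_fn, question, options):
    # Standard scoring: only the first (canonical) phrasing of each option is scored.
    return max(options, key=lambda label: score_fn(question, options[label][0]))


def paraeval_choice(score_fn, question, options):
    # Paraphrase-aware scoring: each option is scored by its most favorable phrasing.
    best = {
        label: max(score_fn(question, text) for text in texts)
        for label, texts in options.items()
    }
    return max(best, key=best.get)


question = "What is the capital of France?"
options = {
    "A": ["Paris", "The city of Paris"],
    "B": ["Lyon", "The city of Lyon"],
}

standard = standard_choice(toy_score, question, options)
robust = paraeval_choice(toy_score, question, options)
print(standard, robust)
```

In this toy case the canonical phrasing makes the model look wrong, while scoring the best paraphrase per option recovers the correct answer, which is the kind of surface-form effect the abstract describes.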

2606.10656 2026-06-10 cs.CV New submission

Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

Envision4D: 通过前馈4D高斯泼溅展望自动驾驶的视觉未来

Qi Song, Yifei He, Chi Zhang, Zheng Fu, Xuhe Zhao, Mengmeng Yang, Kun Jiang, Rui Huang, Diange Yang

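Affiliations: Tsinghua University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes Envision4D, a fully self-supervised feed-forward framework that achieves pose-free future extrapolation through future pose prediction, in-layer temporal attention, and conditioned motion lifting, reaching state-of-the-art performance in dynamic scene forecasting for autonomous driving.

Comments: Project Page: https://maggiesong7.github.io/research/Envision4D/

AI Chinese abstract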

预测动态场景的未来演变在自动驾驶中至关重要。然而，现有的前馈范式主要设计用于插值。当扩展到未来外推时，它们在大位移下会出现重影伪影，并受限于简化的运动假设或严格的未来先验。为了克服这些挑战，我们提出了Envision4D，一种完全自监督的前馈框架，用于无位姿的未来外推。具体来说，我们引入了一个未来姿态预测模块，通过迭代去噪过程推断未来相机参数。此外，为了捕捉非线性动态，我们提出了层内时间注意力，并采用条件运动提升，将高度不确定的外推过程转化为稳健的关系映射。最后，利用渐进式训练策略来稳定无监督运动学习，防止误差累积。大量实验表明，Envision4D实现了最先进的性能，在未来的视图合成中显著优于现有方法。

English abstract

Forecasting the future evolution of dynamic scenes is crucial in autonomous driving. However, existing feed-forward paradigms are primarily designed for interpolation. When extended to future extrapolation, they suffer from ghosting artifacts under large displacements and are constrained by simplified motion assumptions or strict future priors. To overcome these challenges, we propose Envision4D, a fully self-supervised feed-forward framework for pose-free future extrapolation. Specifically, we introduce a Future Pose Prediction module that infers future camera parameters via an iterative denoising process. Furthermore, to capture non-linear dynamics, we propose In-layer Temporal Attention and employ Conditioned Motion Lifting, which transforms the highly uncertain extrapolation process into robust relational mappings. Finally, a Progressive Training Strategy is utilized to stabilize unsupervised motion learning against error accumulation. Extensive experiments demonstrate that Envision4D achieves state-of-the-art performance, significantly outperforming existing methods in future view synthesis.

2606.10654 2026-06-10 cs.CL New submission

Speaker Group Encoding in Self-supervised Speech Recognition Models

自监督语音识别模型中的说话人群体编码

Felix Herron, Solange Rossato, Alexandre Allauzen, Benoit Favre, François Portet

Affiliations: MILES Team, LAMSADE, Université Paris Dauphine-PSL, France; GETALP Team, LIG, Université Grenoble Alpes, France; NLP team, LIS, Aix-Marseille University, France

AI summary: Studies how self-supervised speech recognition models encode speaker group information, finding that fine-tuning tasks and fairness algorithms affect different types of group information differently.

DOI: 10.1007/978-3-032-02548-7_11
Journal ref: Text, Speech, and Dialogue. TSD 2025. Lecture Notes in Computer Science, vol 16029

AI Chinese abstract

我们研究了自监督语音识别模型（S3Ms）学习了关于说话人群体（SGs）的哪些信息。我们检查了S3Ms的几种状态：预训练、在说话人识别（SID）上微调、在自动语音识别（ASR）上微调，以及使用公平性增强算法进行ASR微调。我们发现S3Ms编码了关于几个说话人群体类别（SGCs）的信息，包括他们的性别、年龄、方言、种族以及是否为母语者。我们发现，针对SID的微调放大了某些SGCs，即那些方差更偏向语音性质的SGCs，尽管它没有放大其他SGCs，即那些方差更偏向语义性质的SGCs。另一方面，针对ASR的微调丢弃了语音变异的说话人群体信息（SGI），但保留了语义变异的SGI。我们发现，为改善公平性而设计的ASR算法改变了S3Ms中编码SGI的程度；然而，这主要适用于语音变异的SGCs，而对于语义变异的SGCs则不太适用。我们讨论了SGI如何被每一层编码，并识别了负责编码不同SGCs的嵌入子维度。最后，我们讨论了我们的发现如何有助于设计更公平的ASR算法。

English abstract

We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness-enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including their gender, age, dialect, ethnicity, and whether they are a native speaker. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, though it does not amplify other SGCs, namely those whose variance is more semantic in nature. On the other hand, finetuning for ASR discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed for fairness improvement change to what extent SGI is encoded in S3Ms; however, this is primarily true for phonetically variant SGCs, and less true for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.

2606.10653 2026-06-10 cs.CV New submission

STEDiff: Strengthening Text Embedding for Text-to-Image Alignment in Diffusion Model

STEDiff: 增强文本嵌入以提升扩散模型中文本到图像的对齐

Hailan Zhang, Haipeng Liu, Bo Fu, Yang Wang

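Affiliations: not specified (Hailan Zhang, Haipeng Liu, Yang Wang; Bo Fu)

AI summary: Proposes STEDiff, a training-free method that uses the [EOT] token to strengthen sub-sentence semantics and introduces a semantic enhancement loss, improving the semantic alignment of diffusion models with complex prompts in the text-embedding space without fine-tuning or layout priors.

Comments: 8 pages, 8 figures, to appear at IJCNN 2026

AI Chinese abstract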

尽管预训练的文本到图像（T2I）生成模型可以产生高质量图像，但由于随机噪声和固有的模型限制，它们常常无法忠实地反映复杂提示的语义意图。这个问题经常表现为模型忽略特定对象或无法正确地将属性绑定到其对应的实体上，这一挑战被称为语义对齐。与依赖计算昂贵的微调或劳动密集的布局先验的现有方法不同，我们提出了STEDiff，一种无需训练的方法，旨在直接在文本嵌入空间中增强语义表示。具体来说，我们引入了一种方法，主要利用[EOT]令牌来增强子句的相关语义，然后替换原始提示中的相应令牌。此外，还引入了一种新颖的语义增强损失来强制执行空间约束，确保每个实体的语义精确映射到其各自的图像区域。在T2I-CompBench上的大量定量和定性评估表明，我们的方法在复杂场景中显著提高了语义一致性和生成完整性。

English abstract

Although pretrained text-to-image (T2I) generation models can produce high-quality images, they often fail to faithfully reflect the semantic intent of complex prompts due to stochastic noise and inherent model limitations. This issue frequently manifests as the model overlooking specific objects or failing to correctly bind attributes to their corresponding entities, a challenge referred to as semantic alignment. Unlike existing approaches that rely on computationally expensive fine-tuning or labor-intensive layout priors, we propose STEDiff, a training-free method designed to enhance semantic representations directly within the text-embedding space. Specifically, we introduce a method that primarily leverages the [EOT] token to strengthen the relevant semantics of sub-sentences and then replaces the corresponding tokens in the original prompt. Furthermore, a novel semantic enhancement loss is incorporated to enforce spatial constraints, ensuring that the semantics of each entity are precisely mapped to their respective image regions. Extensive quantitative and qualitative evaluations on the T2I-CompBench demonstrate that our method notably improves semantic consistency and generation integrity in complex scenarios.

2606.10651 2026-06-10 cs.CV New submission

Kwai Keye-VL-2.0 Technical Report

Kwai Keye-VL-2.0 技术报告

Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang

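Affiliations: Kuaishou Group

AI summary: Presents Keye-VL-2.0, an open-source MoE multimodal foundation model that is the first to adapt DeepSeek Sparse Attention to a GQA architecture, supports lossless 256K-context processing, addresses catastrophic forgetting during multi-task alignment with cross-modal multi-teacher on-policy distillation together with Context-RL and Video-RL, and achieves the best results among comparable models on long-video understanding and agent tasks.

Comments: 31 pages, 11 figures

AI Chinese abstract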

我们介绍了 Kwai Keye-VL-2.0-30B-A3B，一个开源的混合专家（MoE）多模态基础模型，旨在推进长视频理解和智能体智能。为应对小时级视频中存在的超长上下文、信息冗余和过高计算成本等挑战，Keye-VL-2.0 首次将 DeepSeek 稀疏注意力（DSA）适配到基于 GQA 的多模态架构中，实现了无损的 256K 上下文处理，同时捕捉关键帧和长程时间依赖。该架构由高度优化的训练和推理基础设施支撑，包括可扩展的视频 I/O、异构 ViT-LM 并行和自定义 DSA 内核，显著提高了吞吐量并最小化计算开销。此外，为克服多任务对齐过程中灾难性遗忘的算法困境，我们引入了跨模态多教师在线策略蒸馏（MOPD），并结合上下文强化学习和视频强化学习。通过将在线策略 rollout 中的密集 token 级教师反馈蒸馏回仅激活 3B 参数的 MoE 骨干网络，Keye-VL-2.0 原生支持跨代码、工具和搜索场景的高级智能体协作，并具备多模态自我纠正能力。在视频理解、时间定位、推理、STEM 和智能体基准上的广泛评估表明，Keye-VL-2.0-30B-A3B 在相似规模模型中达到了最先进的性能，尤其在 TimeLens 上的细粒度时间定位和 Video-MME-v2 及 LongVideoBench 上的长视频理解方面表现优异。我们发布了模型检查点，以加速社区向可扩展且鲁棒的多模态智能体应用迈进。

English abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly increase throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

2606.10650 2026-06-10 cs.CL cs.AI New submission

Dynamic Linear Attention

动态线性注意力

Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang

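Affiliations: The Ohio State University; University of Michigan; ByteDance Seed

AI summary: Proposes the DLA framework, which uses information-aware dynamic state merging and capacity-bounded memory modeling to address the error accumulation caused by fixed merging policies in multi-state linear attention, outperforming existing methods on 16 datasets.

Comments: Accepted by ICML 2026

AI Chinese abstract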

大型语言模型（LLMs）对长上下文的可扩展性从根本上受限于标准注意力的二次复杂度，这促使采用具有次二次成本（sub-quadratic cost）的线性注意力机制。为了在长上下文下提高表示能力，近期方法以多状态方式组织内存。然而，现有的多状态线性注意力方法依赖于固定的状态合并策略，无法适应动态变化的令牌重要性，不可逆地模糊了关键令牌，并在长序列上导致严重的错误累积。为了解决这一限制，我们提出了DLA，一种用于多状态线性注意力的动态内存建模框架。DLA引入了（i）信息感知动态状态合并，它基于令牌级别的信息变化自适应地确定状态边界，在语义转换周围保留高分辨率表示，同时积极总结稳定区域；以及（ii）容量受限内存建模，它通过选择性地合并相邻的低信息状态来维护一个固定大小、按时间顺序排列的状态缓存，以最小的信息损失控制内存增长。我们在两种不同的线性注意力模型上预训练DLA，并在三个类别的16个数据集上进行评估。实验结果表明DLA优于现有最先进方法。

English abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art methods.

2606.10646 2026-06-10 cs.LG cs.CL New submission

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

推理流如何流动？追踪注意力诱导的信息流以实现LLM中的目标RL

Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan

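Affiliations: Shanghai Jiao Tong University; Alibaba Group; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes the FlowTracer framework, which traces answer-targeted reasoning flow on an attention-induced directed acyclic graph and assigns token-level credit from the global information-flow structure, improving reinforcement learning for LLMs on reasoning tasks.

Comments: 25 pages, 7 figures, 11 tables. Accepted at ICML 2026

AI Chinese abstract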

Token级信用分配仍然是大型语言模型（LLM）中强化学习（RL）的主要障碍，其中RL配方通常平等对待所有token，未能区分决定性推理步骤与常规格式或流畅填充。最近的研究利用模型内部信号分配更细粒度的信用，但这些往往是点式启发式方法，忽略了信息传播的全局结构。我们提出FlowTracer，一个RL框架，它在注意力诱导的有向无环图上追踪答案导向的推理流，其中节点对应token，边容量来自聚合的注意力权重，并从这种全局结构中推导出token信用。边容量被重新加权，仅保留能够到达答案区域的影响，同时强制执行局部流守恒，使得中间token不会因路径长度或无关分支而损失或获得有效质量。在此图上，FlowTracer提取连接问题与答案的信息流骨干，并通过流吞吐量对token进行评分，揭示调解长距离依赖的高影响枢纽和聚合检查点。这些推导出的重要性用于塑造token级奖励，使学习信号精确聚焦于将信息路由向（或远离）正确答案的token，并在各种推理任务中提供一致的性能提升。

English abstract

Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.
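The idea of scoring tokens by how much of their influence reaches the answer can be sketched with a backward pass over a causal attention matrix. This is a simplified illustration, not FlowTracer's algorithm: it assumes a single aggregated attention matrix without self-attention, treats each answer token as a sink with influence 1, and propagates influence backward along attention edges. The function name and the toy matrix are hypothetical.

```python
def answer_directed_influence(attention, answer_tokens):
    """Backward influence of each token on the answer region.

    attention[i][j] is the weight with which token i attends to an earlier
    token j (j < i), so information flows from j to i.
    """
    n = len(attention)
    influence = [0.0] * n
    # Later tokens are processed first, so influence[i] is final before it is used.
    for j in range(n - 1, -1, -1):
        if j in answer_tokens:
            influence[j] = 1.0
        else:
            influence[j] = sum(
                attention[i][j] * influence[i] for i in range(j + 1, n)
            )
    return influence


# Toy sequence: 0 = question, 1 = filler, 2 = reasoning step, 3 = answer.
attention = [
    [0.0, 0.0, 0.0, 0.0],   # token 0 has no earlier context
    [1.0, 0.0, 0.0, 0.0],   # filler attends to the question
    [0.9, 0.1, 0.0, 0.0],   # reasoning step attends mostly to the question
    [0.1, 0.1, 0.8, 0.0],   # answer attends mostly to the reasoning step
]

influence = answer_directed_influence(attention, answer_tokens={3})
print(influence)
```

In this toy graph the reasoning step receives far more credit than the filler token, and because every non-initial row sums to one, the total influence arriving at the question token equals the unit mass placed on the answer.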

2606.10645 2026-06-10 cs.CV New submission

ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

ManiSplat: 基于解耦3D高斯泼溅的单目视频操作轨迹合成

Wenhao Hu, Haonan Zhou, Liu Liu, Yun Du, Xinjie Wang, Ziang Li, Zhizhong Su, Gaoang Wang

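Affiliations: Zhejiang University; Horizon Robotics

AI summary: Proposes the ManiSplat framework, which reconstructs controllable 3D Gaussian digital twins from monocular video via a graph-structured disentangled representation and task-oriented spatio-temporal alignment, supporting robotic manipulation tasks and policy learning.

AI Chinese abstract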

从真实世界观测中重建动态且可交互的3D场景仍然是计算机视觉和机器人学中的一个基本挑战。尽管3D高斯泼溅的最新进展实现了高保真静态重建，但由于复杂的接触交互和突变的姿态变化，将其扩展到具有关节机器人和可操作物体的交互环境仍然困难。为了解决这些挑战，我们引入了ManiSplat，一个统一的框架，直接从单目自我中心机器人视频重建可控且解耦的高斯数字孪生。我们的方法引入了一种图结构解耦表示，将机器人、物体和背景分离为独立可优化的高斯子场，并组织在场景图中。为了确保稳定性，我们提出了一个任务导向的时空对齐模块，利用操作任务的内在逻辑——在运动和技能阶段之间交替——来构建准确的伪真实轨迹。最后，联合光度-几何优化确保重建场景在时间上连贯、物理上一致且可用于仿真。大量实验表明，我们的方法以高保真度和可控性重建了交互驱动的动态场景，有效支持下游机器人任务和策略学习。

English abstract

Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks, which alternate between Motion and Skill phases, to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.

2606.10640 2026-06-10 cs.CV New submission

ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

ChartLens：用于图表数据校正和事实性摘要精炼的双分支框架

Hao Liu, Ruping Cao, Kun Wang, Zhiran Li, Fan Liu, Yupeng Hu, Liqiang Nie

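Affiliations: Shandong University; Southeast University; Harbin Institute of Technology (Shenzhen)

AI summary: Proposes ChartLens, a dual-branch framework that improves chart data recovery and summary factuality through structure-aware CSV verification and correction and text-retention-guided summary refinement, ranking first in Track 2 of the DataMFM Challenge.

AI Chinese abstract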

在本报告中，我们展示了针对DataMFM挑战赛Track 2：图表理解（Chart Understanding）的冠军解决方案。该赛道要求模型从图表图像中恢复结构化图表数据并生成忠实于事实的自然语言摘要。为了满足准确数据提取和事实性叙述的互补需求，我们提出了ChartLens，一个用于图表数据校正和摘要精炼的双分支框架。ChartLens由两个关键模块组成：结构感知CSV验证与校正（SAVC）和文本保留引导的摘要精炼（TRSR）。SAVC通过验证和校正提高结构化数据提取的可靠性，而TRSR通过保留图表中的关键文本和数值证据来增强摘要生成。通过结合模型自适应、基于校正的生成和OCR辅助的证据依据，ChartLens改善了结构化数据恢复和摘要事实性。在测试集上，我们的最终系统获得了69.10的总分，并在Track 2中排名第一，证明了其在准确图表理解方面的有效性。我们的代码将在以下网址发布：https://github.com/iLearn-Lab/CVPRW26-ChartLens。

英文摘要

In this report, we present our champion solution for the DataMFM Challenge Track 2: Chart Understanding. This track requires models to recover structured chart data and generate faithful natural-language summaries from chart images. To address the complementary requirements of accurate data extraction and factual narration, we propose ChartLens, a dual-branch framework for chart data correction and summary refinement. ChartLens consists of two key modules: Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR). SAVC improves the reliability of structured data extraction through verification and correction, while TRSR enhances summary generation by preserving critical textual and numerical evidence from charts. By combining model adaptation, correction-based generation, and OCR-assisted evidence grounding, ChartLens improves both structured data recovery and summary factuality. On the test set, our final system achieves an overall score of 69.10 and ranks first in Track 2, demonstrating its effectiveness for accurate chart understanding. Our code will be released at: https://github.com/iLearn-Lab/CVPRW26-ChartLens.


2606.10632 2026-06-10 cs.LG cs.AI 新提交

Is Fairness Truly Fair? Towards Reliable Lipschitz Fairness in Multi-Task Learning via Fixed-$δ$ Alignment

公平真的公平吗？通过固定δ对齐实现多任务学习中可靠的Lipschitz公平性

Junbo Ding, Xin Zang, Chenchen Pan, Donghao Song, Jiaxin Zhu, Danhuai Guo

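Affiliations: Beijing University of Chemical Technology

AI summary: To address the problem that evaluation of Lipschitz individual fairness in multi-task learning is confounded by representation scale, proposes ReLiF, a framework that combines fixed-δ auditing with controlled regularization, enabling semantically consistent fairness evaluation and trade-offs.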

DOI: 10.1145/3770855.3817938

AI中文摘要

Lipschitz风格的个体公平性形式化了语义相似的样本应获得相似预测的思想，但在多任务学习（MTL）中，其评估可能受到方法引起的表示尺度的干扰。本文识别了阈值混淆问题：当审计容差源自每个模型自身的表示距离时，不同算法会在不同的语义阈值下进行比较。阈值漂移分析进一步展示了偏差排名如何变化，并识别了排名保持的充分条件。我们提出了\textbf{ReLiF}，一个可靠性感知框架，将评估时的固定$\delta$审计与训练时的受控正则化分离。ReLiF使用共享参考容差进行可比较的审计，并通过违反率反馈控制器保持Lipschitz代理活跃而不让其主导随机训练。本文还发展了关于阈值漂移、参考容差选择以及huberized训练代理与其未平滑的正间隔对应物之间关系的支持性分析。在临床时间序列基准和NYUv2（NYU Depth V2）密集预测上的实验表明，固定$\delta$审计暴露了方法依赖阈值可能掩盖的效用-公平性权衡。在使用ResNet50骨干的NYUv2上，ReLiF在共享固定阈值下实现了有竞争力的效用，同时显著减少了对齐偏差。在临床基准上，ReLiF产生了受控的公平性正则化权衡，而固定$\delta$审计揭示任务平衡基线有时能实现更低的偏差，且真正的效用-公平性权衡仍然存在。这些结果支持固定$\delta$审计作为评估MTL中Lipschitz公平性的语义一致协议。

英文摘要

Lipschitz-style individual fairness formalizes the idea that semantically similar examples should receive similar predictions, but its evaluation in multi-task learning (MTL) can be confounded by method-induced representation scales. This paper identifies threshold confounding: when the auditing tolerance is derived from each model's own representation distances, different algorithms are compared under different semantic thresholds. A threshold-drift analysis further shows how Bias rankings can change and identifies sufficient conditions for ranking preservation. We propose \textbf{ReLiF}, a reliability-aware framework that separates evaluation-time fixed-$δ$ auditing from training-time controlled regularization. ReLiF uses a shared reference tolerance for comparable auditing and a violation-rate feedback controller to keep the Lipschitz surrogate active without letting it dominate stochastic training. This work also develops supporting analysis for threshold drift, reference-tolerance selection, and the relationship between the huberized training surrogate and its unsmoothed positive-margin counterpart. Experiments on clinical time-series benchmarks and NYUv2 (NYU Depth V2) dense prediction show that fixed-$δ$ auditing exposes utility--fairness trade-offs that method-dependent thresholds can obscure. On NYUv2 with a ResNet50 backbone, ReLiF achieves competitive utility while substantially reducing aligned bias under shared fixed thresholds. On clinical benchmarks, ReLiF yields controlled fairness-regularized trade-offs, while fixed-$δ$ auditing reveals that task-balancing baselines can sometimes achieve lower bias and that genuine utility--fairness trade-offs persist. These results support fixed-$δ$ auditing as a semantically consistent protocol for evaluating Lipschitz fairness in MTL.
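The difference between a shared reference tolerance and a model-derived one can be illustrated with a toy audit. The sketch below is only an illustration under stated assumptions, not the paper's protocol: the function name `fixed_delta_bias`, the 1-D reference coordinates, and the mean-prediction-gap statistic are all hypothetical choices made here. It averages prediction gaps over pairs whose distance in a shared reference space is at most δ, so every model is audited on the same set of pairs.

```python
from itertools import combinations

def fixed_delta_bias(ref_points, predictions, delta):
    """Mean prediction gap over pairs within distance delta in a shared space.

    ref_points: 1-D reference coordinates shared by all audited models
    (a hypothetical stand-in for a shared semantic distance).
    predictions: one model's scalar outputs, aligned with ref_points.
    """
    gaps = [
        abs(predictions[i] - predictions[j])
        for i, j in combinations(range(len(ref_points)), 2)
        if abs(ref_points[i] - ref_points[j]) <= delta
    ]
    # With no pair inside the tolerance there is nothing to audit.
    return sum(gaps) / len(gaps) if gaps else 0.0

ref = [0.0, 0.1, 0.2, 1.0]
model_a = [0.50, 0.52, 0.49, 0.90]
model_b = [0.50, 0.70, 0.30, 0.90]
bias_a = fixed_delta_bias(ref, model_a, delta=0.25)
bias_b = fixed_delta_bias(ref, model_b, delta=0.25)
```

Because `ref` and `delta` are the same for both models, the two bias values are comparable; deriving the tolerance from each model's own representation would change which pairs are counted.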


2606.10628 2026-06-10 cs.CV 新提交

Leveraging Metric Depth for Relative Depth Prediction

利用度量深度进行相对深度预测

Xiaoyang Bi, Shuaikun Liu, Zhaohong Liu, Yuxin Yang, Zhe Zhao, Mengshi Qi, Liang Liu, Huadong Ma

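Affiliations: Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia; Beijing University of Posts and Telecommunications

AI summary: To address the scarcity of training samples for relative depth prediction in football (soccer) scenes, proposes leveraging the zero-shot capability of pretrained models to learn metric depth, achieving a score of $2.68 \times 10^{-3}$ on the challenge set.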

2606.10616 2026-06-10 cs.AI 新提交

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

学习记住什么：通过约束优化实现长时域语言代理的观测安全记忆保留

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

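Affiliations: Huawei Noah's Ark Lab; Department of Computer Science, City University of Hong Kong

AI summary: To address the finite context windows of long-horizon language agents, proposes the OSL-MR framework, which formulates memory retention as a constrained stochastic optimization problem and learns query-conditioned evidence value under a strict separation between online-observable features and offline supervision; experiments show it outperforms existing methods under tight budgets.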

AI中文摘要

长时域语言代理积累的观测、推理轨迹和检索事实会超出其有限的上下文窗口，使得记忆保留成为一个基本的资源分配问题。现有记忆系统通过启发式评分、检索优化或学习压缩来改进管理，但大多将保留视为局部决策问题，并未在现实观测约束下显式建模其长期后果。为填补这一空白，我们将记忆保留建模为一个约束随机优化问题，具有明确的预算可行性、证据效用以及延迟成本（包括遗漏惩罚、重新获取延迟和过时信息风险）。随后，我们提出OSL-MR（观测安全记忆保留学习），这是一个新颖的框架，强制执行在线可观测特征与离线可用监督（OAS）之间的严格分离。OSL-MR结合了一个从实现的证据监督中训练的证据学习器和一个混合评分启发式，该启发式既作为可部署的在线安全基线，又作为结构化的归纳先验用于学习。由此产生的策略直接从交互数据中学习查询条件化的证据价值，同时在同一观测约束下保持可部署性。在LOCOMO和LongMemEval上的实验表明，OSL-MR在严格记忆预算下持续优于基于最近性的方法、生成式代理风格评分和其他启发式基线。混合评分先验在保持召回率的同时进一步提高了精确度，敏感性分析表明其在广泛的成本配置下具有鲁棒性。

英文摘要

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but largely treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs including miss penalties, reacquisition delays, and stale-information risk. We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LOCOMO and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, particularly under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis demonstrates robustness across a wide range of cost configurations.
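The budget-feasibility part of this formulation can be made concrete with a small example. The sketch below is not OSL-MR's learned policy; it is a plain greedy value-per-token heuristic, and `MemoryItem`, `retain_under_budget`, and the `value` scores are hypothetical names and numbers introduced here. It assumes every item has a positive token cost.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    cost: int      # tokens the item occupies in the context window
    value: float   # estimated evidence value (hypothetical score)

def retain_under_budget(items, budget):
    """Greedily keep items with the best value per token within the budget."""
    kept, used = [], 0
    for item in sorted(items, key=lambda m: m.value / m.cost, reverse=True):
        if used + item.cost <= budget:
            kept.append(item)
            used += item.cost
    return kept, used

items = [
    MemoryItem("user's allergy", cost=10, value=9.0),
    MemoryItem("small talk", cost=40, value=2.0),
    MemoryItem("meeting date", cost=20, value=8.0),
]
kept, used = retain_under_budget(items, budget=35)
```

A learned scorer in the spirit of the abstract would replace the fixed `value` field with a query-conditioned estimate computed only from features observable online.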


2606.10611 2026-06-10 cs.LG cs.CV 新提交

Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

几何感知强化学习用于二维不规则排样

Auguste Lehuger, Guillaume Henon-Just

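Affiliations: Valeo Brain

AI summary: Proposes the Polygons Transformer architecture together with a combinatorial-optimization reinforcement learning framework, allowing the agent to learn geometric priors from data and reach area utilization competitive with Sparrow, the state-of-the-art heuristic solver, on 2D irregular nesting.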

Comments: 15 pages, 4 figures, 5 tables. Under review at the European Workshop on Reinforcement Learning (EWRL)

AI中文摘要

针对二维不规则排样问题的传统启发式求解器存在一个根本性限制：它们对多边形几何是盲目的，依赖引导式暴力搜索在连续放置空间中导航，几何指导极少。本文认为，强化学习具有独特优势来克服这一瓶颈。通过将优化策略与几何感知神经编码器配对，智能体可以直接从数据中自动发现丰富的几何先验，利用这些学到的直觉来战略性地引导探索。为实现这一点，我们引入了Polygons Transformer（PoT），这是一种新颖的架构，能够编码二维连续矢量几何，同时允许跨多边形注意力。我们将这种新颖架构与组合优化强化学习（CORL）训练框架相结合，以寻找最优解。为了支持这一范式，我们发布了一个源自复杂地理轮廓的开源训练数据集以及一个专门的评估基准。我们的实证验证表明，训练后的智能体在面积利用率方面与最先进的启发式求解器Sparrow高度竞争，证明强化学习可以成功发现并利用几何感知来完成精确的空间任务。

英文摘要

Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforcement Learning is uniquely positioned to overcome this bottleneck. By pairing an optimization policy with a geometry-aware neural encoder, an agent can automatically discover rich geometric priors directly from data, utilizing these learned intuitions to strategically guide exploration. To realize this, we introduce the Polygons Transformer (PoT), a novel architecture that encodes 2D continuous vector geometries while allowing cross-polygons attention. We couple this novel architecture with a Combinatorial Optimization Reinforcement Learning (CORL) training framework to find optimal solutions. To support this paradigm, we release an open-source training dataset derived from complex geographic contours alongside a dedicated evaluation benchmark. Our empirical validation demonstrates that our trained agent achieves area utilization performance highly competitive with Sparrow, the state-of-the-art heuristic solver, proving that reinforcement learning can successfully discover and exploit geometric awareness for precise spatial tasks.


2606.10610 2026-06-10 cs.CL 新提交

Small Data, Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning

小数据，大噪声：面向鲁棒参数高效微调的对抗训练

Eitan Cohen, Idan Simai, Uri Shaham

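Affiliations: Bar-Ilan University

AI summary: Proposes the SDBN framework, which combines adversarial training with parameter-efficient fine-tuning and, through variants based on discrete uncertainty sets, improves model robustness and generalization in low-resource settings.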

Comments: Accepted to Findings of ACL 2026

AI中文摘要

参数高效微调（PEFT）已成为将基础模型适应下游NLP任务的关键技术。然而，当前的PEFT方法在处理噪声鲁棒性和有限训练数据下的性能退化方面往往存在困难。我们提出SDBN（小数据大噪声），一个统一的框架，将对抗训练引入PEFT——尽管两者具有互补优势，但在PEFT设置中这一组合仍较少被研究——以增强模型鲁棒性和泛化能力，优于其他方法。我们还引入了该方法的两种变体，使用离散不确定性集：SDBN-h，枚举字符级编辑并使用梯度选择最坏情况变体；SDBN-p，使用LLM生成的变体进行生成任务中的鲁棒优化。跨多个基准的实验显示，特别是在低资源设置以及词级和字符级污染下，性能有显著提升。该框架解决了对抗训练与参数高效适应之间较少被探索的交集，无需引入额外参数或仅需适度的计算开销，使得在数据稀缺和语言变异性常共存的现实场景中，PEFT部署更加可靠。

英文摘要

Parameter-Efficient Fine-Tuning (PEFT) has become essential for adapting foundation models to downstream NLP tasks. However, current PEFT methods often struggle with robustness to noise and performance degradation on limited training data. We propose SDBN (Small Data Big Noise), a unified framework that brings adversarial training to PEFT (a combination that remains less studied in the PEFT setting despite its complementary strengths) to enhance model robustness and generalization, outperforming alternative approaches. We also introduce two variants of the method that use discrete uncertainty sets: SDBN-h, which enumerates character-level edits and selects worst-case variants using gradients, and SDBN-p, which uses LLM-generated variants for robust optimization in generative tasks. Experiments across multiple benchmarks reveal substantial improvements, particularly in low-resource settings and under both word-level and character-level corruptions. This framework addresses the less explored intersection of adversarial training and parameter-efficient adaptation, without introducing additional parameters and with only modest computational overhead, making PEFT deployments more reliable in real-world scenarios where data scarcity and linguistic variability often coexist.
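To make the idea of a discrete uncertainty set of character-level edits concrete, here is a minimal sketch. It is not the SDBN-h implementation: it enumerates only single-character deletions and adjacent swaps, and it picks the worst case by exhaustively evaluating a loss function rather than by gradients as the abstract describes. The names `char_edits`, `worst_case_variant`, and `toy_loss` are hypothetical.

```python
def char_edits(word):
    """Enumerate single-character deletions and adjacent-character swaps."""
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])            # delete character i
    for i in range(len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        variants.add(swapped)                            # swap i and i + 1
    variants.discard(word)  # a swap of equal letters reproduces the input
    return sorted(variants)

def worst_case_variant(word, loss):
    """Pick the variant with the highest loss (exhaustive, no gradients)."""
    return max(char_edits(word), key=loss)

def toy_loss(text):
    # Hypothetical stand-in for a model loss: penalize a length change
    # and a changed first letter.
    return abs(len(text) - 3) + (text[0] != "c")

variants = char_edits("cat")
worst = worst_case_variant("cat", toy_loss)
```

In adversarial training, the selected variant would be fed back as a training input; a gradient-based selector avoids evaluating the loss on every candidate.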


2606.10607 2026-06-10 cs.LG cs.AI cs.CL 新提交

Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

因果集成智能体：基于LLM引导的专家重加权的层次化因果发现

Xinyu Li, Yuanyuan Wang, Haoxuan Li, Chuan Zhou, Erdun Gao, Bo Han, Tongliang Liu, Kun Zhang, Howard Bondell, Mingming Gong

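Affiliations: The University of Melbourne; MBZUAI; Peking University; Adelaide University; Hong Kong Baptist University; The University of Sydney; Carnegie Mellon University

AI summary: Proposes the Causal Ensemble Agent (CEA) framework, which aggregates the results of statistical causal discovery at different graph levels via linear opinion pooling and uses a large language model (LLM) as a meta-referee to dynamically reweight experts near the decision boundary, yielding a more accurate and complete causal graph.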

AI中文摘要

因果发现旨在从观测数据中揭示因果结构，这对现实世界决策至关重要。然而，不同的因果发现算法可能产生相互冲突的结果，使得识别准确的因果图复杂化。传统方法依赖数值和统计假设，往往忽略丰富的领域特定信息（如特征描述），而这些信息也有助于结构学习。尽管近期研究探索使用大语言模型（LLM）通过直接查询推断因果关系，但由于缺乏与实际数据的一致性，此类方法可能不可靠。为解决这些限制，我们提出因果集成智能体（CEA），一种新颖框架，通过线性意见池聚合来自不同图层次的统计发现专家的结构见解，并在聚合置信度接近决策边界时，使用LLM作为元裁判动态重加权专家，从而组合出更完善、更完整的因果图。在合成和真实数据集上的大量实验表明，CEA在广泛的因果发现方法中实现了最强的整体性能，突显了在因果发现中使用LLM进行元分析的有效性。

英文摘要

Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, different causal discovery algorithms can produce divergent results that conflict with each other, complicating the identification of accurate causal graphs. Traditional approaches rely on numerical values and statistical assumptions, often ignoring rich domain-specific information, such as feature descriptions, which could also help structure learning. While recent works explore using Large Language Models (LLMs) to infer causal relations via direct queries, such methods can be unreliable due to a lack of alignment with the actual data. To address these limitations, we propose Causal Ensemble Agent (CEA), a novel framework that aggregates structural insights from statistical discovery experts across different graph levels via linear opinion pooling, and uses an LLM as a meta-referee to dynamically reweight experts when the aggregated confidence is close to the decision boundary, thereby composing an improved and more complete causal graph. Extensive experiments on both synthetic and real-world datasets demonstrate that CEA achieves the strongest overall performance across a wide range of causal discovery methods, highlighting the effectiveness of using LLMs for meta-analysis in causal discovery.
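Linear opinion pooling with boundary-triggered reweighting can be sketched in a few lines. This is an illustration under assumptions, not CEA's implementation: the function names, the 0.5 threshold, the 0.1 margin, and the `referee` callable (a stand-in for the LLM meta-referee) are all hypothetical.

```python
def linear_opinion_pool(probs, weights):
    """Weighted average of expert probabilities for one candidate edge."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total

def decide_edge(probs, weights, referee=None, threshold=0.5, margin=0.1):
    """Pool expert opinions; consult the referee only near the boundary."""
    pooled = linear_opinion_pool(probs, weights)
    if referee is not None and abs(pooled - threshold) < margin:
        weights = referee(probs, weights)   # hypothetical stand-in for the LLM
        pooled = linear_opinion_pool(probs, weights)
    return pooled >= threshold, pooled

expert_probs = [0.9, 0.4, 0.3]     # three experts' confidence that edge X -> Y exists
uniform = [1.0, 1.0, 1.0]
accepted, pooled = decide_edge(expert_probs, uniform)
```

With uniform weights the pooled confidence is about 0.53, which falls inside the margin, so a referee would be consulted for this edge; a clearly confident pool is decided without it.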


2606.10602 2026-06-10 cs.CV 新提交

Globally Localizing Lunar Rover in Pixels via Graph Alignment

通过图对齐在像素级全局定位月球车

Mao Chen, Xu Yang, Chuankai Liu, Xiangkai Zhang, Xiaoxue Wang, Zheng Bo, Zuoyu Zhang, Zhiyong Liu

Affiliations: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Aerospace Control Center; Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences

AI summary: Proposes the WARG framework, which uses unified graph learning and reprojected graph matching to address inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift in cross-view lunar rover localization, achieving a localization error of 1.68 m on real YuTu-2 data.

AI中文摘要

精确的月球车定位是自主月球探测的前提，然而全球导航卫星系统（GNSS）信号的缺失以及局部定位方法的累积漂移严重限制了远程任务。跨视角定位通过匹配月球车视角和卫星视角图像提供了一种有前景的无漂移全局解决方案。然而，月球环境为对应点对齐带来了独特挑战，包括实体间纠缠、视角间差异以及仿真到真实的域偏移。为了解决这些挑战，我们提出了重投影图扭曲对齐（WARG），一个利用统一图学习和重投影图匹配实现鲁棒跨视角对齐的框架。在合成的LuSNAR数据集上预训练后，WARG的平均测试误差为0.32米，并在合成月球南极区域展现出鲁棒的零样本泛化能力，误差为3.63米。更重要的是，在玉兔二号月球车的真实数据上验证时，WARG在100米×100米的搜索区域内实现了1.68米的定位误差，相当于在空间分辨率为1.40米/像素的低分辨率卫星图像中达到近像素级精度。除了精度，WARG计算高效，仅含1.56M参数，是之前轻量级模型的16.12%，在NVIDIA RTX A6000 GPU上运行频率为5.49 Hz，接近GNSS级更新频率。最后，我们观察到WARG通过跨视角定位学习自然发展出低级空间感知能力，包括语义分割和结构推理，突显其作为以最小标注成本实现空间智能的有前景范式的潜力。源代码见：https://github.com/maochen-casia/warg。

英文摘要

Precise rover localization is a prerequisite for autonomous lunar exploration, yet the absence of Global Navigation Satellite System (GNSS) signals and the cumulative drift of local localization methods severely constrain long-range missions. Cross-view localization provides a promising drift-free global solution by matching rover-view and satellite-view imagery. However, the lunar environment poses unique challenges for correspondence alignment, including inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift. To address these challenges, we propose Warped Alignment of Reprojected Graphs (WARG), a framework that leverages unified graph learning and reprojected graph matching for robust cross-view alignment. Pretrained on the synthetic LuSNAR dataset, WARG achieves an average test error of 0.32 m and demonstrates robust zero-shot generalization to the synthetic lunar south pole region with an error of 3.63 m. More importantly, when validated on real-world data from the YuTu-2 rover, WARG achieves a localization error of 1.68 m within a 100 m x 100 m search area, corresponding to nearly one-pixel precision in low-resolution satellite imagery with a spatial resolution of 1.40 m/pixel. Beyond accuracy, WARG is computationally efficient, containing only 1.56M parameters, corresponding to 16.12% of previous lightweight models, and operating at 5.49 Hz on an NVIDIA RTX A6000 GPU, approaching GNSS-level update frequency. Finally, we observe that WARG naturally develops low-level spatial awareness, including semantic segmentation and structural reasoning, through cross-view localization learning, highlighting its potential as a promising paradigm for spatial intelligence with minimal annotation cost. The source code is available at https://github.com/maochen-casia/warg.
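The "nearly one-pixel" claim and the efficiency figures follow from simple arithmetic on the numbers quoted in the abstract. The short calculation below only restates those numbers; the implied size of the earlier lightweight model is derived here and is not stated in the abstract.

```python
error_m = 1.68                 # localization error on YuTu-2 data (from the abstract)
resolution_m_per_px = 1.40     # satellite image resolution (from the abstract)
error_px = error_m / resolution_m_per_px

params_m = 1.56                # WARG parameters, in millions (from the abstract)
fraction_of_baseline = 0.1612  # 16.12% of previous lightweight models
baseline_params_m = params_m / fraction_of_baseline  # implied, not stated

update_period_s = 1 / 5.49     # seconds per localization at 5.49 Hz
```

The error is about 1.2 pixels, the implied baseline is roughly 9.68M parameters, and each update takes about 0.18 s.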


2606.10594 2026-06-10 cs.CV 新提交

Segment and Select: Vision-Language Segmentation in 3D Scenarios

Segment and Select: 3D场景中的视觉-语言分割

Yulin Chen, Zhihang Zhong, Yuenan Hou

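Affiliations: Shanghai AI Laboratory; University of Science and Technology of China; Shanghai Jiaotong University

AI summary: Proposes the SEGA3D paradigm, which performs fine-grained, language-instructed segmentation in 3D scenes using a mask candidate generator, a large language model, and a semantic-spatial selector, improving mIoU by 8.3 on ScanNet and 5.3 on Matterport3D.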

Comments: The core idea is to reformulate 3D vision-language segmentation as the segment-and-select paradigm (free from the superpoint dependency)

AI中文摘要

3D视觉-语言分割旨在根据语言指令和视觉观察在3D场景中分割目标对象。现有技术严重依赖粗糙的超点表示来降低计算复杂度，这导致分割质量差和对象边界混乱。本文提出用于3D视觉-语言分割的SEGment-And-Select（SEGA3D）范式，该范式直接操作于细粒度视觉信息，无需依赖超点。具体而言，我们首先利用掩码候选生成器提供细粒度的类别掩码候选，显著提高候选掩码相对于超点对应物的质量。然后，利用大语言模型（LLM）基于语言描述和视觉特征生成语义和空间信息。LLM输出和视觉特征被输入到语义-空间选择器（SSS）以产生排名最高的掩码候选。最后，设计循环验证模块（LVM）从选定的候选掩码中产生分割掩码。我们的SEGA3D在ScanRefer、ScanNet和Matterport3D基准测试中取得了有竞争力的性能。值得注意的是，我们的SEGA3D在ScanNet和Matterport3D上分别超过最佳性能对手8.3 mIoU和5.3 mIoU。代码将在发表后提供。

英文摘要

3D vision-language segmentation aims to segment target objects in 3D scenarios according to linguistic instructions and visual observations. Prior art heavily relies on the coarse superpoint representation to reduce computational complexity, which suffers from poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation that directly operates on the fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over the superpoint counterparts. Then, a Large Language Model (LLM) is utilized to generate the semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Finally, the Loopback Verification Module (LVM) is designed to yield the segmentation mask from the selected candidate masks. Our SEGA3D attains competitive performance on ScanRefer, ScanNet and Matterport3D benchmarks. Notably, our SEGA3D surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Code will be available upon publication.


2606.10592 2026-06-10 cs.LG 新提交

Dirichlet-Guided Group Forecasting for Alleviating Over-smoothing in Time Series Forecasting

Dirichlet引导的群体预测：缓解时间序列预测中的过度平滑

Xingyu Zhang, Jingyao Wang, Xin Yu, Zeen Song, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

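Affiliations: University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences

AI summary: To address over-smoothing in time series forecasting, proposes the Dirichlet-Guided Group Forecasting (DGF) framework, which explicitly models multiple mode-conditioned predictive distributions and the uncertainty over their selection probabilities, and uses Dirichlet-guided hierarchical sampling and reward-based optimization to improve forecast accuracy, diversity, and dynamical consistency.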

AI中文摘要

时间序列预测常常遭受过度平滑的影响，尤其是当未来动态是多模态时。预测可能遵循观测未来的粗略趋势，但未能保留定义合理动态演变的急剧变化、振荡、转折点和制度转换。在这项工作中，我们从潜在动态模式压缩的角度重新审视过度平滑：在部分观测和单一实现监督下，多个可能的未来模式可能在预测过程中被削弱、合并或平均。基于这一观点，我们提出了Dirichlet引导的群体预测（DGF），一种保持模式的预测框架，它显式建模多个模式条件预测分布及其选择概率的不确定性。DGF使用Dirichlet引导的分层采样机制和基于奖励的优化，以鼓励预测准确、动态一致且模式区分。在真实世界预测基准上的大量实验表明，DGF减少了过度平滑，同时提高了预测准确性、多样性和动态一致性。

英文摘要

Time series forecasting often suffers from over-smoothing, especially when future dynamics are multi-modal. Forecasts may follow the coarse trend of the observed future, but fail to preserve sharp changes, oscillations, turning points, and regime transitions that define plausible dynamic evolution. In this work, we revisit over-smoothing from the perspective of latent dynamical mode compression: under partial observation and single-realization supervision, multiple plausible future modes can be weakened, merged, or averaged during forecasting. Based on this view, we propose Dirichlet-Guided Group Forecasting (DGF), a mode-preserving forecasting framework that explicitly models multiple mode-conditioned predictive distributions and uncertainty over their selection probabilities. DGF uses a Dirichlet-guided hierarchical sampling mechanism and reward-based optimization to encourage forecasts that are accurate, dynamically consistent, and mode-distinct. Extensive experiments on real-world forecasting benchmarks show that DGF reduces over-smoothing while improving forecasting accuracy, diversity, and dynamical consistency.
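A generic version of hierarchical sampling over modes can be written with the standard library alone. The sketch below is an assumption-based illustration, not the DGF model: the concentration parameters, the Gaussian mode-conditioned distributions, and the names `sample_dirichlet` and `sample_forecast` are hypothetical. It draws mode probabilities from a Dirichlet, then a mode, then a forecast value from that mode.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw mode probabilities from Dirichlet(alphas) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_forecast(alphas, mode_means, mode_std, rng):
    """Hierarchical draw: mode probabilities, then a mode, then a forecast."""
    probs = sample_dirichlet(alphas, rng)
    mode = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    value = rng.gauss(mode_means[mode], mode_std)
    return mode, value, probs

rng = random.Random(0)
mode_means = [-1.0, 0.0, 3.0]   # one toy forecast level per mode
group = [sample_forecast([2.0, 2.0, 1.0], mode_means, 0.1, rng) for _ in range(8)]
```

Each member of `group` stays close to one mode's level instead of the average of all modes, which is the behavior that averaging-based point forecasts lose.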


2606.10591 2026-06-10 cs.SD 新提交

ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding

ContextCodec: 面向内容的超低比特率语音编码上下文引导

Chengbin Liang, Wenqi Guo, Hao Cao, Zhijin Qin

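Affiliations: Department of Electronic Engineering, Tsinghua University; Department of Automation, Tsinghua University

AI summary: Proposes ContextCodec, which decouples acoustic details from content context with a dual-branch encoder and aligns context features with phoneme indices using a CLIP-style contrastive loss, achieving a good balance between quality and intelligibility at 500 bps.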

Comments: Accepted at Interspeech 2026. 6 pages, 2 figures, 5 tables

AI中文摘要

神经语音编解码器实现了低比特率语音通信，但在超低比特率（< 1000 bps）下保持感知质量和可懂度具有挑战性。现有设计通常优先考虑声学细节，在严格的比特率约束下留给核心语言信息的容量有限。为了解决这个问题，我们提出了ContextCodec，一种传输面向内容的上下文特征以显式指导重建的编解码器。ContextCodec采用双分支编码器，将声学细节与面向内容的上下文解耦。上下文分支通过CLIP风格的对比损失进行训练，该损失将上下文特征与音素索引对齐，减少副语言泄漏。在解码过程中，这些特征被注入每个解码阶段以进行显式指导。此外，我们引入了一个轻量级的自回归潜在细化模块。实验表明，在500 bps下实现了强大的质量-可懂度权衡，在典型移动CPU上的RTF为0.4886。

英文摘要

Neural speech codecs enable low-bitrate speech communication, yet at ultra-low bitrates (< 1000 bps) preserving perceptual quality and intelligibility is challenging. Existing designs often prioritize acoustic details, leaving limited capacity for the core linguistic message under tight bitrate constraints. To address this, we propose ContextCodec, a codec that transmits content-focused context features to explicitly guide reconstruction. ContextCodec adopts a dual-branch encoder that decouples acoustic details from content-focused context. The context branch is trained with a CLIP-style contrastive loss that aligns context features with phoneme indices, reducing paralinguistic leakage. During decoding, these features are injected at each decoding stage for explicit guidance. In addition, we introduce a lightweight autoregressive latent refinement module. Experiments show a strong quality-intelligibility trade-off down to 500 bps, with an RTF of 0.4886 on a typical mobile CPU.


2606.10582 2026-06-10 cs.LG cs.AI 新提交

Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching

与陌生人共绘：种群规模驱动涌现素描中的零-shot互懂性

Jooyeon Kim

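Affiliations: Graduate School of Artificial Intelligence, UNIST

AI summary: Using a visual sketching game, the study finds that scaling up the training population substantially improves zero-shot mutual intelligibility between independently trained groups; the mechanism is increased in-group variation and reduced cross-group variation, with structural convergence ultimately achieved through perceptual grounding.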

AI中文摘要

涌现通信中的泛化主要关注新颖输入或语言结构，但智能体与来自严格不相交社区的陌生人进行通信的能力仍相对未被探索。在这项工作中，我们将这种能力形式化为\textit{零-shot互懂性（ZMI）}：独立训练群体之间无需事先接触即可成功通信。利用涌现素描（智能体通过绘制一组笔画进行通信）作为视觉接地模态，我们发现扩大训练种群规模显著提高了独立群体间的ZMI。关键的是，随着种群规模扩大，群体内通信变异增加，防止了同质化共适应。同时，群体间变异减少，表明向某种普遍性的结构收敛。进一步分析揭示，这种普遍性是通过感知接地实现的：扩大后的种群越来越将其涌现素描锚定在目标图像的客观视觉相似性上。这些结果共同将ZMI定位为涌现通信中一个独特的泛化轴，并提出了实现社会可互操作人工智能体的途径。

英文摘要

Generalization in emergent communication has largely focused on novel inputs or linguistic structures, yet the capacity for agents to communicate with strangers from strictly disjoint communities remains relatively unexplored. In this work, we formalize this capability as \textit{zero-shot mutual intelligibility (ZMI)}: successful communication between independently trained populations without prior exposure. Leveraging emergent sketching -- in which agents communicate through sets of drawn strokes -- as a visually grounded modality, we find that scaling the training population substantially improves ZMI across independent groups. Crucially, as we scale the population size, in-group communicative variation increases, preventing co-adaptation into homogeneity. Simultaneously, cross-group variation decreases, indicating a structural convergence toward a certain type of universality. Further analysis reveals that this universality is achieved through perceptual grounding: scaled populations increasingly anchor their emergent sketches on the objective visual resemblance of the target images. Together, these results position ZMI as a distinct axis of generalization in emergent communication and suggest a route toward socially interoperable artificial agents.


2606.10581 2026-06-10 cs.CL cs.SD eess.AS 新提交

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

ParaBridge: 弥合语音语言模型中的副语言感知与对话行为

Yuxiang Wang, Qinke Ni, Shengbo Cai, Wan Lin, Liqiang Zhang, Zhizheng Wu

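Affiliations: The Chinese University of Hong Kong, Shenzhen; Tencent Hunyuan; Shenzhen Loop Area Institute; Amphion Technology Co., Ltd.; Tsinghua University

AI summary: Proposes ParaBridge, an on-policy self-distillation method that turns an inference-time paralinguistic instruction scaffold into stable model behavior without human labels or external rewards, substantially improving the responsiveness of speech language models to paralinguistic cues.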

AI中文摘要

语音携带的信息远不止文字：孩子的声音、恐惧的语气或嘈杂的背景都应引导一个足够胜任的语音对话助手给出不同的回复。当前的语音语言模型（SLM）能够识别此类副语言线索，但在开放域对话中常常忽略它们。我们观察到，在推理阶段使用简单的副语言指令支架可以缩小这种感知-行为差距，表明相关线索已潜在于模型中。然而，这种支架在多轮上下文和竞争指令下仍然脆弱。因此，我们提出\textbf{ParaBridge}，一种在线自我蒸馏方法，将脆弱的推理时支架转化为稳定的模型行为。在训练过程中，支架仅作为临时的特权视图；无支架模型自行生成回复，而支架视图沿其轨迹提供密集的全词汇下一词目标。这种监督教会了模型在非词汇线索应影响回复时的时机，无需策划的对话、人工标签或外部奖励模型。在Qwen3-Omni-thinking上，ParaBridge将无支架的VoxSafeBench SAR从14.6\%提升至40.3\%，并将EchoMind平均评分从3.27提升至3.92。它还保留了通用能力，MMAU-Pro、VoiceBench和GPQA均与原始模型相差在0.4分以内。在训练分布之外，ParaBridge泛化到未见过的副语言线索，从面向安全的训练迁移到共情导向的对话，并在不同的SLM骨干上有效。

英文摘要

Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose \textbf{ParaBridge}, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from $14.6\%$ to $40.3\%$ and improves EchoMind average rating from $3.27$ to $3.92$. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within $0.4$ points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.


2606.10580 2026-06-10 cs.LG cs.AI 新提交

Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates

蒙特卡洛乐观策略迭代的收敛性：超越均匀状态-动作更新

Octave Oliviers, Glenn Vinnicombe

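Affiliations: Department of Engineering, University of Cambridge

AI summary: Proves that initial-visit Monte Carlo optimistic policy iteration converges to optimality when updates are uniform over the actions within each state, relaxing the conventional requirement of uniform state-action updates; the proof uses mean-field dynamics and a lock-in argument.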

AI中文摘要

蒙特卡洛乐观策略迭代（MC-O-PI）的渐近行为是一个长期悬而未决的问题。当环境模型未知时（这在实践中很常见），唯一已知的保证收敛到最优性的条件是不切实际的。在其标准形式中，该条件要求用于策略评估的回合在整个状态-动作空间上均匀初始化。本文严格放宽了这一要求。具体来说，我们证明即使更新仅在每个状态内的动作上均匀，首次访问MC-O-PI也能收敛到最优性。这允许回合以任意频率从不同状态开始；当状态空间很大或未知但每个状态中的动作空间可管理时，这是一种现实的实现。证明脱离了Tsitsiklis的经典分析，其中心交换性论证在状态以不同频率更新时不再适用。相反，我们首先证明当更新在每个状态的动作上均匀时，MC-O-PI的均场动力学生成单调改进的策略，然后通过扩展组合稳定性-ODE方法的锁定论证，证明噪声不能持续阻止这种改进。这种方法为一般研究乐观策略迭代算法提供了一种新途径。

英文摘要

The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the environment is unknown, as is common in practice, the only known condition that guarantees convergence to optimality is impractical. In its canonical form, this condition requires that the episodes used for policy evaluation be initialised uniformly over the entire state-action space. This paper strictly relaxes that requirement. Specifically, we prove that initial-visit MC-O-PI converges to optimality even when updates are uniform only over the actions within each state. This allows episodes to start in different states at arbitrary frequencies, a realistic implementation when the state space is large or unknown but the action space in each state is manageable. The proof departs from the classical analysis of Tsitsiklis, whose central commutativity argument no longer applies when states are updated at different frequencies. Instead, we first show that the mean-field dynamics of MC-O-PI generate monotonically improving policies when updates are uniform over the actions in each state, and then prove that noise cannot consistently prevent this improvement by extending the lock-in argument of the combined stability-ODE method. This approach suggests a new way to study optimistic policy-iteration algorithms in general.


2606.10579 2026-06-10 cs.RO cs.SY eess.SY 新提交

LieIPM: Lie Group Interior Point Method for Direct Trajectory Optimization of Rigid Bodies

LieIPM：用于刚体直接轨迹优化的李群内点法

Sangli Teng, Ruiqi Zhang, Tzu-Yuan Lin, William A Clark, Mark Mueller, Ram Vasudevan, Maani Ghaffari, Koushil Sreenath

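Affiliations: University of California, Berkeley; MIT; Ohio University; University of Michigan, Ann Arbor

AI summary: Proposes LieIPM, a constrained trajectory optimization framework built on Lie group structure, which uses a second-order rigid body model and variational integrators to achieve singularity-free, fast-converging Newton-type updates.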

AI中文摘要

设计刚体的动态可行轨迹是机器人学中的一个基本问题。虽然直接方法被广泛使用，但现有的约束优化器通常在欧几里得空间中运行，忽略了刚体运动的流形结构。这种不匹配可能引入奇异性或导致优化问题病态。为了弥补这一差距，我们开发了一个结构感知框架，直接在矩阵李群上进行约束轨迹优化。我们的方法基于利用李群结构的二阶刚体模型，这使得在保持底层几何结构的同时实现高效的牛顿型更新成为可能。在此模型基础上，我们提出了一种线搜索李群内点法（LieIPM）来处理流形上的约束。我们使用李群变分积分器实例化该框架用于刚体运动规划，并推导出利用群对称性的闭式内蕴导数。LieIPM通过构造保留了旋转运动的拓扑结构，避免了奇异性。数值结果表明，与通用求解器和结构利用最优控制方法相比，该方法具有更强的鲁棒性和更快的收敛速度。

英文摘要

Designing dynamically feasible trajectories for rigid bodies is a fundamental problem in robotics. While direct methods are widely used, existing constrained optimizers typically operate in Euclidean space and ignore the manifold structure of rigid body motions. This mismatch may introduce singularities or lead to poorly conditioned optimization problems. To bridge this gap, we develop a structure-aware framework for constrained trajectory optimization directly on matrix Lie groups. Our approach is based on second-order rigid body models that exploit Lie group structure, enabling efficient Newton-type updates while preserving the underlying geometry. Building on this model, we propose a line-search Lie Group Interior Point Method (LieIPM) to handle constraints on the manifold. We instantiate the framework for rigid body motion planning using Lie group variational integrators and derive closed-form intrinsic derivatives that exploit group symmetries. LieIPM preserves the topology of rotational motion by construction and avoids singularities. Numerical results demonstrate superior robustness and faster convergence compared to general-purpose solvers and structure-exploiting optimal control methods.


2606.10577 2026-06-10 cs.RO new submission

AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

AgenticNav：零样本视觉与语言导航作为工具调用框架

Yijian Li, Changze Li, Hantian Shi, Jiaying Luo, Jiyuan Cai, Ming Yang, Tong Qin

Affiliations: Shanghai Jiao Tong University; Huawei Technologies Ltd

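AI summary: Proposes AgenticNav, which exposes action, depth, and memory to the VLM as callable tools to enable zero-shot navigation in continuous environments, reaching state-of-the-art performance among zero-shot methods on the R2R-CE benchmark.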


AI Chinese abstract

连续环境中的零样本视觉与语言导航（VLN-CE）最近随着大型视觉语言模型（VLM）的出现而变得可行。然而，现有方法通常依赖学习到的航点预测器来提出可导航动作，这严重限制了模型的动作空间，并且未能有效利用深度输入。此外，记忆通常通过累积包含大量无关上下文的冗长文本或视觉历史，或通过检索跨回合经验来处理，这削弱了零样本设置。在本文中，我们将零样本VLN-CE重新思考为VLM与环境之间的代理接口，并提出了AgenticNav，这是一个轻量级导航框架，将动作、深度和记忆暴露为可调用的工具。动作工具允许VLM直接选择RGB观测中的目标像素，并将其转换为可执行运动，而不是从预测的航点中选择。深度通过按需像素深度工具暴露，使VLM能够在需要的地方请求精确的度量距离。对于记忆，AgenticNav提供了一个紧凑的地图图像，总结历史轨迹，并配有一个召回工具，允许VLM有选择地重新访问过去的视觉观测，而不会使提示上下文过载。在R2R-CE基准上，AgenticNav在相同VLM骨干下，在零样本方法中建立了新的最先进（SOTA）性能。真实世界验证进一步突显了其相比先前方法的零样本泛化能力。消融实验表明，我们的动作工具设计优于传统航点预测器，并且深度工具和代理记忆进一步促进了导航性能。

English abstract

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that the depth tool and agentic memory further contribute to navigation performance.
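One plausible way to turn a selected pixel plus a metric depth reading into a motion command is pinhole back-projection, sketched below. This is an illustration under stated assumptions, not AgenticNav's actual tool: the function name `pixel_to_motion`, the intrinsics `fx`, `fy`, `cx`, `cy`, the use of depth measured along the optical axis, and the turn-then-advance output are all hypothetical.

```python
import math

def pixel_to_motion(u, v, depth_m, fx, fy, cx, cy):
    """Convert a target pixel and its depth into a turn angle and a distance.

    Assumptions (not from the paper): a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy) in pixels, depth_m measured along
    the optical axis in metres, camera x pointing right and z pointing
    forward, and a robot that turns in place and then drives straight.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    # Back-project the pixel into the camera frame.
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    z = depth_m
    # Project onto the ground plane: positive angle means a turn to the right.
    turn_rad = math.atan2(x, z)
    forward_m = math.hypot(x, z)
    return {"point": (x, y, z), "turn_rad": turn_rad, "forward_m": forward_m}

if __name__ == "__main__":
    # Hypothetical 640x480 camera; the pixel and depth values are made up.
    cmd = pixel_to_motion(u=420, v=300, depth_m=2.5,
                          fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(cmd["turn_rad"], cmd["forward_m"])
```

Querying depth only at the chosen pixel keeps the interface small: the model needs a single metric distance per action rather than a full depth map in its prompt.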
