arXivDaily arXiv daily academic digest, updated Monday through Friday
2606.27377 2026-06-26 cs.CV cs.CL cs.LG New submission

DanceOPD: On-Policy Generative Field Distillation

Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong, Lixue Gong, Yongyuan Liang, Meng Chu, Leigang Qu, Lingdong Kong, Wei Liu, Tat-Seng Chua

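Affiliations * ByteDance Seed; NUS (National University of Singapore); UMD (University of Maryland); HKUST (Hong Kong University of Science and Technology)

AI summary Proposes the DanceOPD framework, which routes each sample to one capability field, queries a low-noise student-induced state, and trains flow-matching models with a velocity MSE loss, composing text-to-image, local editing, and global editing capabilities and strengthening target capabilities while preserving generation quality.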

Comments Technical Report; 39 pages, 13 figures, 9 tables; Project Page at https://danceopd.github.io/

Abstract

Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.
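
The training signal described in this abstract (route each sample to one capability field, query a state reached by the student's own rollout, and regress velocities with MSE) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the 1-D fields, the Euler rollout, the names `rollout` and `on_policy_velocity_mse`, and the convention that t = 0 is noise and t close to 1 is low-noise. It is not the authors' implementation.

```python
def rollout(velocity, x, t_end, steps=10):
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 up to t_end."""
    dt = t_end / steps
    t = 0.0
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def on_policy_velocity_mse(student, fields, batch, t_query):
    """Mean squared velocity gap, measured on states the student itself reaches."""
    total = 0.0
    for capability, x0 in batch:
        teacher = fields[capability]         # route the sample to one capability field
        x_t = rollout(student, x0, t_query)  # student-induced state at time t_query
        gap = student(x_t, t_query) - teacher(x_t, t_query)
        total += gap * gap
    return total / len(batch)


# Hypothetical 1-D capability fields and student (not taken from the paper).
fields = {
    "t2i": lambda x, t: -x,
    "edit": lambda x, t: 1.0 - x,
}
student = lambda x, t: -x  # currently agrees with the "t2i" field only

batch = [("t2i", 0.5), ("edit", 0.5)]
loss = on_policy_velocity_mse(student, fields, batch, t_query=0.9)
print(loss)  # close to 0.5: zero gap on "t2i", unit gap on "edit"
```

In an actual training loop the student would be a neural network and the loss would be minimized by gradient descent; the sketch only shows where the on-policy state enters the objective.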

2606.27374 2026-06-26 cs.RO cs.CV New submission

World Action Models Enable Continual Imitation Learning with Recurrent Generative Replays

Manish Kumar Govind, Dominick Reilly, Smit Patel, Hieu Le, Srijan Das

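Affiliations * University of North Carolina at Charlotte

AI summary Proposes Recurrent Generative Replay (REGEN), a framework that uses World Action Models (WAMs) to generate pseudo-replay trajectories so that robot policies can rehearse previous tasks without storing the original demonstrations; in simulation and real-world experiments it reduces catastrophic forgetting by up to 50% and approaches the performance of replay with real data.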

Abstract

Going beyond predicting robot actions, World Action Models (WAMs) can also generate future visual observations. We build on this generative capability to propose Recurrent Generative Replay (REGEN), a continual imitation learning framework that synthesizes pseudo-replay trajectories, enabling a robot policy to rehearse previously learned tasks without storing their original human demonstrations. During continual adaptation, REGEN recursively queries the WAM to synthesize pseudo-replay trajectories conditioned only on prior task instructions and current-task observations. Experiments in both simulation and real-world manipulation settings show that REGEN reduces catastrophic forgetting by up to 50% relative to sequential fine-tuning, while approaching the performance of privileged experience replay methods that require access to real replay data. Finally, we analyze the factors limiting generated replay, identifying long-horizon visual degradation and action-observation inconsistency as the primary bottlenecks. Our results establish WAMs as a promising foundation for continual robot learning without stored demonstrations.

2606.27373 2026-06-26 cs.CV New submission

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

Shravan Venkatraman, Ritesh Thawkar, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Salman Khan, Fahad Khan

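Affiliations * Mohamed bin Zayed University of Artificial Intelligence; Aalto University; Australian National University; Linköping University

AI summary Proposes the VISE framework, which directly regularizes the visual conditioning policy through geometric-invariance and semantic-invariance rewards, addressing visual under-conditioning in self-evolving LMMs and improving vision-language understanding in an unsupervised setting.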

Comments ECCV 2026

Abstract

Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensuring the decoder attends to visual content, relying instead on statistical language priors to produce self-consistent outputs. This leads to a persistent failure mode we term visual under-conditioning, where the decoder relies on language priors rather than the image during generation, manifesting as insufficient attention to visual tokens. As a result, current self-evolving LMMs struggle on vision-language understanding tasks such as image captioning and visual question answering. To address this, we propose VISE (Visual Invariance Self-Evolution), a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. VISE operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images. Experiments on 18 benchmarks demonstrate the efficacy of our approach. Using Qwen3-VL-2B as the base model, VISE achieves gains of +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps, reduces object hallucination by 5.0 Chair-I points, and generalizes across four model families and scales. Our code and models are available at https://mbzuai-oryx.github.io/VISE

2606.27372 2026-06-26 cs.CV New submission

DnA: Denoising Attention for Visual Tasks

Ron Campos, Subhajit Maity, Xin Li, Srijan Das, Aritra Dutta

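Affiliations * University of Central Florida; University of North Carolina at Charlotte

AI summary Proposes Denoising Attention (DnA), which uses positive and negative queries to separate relevant from irrelevant features and improve the discriminability of multi-head attention, yielding a 0.8% absolute gain on ImageNet-1K and also applying to video understanding tasks.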

Abstract

The softmax activation in multi-head attention (MHA) is the de facto standard for attention-based models in visual perception tasks. However, standard softmax can produce noisy attention patterns that dilute relevant features and degrade performance. In this paper, we propose Denoising Attention (DnA), in which a positive query first identifies which image features belong to the correct class, and a negative query identifies closely associated but irrelevant image features. DnA then projects these interactions into two distinct subspaces with larger principal angles, promoting subspace separation and improved discriminability. Using a ViT-B backbone, our proposed DnA achieves an absolute gain of 0.8% on ImageNet-1K compared to the baseline. We further show improvements across multiple visual understanding tasks, including video understanding with video transformers (1.8%) and video LLMs (0.5%). Our extensive empirical analyses justify the design choices involving two interacting subspaces and the denoising effect of DnA.

2606.27371 2026-06-26 cs.CV New submission

Don't Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance

Pradhaan S Bhat, Rishubh Parihar, Abhijnya Bhat, R. Venkatesh Babu

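Affiliations * Indian Institute of Science; Stanford University

AI summary Proposes a training-free self-guidance mechanism that disperses internal features during batch generation and applies manifold regularization, mitigating diversity collapse in pretrained flow models at almost no additional inference cost.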

Comments Accepted by ECCV 2026. Project page: https://dont-settle-at-the-mode.github.io/

Abstract

State-of-the-art flow models generate stunning images from text or image prompts. However, they suffer from diversity collapse when generating multiple samples under the same conditioning. Existing methods address this issue via either latent guidance, which has limited effectiveness, or sample selection, which relies on external reward models that incur significant inference-time overhead. In this work, we introduce an efficient, training-free self-guidance mechanism to mitigate diversity collapse without requiring additional reward models. Specifically, we disperse the internal features of the flow model during batch generation with feature self-guidance. Further, to keep the features close to the manifold, we introduce a manifold regularization step that projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions. Our method integrates seamlessly as a plug-and-play module into pretrained flow models, adding only a marginal inference cost. Experiments demonstrate significant improvements in diversity while preserving fidelity across several conditional flow models, including multi-step and few-step text-to-image, depth-to-image, and reference image generation.

2606.27369 2026-06-26 cs.LG New submission

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, Kun Zhou, Tongtong Liang, Zhewei Yao, Yi-An Ma, Yuxiong He

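Affiliations * University of California, San Diego; Snowflake AI Research

AI summary Proposes the RiVER framework, which uses calibrated continuous reward signals to train LLMs on tasks without ground-truth solutions, with improvements on both AtCoder and exact-solution benchmarks.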

Abstract

Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on the Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average improvement of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.
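
The scale-dominance problem named in this abstract can be made concrete with a small sketch. The rank normalization below is a hypothetical stand-in chosen for illustration, not RiVER's reward shaping (which, per the abstract, also emphasizes top-ranked solvers and bounds feedback for other valid solutions); the function names and the scores are assumptions.

```python
def rank_rewards(scores):
    """Map one test instance's raw scores to [0, 1] by within-instance rank."""
    n = len(scores)
    if n < 2:
        return [0.0] * n
    rewards = []
    for i, s in enumerate(scores):
        beaten = sum(1 for other in scores if other < s)
        tied = sum(1 for j, other in enumerate(scores) if j != i and other == s)
        rewards.append((beaten + 0.5 * tied) / (n - 1))
    return rewards


def mean_per_candidate(per_instance):
    """Average each candidate's reward over all test instances."""
    n_candidates = len(per_instance[0])
    return [sum(inst[c] for inst in per_instance) / len(per_instance)
            for c in range(n_candidates)]


# Three sampled solutions scored on two test instances with very different scales.
instance_a = [10.0, 12.0, 11.0]
instance_b = [900000.0, 100000.0, 500000.0]

raw = mean_per_candidate([instance_a, instance_b])
calibrated = mean_per_candidate([rank_rewards(instance_a), rank_rewards(instance_b)])
print(raw)         # instance_b alone decides the ordering
print(calibrated)  # [0.5, 0.5, 0.5]
```

With raw scores, the first solution looks far better only because the second instance has larger magnitudes; after per-instance ranking, a win on either instance counts equally.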

2606.27353 2026-06-26 cs.RO New submission

Continual Robot Policy Learning via Variational Neural Dynamics

Jiaxu Xing, Zhiyuan Zhu, Yunfan Ren, Ismail Geles, Yifan Zhai, Rudolf Reiter, Davide Scaramuzza

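Affiliations * Robotics and Perception Group, University of Zurich

AI summary Proposes a continual learning framework that combines a physics prior with a neural residual to model hidden dynamics, and improves policy performance under unknown disturbances through differentiable simulation and online condition inference.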

Abstract

Robots deployed in the real world rarely operate under a single fixed dynamics model: wind changes, payloads vary, batteries drain, contacts shift, and hardware wears. Yet most learning-based controllers are trained once and deployed as if learning were complete. This prevents the robot from using deployment experience to further improve task performance. In this work, we propose a continual learning framework that uses real-world experience to improve robot policies under hidden and recurring dynamics. Our method learns a condition-aware dynamics model from real state-action trajectories by combining an analytical physics prior with a neural residual for unmodeled effects. A recurrent encoder infers the current hidden condition from recent interaction, and this estimate conditions both the residual model and the policy. Policy learning is performed via differentiable simulation using diverse learned dynamics sampled from the latent model. At deployment, these sampled conditions are replaced by conditions inferred online from recent real interaction, allowing the policy to recover recurring dynamics by recognition rather than residual re-fitting. Through extensive simulation studies and real-world experiments, we demonstrate that the framework improves policy performance under diverse unobserved disturbances. On real quadrotor trajectory tracking under changing wind, the policy recovers from recurring disturbances in roughly 1 s, about 5x faster than online residual re-fitting. It also reduces large-disturbance hover and tracking errors by 65.7% and 53.3%, respectively, relative to state-of-the-art online adaptation approaches.

2606.27348 2026-06-26 cs.RO New submission

Bridging Performance and Generalization in Reinforcement Learning for Agile Flight

Jonathan Green, Jiaxu Xing, Nico Messikommer, Angel Romero, Davide Scaramuzza

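Affiliations * Robotics and Perception Group, University of Zurich

AI summary Proposes a framework based on learning-progress-aware task switching and a physically informed track generator that achieves zero-shot generalization in drone racing, improving generalization over existing methods while maintaining competitive racing speeds.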

Abstract

Autonomous drone racing is a fundamentally challenging regime for autonomous aerial robots, requiring time-optimal control while operating under persistent actuation saturation. While reinforcement learning (RL) has achieved human-level performance in this domain, current methods fail to generalize; policies trained on specific environments often crash immediately in unseen configurations. This failure reflects the intrinsic difficulty of zero-shot generalization in agile flight, arising from high-dimensional task variation and the tight coupling between safety and performance at high speeds. Existing approaches that improve generalization impose a substantial cost on flight speed: control policies must significantly degrade performance to achieve even modest levels of generalization. In this work, we propose a framework for zero-shot generalization in agile flight for RL-based drone racing. By combining task-aware switching based on learning progress with a physically informed procedural track generator, the framework produces a fast and robust generalist policy without test-time adaptation. Our method achieves strong zero-shot performance across a wide range of unseen racetracks in the real world, demonstrating a 7.4x improvement in generalization over the state-of-the-art approaches, while maintaining competitive racing speeds. We validate our method's results in both simulation and real-world settings, including a challenging vision-based, end-to-end control setting that operates without explicit state estimation, where all prior approaches fail to generalize.

2606.27347 2026-06-26 cs.CL New submission

Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline

Kirill Solovev, Jana Lasser

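Affiliations * IDea_Lab, University of Graz

AI summary Proposes a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from large unstructured news corpora; it achieves high textual correctness through span-based named-entity recognition, a three-stage linking cascade, and an ontology-constrained mixture-of-experts model, and is validated on Austrian and Polish case studies.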

Comments 34 pages, 17 figures

Abstract

Whether political elites organise into rent-seeking coalitions that capture public resources or civic networks that sustain governance is a central question in comparative politics. Yet observing these complex, informal, and adversarial ties at scale has historically required intensive manual coding, while automated text-as-data methods have largely been limited to simple co-occurrence. Recent large language model (LLM) approaches offer a path forward but often rely on proprietary APIs, lack cross-lingual capability, and struggle with scalable entity resolution. We present a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from massive unstructured news corpora. It combines span-based named-entity recognition (NER) with a three-stage linking cascade mapping mentions to language-independent Wikidata identifiers; a high-throughput, ontology-constrained mixture-of-experts model then uses guided decoding to extract directed, signed relationships grounded in a domain ontology. A full-coverage spot-check against a 3491-relation gold standard shows high textual correctness (68.2% strict to 93.7% lenient). Two large-scale case studies validate the pipeline against the public record. In Austria, it reconstructs a political party's complete lifecycle, dating internal fractures and tracking personnel into successor factions and court convictions. In a Polish corpus, it uncovers the overlapping economic and governance networks of state-enterprise patronage, alongside the structurally balanced, signed conflict network of the polarized duopoly of Civic Platform (Platforma Obywatelska, PO) and Law and Justice (Prawo i Sprawiedliwość, PiS). By bridging raw multilingual text and structured relational data, our framework provides a robust, replicable foundation for cross-national empirical computational social science.

2606.27345 2026-06-26 cs.CV New submission

RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

Minghao Yin, Jiahao Lu, Wenbo Hu, Wang Zhao, Shan Ying, Kai Han

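Affiliations * The University of Hong Kong; The Hong Kong University of Science and Technology; ARC Lab, Tencent

AI summary Proposes RayPE, which injects 6D Plücker coordinates into self-attention to encode the geometric relations between rays, improving the 3D consistency and camera controllability of video diffusion transformers.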

Abstract

Modern video diffusion transformers position their tokens through RoPE on the (u,v,t) axes, a description of the camera's sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between two camera rays is captured by the Plücker reciprocal product, which is bilinear in the two rays and therefore has the same algebraic form as the dot product in Transformer attention. Building on this analogy, we propose RayPE, a positional-encoding extension that injects per-token 6D Plücker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. Because the injection is additive, the resulting attention score decomposes into a content term, a geometry term, and two content-geometry cross-terms, all of which our experiments find individually necessary. To make the encoding stable across video data with heterogeneous camera-translation scales (SfM, deep SLAM, metric), we further decouple ray direction from moment magnitude, gate the encoding by a learned function of the log-magnitude, and apply RMSNorm to align it with the QKNorm-normalized content branch. The full module adds less than 0.1% parameters to a pretrained video DiT, is zero-initialized to start from the pretrained weights, and improves camera controllability, cross-frame 3D consistency, and overall video quality on a four-dataset training mixture.
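
The analogy this abstract draws between the Plücker reciprocal product and an attention dot product rests on a standard identity, sketched below in plain Python. A ray with origin o and direction d has Plücker coordinates (d, m) with moment m = o × d, and the reciprocal product of two rays is d_a · m_b + m_a · d_b, which vanishes exactly when the two lines are coplanar. The helper names and the specific `flip` are illustrative assumptions; the abstract does not spell out RayPE's exact query/key arrangement.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def plucker(origin, direction):
    """6D Plücker coordinates (d, m) of a ray, with moment m = o x d."""
    return tuple(direction) + cross(origin, direction)


def reciprocal_product(ray_a, ray_b):
    """d_a . m_b + m_a . d_b; zero exactly when the two lines are coplanar."""
    d_a, m_a = ray_a[:3], ray_a[3:]
    d_b, m_b = ray_b[:3], ray_b[3:]
    return dot(d_a, m_b) + dot(m_a, d_b)


def flip(ray):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return ray[3:] + ray[:3]


ray_a = plucker((0.0, 1.0, 0.0), (1.0, 0.0, 0.0))  # line y = 1, z = 0, along x
ray_b = plucker((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))  # line x = 1, z = 0, meets ray_a at (1, 1, 0)
ray_c = plucker((1.0, 0.0, 1.0), (0.0, 1.0, 0.0))  # line x = 1, z = 1, skew to ray_a

print(reciprocal_product(ray_a, ray_b))  # 0.0 (intersecting rays)
print(reciprocal_product(ray_a, ray_c))  # -1.0 (skew rays)
print(dot(ray_a, flip(ray_c)))           # -1.0, the same value as a plain 6D dot product
```

Flipping one operand turns the reciprocal product into an ordinary dot product, which is the form an attention score between a query and a key already takes.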

2606.27344 2026-06-26 cs.RO New submission

VibeAct: Vibration to Actions for Contact-Rich Reactive Robot Dexterity

Yuemin Mao, Uksang Yoo, Jean Oh, Jonathan Francis, Jeffrey Ichnowski

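Affiliations * Carnegie Mellon University; Bosch Center for Artificial Intelligence

AI summary Proposes the VibeAct framework, which bridges real vibrotactile sensing and simulation-based reinforcement learning through a shared physical representation of contact and slip, outperforms baselines on dexterous-hand manipulation tasks, and transfers successfully to a real platform.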

Abstract

Dexterous manipulation depends on contact events that are fast, local, and often visually occluded. Piezoelectric microphones offer a compact and high-bandwidth way to sense these interactions, but the resulting vibro-acoustic signals are difficult to simulate faithfully enough for end-to-end sim-to-real policy learning on dexterous robot hands. We propose VibeAct, a framework that bridges real vibrotactile sensing and simulation-based reinforcement learning through a shared physical representation of contact and slip. In the real world, we embed piezoelectric microphones into a dexterous robot hand and collect vibro-acoustic data through teleoperation, then replay the recordings in a calibrated digital clone to automatically label per-finger contact and slip. A tactile estimator learns to predict contact and slip from real microphone waveforms, while manipulation policies are trained in simulation on the same representation computed directly from simulated contacts. This decoupling lets policies exploit rapid tactile feedback without simulating raw audio. Across five contact-rich tasks spanning regrasping, in-hand reorientation, and insertion, VibeAct consistently outperforms a proprioception-and-point-cloud baseline in simulation, with the largest gains on tasks requiring sustained reactive control, where the continuous slip-magnitude channel proves the most informative observation. The learned policies transfer to a physical dexterous hand-arm platform, improving success rates on deployed tasks. Project videos and additional details are at https://vibeact.github.io/.

2606.27339 2026-06-26 cs.CV New submission

SAM2Matting: Generalized Image and Video Matting

Ruiqi Shen, Guangquan Jie, Chang Liu, Henghui Ding

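Affiliations * Fudan University; Shanghai University of Finance and Economics

AI summary Proposes the SAM2Matting framework, which upgrades VOS trackers to high-fidelity video matting by decoupling the task with a region-proposal bridge and dedicated matting heads; trained only on images, it sets a new state of the art in video matting, supports diverse prompt types, and maintains strong temporal consistency.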

Comments ECCV 2026. Extended version. Project Page: https://henghuiding.com/SAM2Matting/

Abstract

Despite impressive advances in image matting, video matting remains challenging due to the inherent gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods attempt this with expensive and narrowly-scoped video matting datasets, which may limit out-of-domain generalization and compromise tracking robustness. We rethink the paradigm with SAM2Matting, a tracker-to-matting framework that advances VOS trackers to high-fidelity video matting. Specifically, it decouples the task by enhancing a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, enabling the uncompromised tracker to handle temporal consistency while the matting components resolve fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art performance on video matting, supports diverse prompt types, maintains strong temporal consistency, and demonstrates robust generalization across both human-centric and in-the-wild scenarios.

2606.27334 2026-06-26 cs.AI New submission

Language-Based Digital Twins for Elderly Cognitive Assistance

Mohammad Mehdi Hosseini, Mohammad H. Mahoor, Hiroko H. Dodge

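Affiliations * Ritchie School of Engineering and Computer Science, University of Denver; Department of Neurology, Massachusetts General Hospital, Harvard Medical School

AI summary Proposes an LLM-based digital twin framework that mimics the conversational behavior of elderly individuals using stylometric cues and contextual metadata, and introduces a multi-head conditional variational autoencoder to evaluate fidelity and cognitive consistency, enabling non-invasive cognitive health monitoring.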

Comments Accepted and published in the Proceedings of the ACM International Conference on PErvasive Technologies Related to Assistive Environments (PETRA 2026). The final published version is available through the ACM Digital Library

Abstract

Digital twins have emerged as a promising paradigm for personalized healthcare, enabling modeling of individual behavior and health trajectories. In cognitive health, early detection of Mild Cognitive Impairment (MCI) remains challenging, and language and conversational patterns can serve as non-invasive biomarkers. In this work, we propose a language-based digital twin framework that leverages large language models (LLMs) to mimic the conversational behavior of elderly individuals by incorporating stylometric cues and contextual metadata. To evaluate fidelity and cognitive consistency, we introduce a multi-head conditional variational autoencoder (cVAE) that jointly measures reconstruction quality and predicts cognitive scores. Experiments on the I-CONECT dataset show that the digital twin preserves identity-specific characteristics and achieves reconstruction and MoCA prediction errors comparable to real data, while outperforming baseline GPT-generated responses. These results highlight the potential of language-based digital twins as a scalable and non-invasive approach for personalized and continuous cognitive health monitoring.

2606.27332 2026-06-26 cs.CV New submission

RoPEMover: Depth-Aware Object Relocation via Positional Embeddings

Ipek Oztas, Duygu Ceylan, Aybars Bugra Aksoy, Aysegul Dundar

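Affiliations * Bilkent University; Brown University; Adobe Research

AI summary Proposes a geometry-aware object motion method built on the positional representations (RoPE) of diffusion transformers; a depth-aware extension of 2D RoPE enables displacement in 3D space, and, trained on synthetic data plus a small set of real images, the model preserves object identity, generates content in newly revealed regions, and consistently updates shadows and illumination.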

Abstract

Moving an object in a single image requires geometry-consistent spatial rearrangement, including handling occlusions, revealing previously unseen regions, and maintaining coherent shadows and reflections. Existing approaches are not well suited to this setting and often fail to preserve such scene-level consistency. We address this problem by introducing a geometry-aware object motion method that operates directly on the positional representations of diffusion transformers. Our key insight is that rotary positional embeddings (RoPE) define a structured spatial field that can be explicitly manipulated to induce controlled motion. We extend 2D RoPE into a depth-aware formulation that encodes 3D spatial structure, enabling consistent object displacement and scene-aware updates. Our model is trained using synthetic data combined with a small set of real images via parameter-efficient fine-tuning. Despite minimal real supervision, it preserves object identity under large spatial displacements, generates plausible content in newly revealed regions, and consistently updates scene-dependent effects such as shadows and illumination. Experimental results on standard object motion benchmarks demonstrate state-of-the-art performance across all evaluation metrics.
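
The idea that rotary positional embeddings form a spatial field that can be manipulated follows from a standard RoPE property: an attention score between two rotated vectors depends only on their relative position. The sketch below shows this for a single 1-D frequency; it is not the paper's depth-aware formulation, and `theta`, the vectors, and the positions are hypothetical values chosen for illustration.

```python
import math


def rope(vec, pos, theta=0.1):
    """Rotate a 2D feature pair by the angle pos * theta (one RoPE frequency)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)


def score(q, q_pos, k, k_pos):
    """Dot product of a rotated query and a rotated key."""
    rq, rk = rope(q, q_pos), rope(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]


q, k = (1.0, 0.0), (1.0, 0.0)

base = score(q, 3, k, 7)
shifted_together = score(q, 13, k, 17)  # both tokens moved by the same offset
key_moved = score(q, 3, k, 9)           # only the key's position changed

print(abs(base - shifted_together) < 1e-12)  # True: the score depends on relative position only
print(abs(base - key_moved) > 1e-6)          # True: reassigning one token's position changes the score
```

Reassigning the positions of an object's tokens therefore changes how those tokens attend to the rest of the scene without touching their content features, which is the kind of handle the abstract describes exploiting.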

2606.27330 2026-06-26 cs.CL cs.AI cs.CV cs.LG 新提交

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

通过自主经验探索与事后经验利用赋能GUI智能体任务规划

Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所复杂系统认知与决策智能重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出PEEU方法,通过自主探索环境发现经验并利用事后经验合成严格对齐的高层训练数据,提升小型多模态大语言模型在GUI任务中的规划与跨网站泛化能力。

Comments Accepted to ACL 2026 Main

详情
AI中文摘要

多模态网络智能体可以协助人类操作重复的GUI任务,其中有效的任务规划对于将复杂任务分解为可执行动作至关重要。虽然小型开源MLLM相比商业大模型具有成本效益和隐私保护优势,但它们存在规划能力弱和跨网站泛化有限的问题。为解决这些限制,我们引入了规划经验探索与利用(PEEU)方法,该方法自主探索环境以发现经验,并利用事后经验合成严格对齐的高层训练数据。为了定量分析驱动这种性能的泛化行为,我们提出了任务分解层次分析框架(TDHAF),系统研究跨三个任务粒度(低、中、高层)的组合泛化。我们的分析表明,掌握低层原子技能并不能保证高层规划能力,而高层任务训练能产生更强的OOD泛化。在真实世界基准上的实验证明了PEEU的卓越有效性:我们的7B模型达到30.6%的准确率,优于更大的Qwen2.5-VL-32B模型。这表明构建事后高层任务并利用经验对于小型MLLM的OOD规划能力至关重要。

英文摘要

Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open-source MLLMs are cost-efficient and privacy-preserving compared with commercial large models, they suffer from weak planning and limited cross-website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high-level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities: low, middle, and high levels. Our analysis reveals that mastering low-level atomic skills does not guarantee high-level planning competence, while high-level task training yields stronger OOD generalization. Experiments on real-world benchmarks demonstrate PEEU's superior effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These results demonstrate that constructing hindsight high-level tasks and leveraging experience are crucial for the OOD planning abilities of small MLLMs.

2606.27326 2026-06-26 cs.LG cs.CV cs.RO 新提交

Hallucination in World Models is Predictable and Preventable

世界模型中的幻觉是可预测且可预防的

Nicklas Hansen, Xiaolong Wang

发表机构 * UC San Diego(加州大学圣迭戈分校)

AI总结 本文发现世界模型中的幻觉源于状态-动作空间的低覆盖区域,提出三种可预测幻觉的信号,并开发覆盖感知采样和好奇心奖励方法,仅需50条真实轨迹即可微调模型适应新环境。

Comments Interactive paper, live demo, code, dataset, and models: https://www.nicklashansen.com/mmbench2

详情
AI中文摘要

现代生成式世界模型能够渲染越来越逼真的、动作可控的未来场景,但它们经常产生幻觉:生成的轨迹在视觉上流畅,但偏离了真实动力学。我们假设幻觉集中在状态-动作空间的低覆盖区域,而轻量级的数据中心信号既可以检测幻觉,也可以指导缓解。为了验证这一点,我们引入了MMBench2,一个包含427小时、210个任务的视觉世界建模数据集,具有真实动作、奖励和实时模拟器,并在其上训练了一个350M参数的世界模型。我们识别出三种不同的幻觉模式(感知幻觉、动作边缘化幻觉和场景发散幻觉),每种模式对应流程的不同阶段,并开发了三种能够准确预测模型失败位置的信号。为了在训练时缩小覆盖差距,我们开发了一种覆盖感知采样技术;为了在线缩小差距,我们的幻觉预测器作为好奇心奖励用于目标数据收集,从而产生一种数据高效的微调方案,该方案仅需50条真实环境轨迹即可将预训练的世界模型适应于完全未见过的环境。总体而言,我们的发现表明,世界模型中的幻觉本质上是一个数据覆盖问题,并且用于检测幻觉的相同信号也可用于缓解幻觉。我们论文的交互式网页版可在此处获取:https://www.nicklashansen.com/mmbench2

英文摘要

Modern generative world models render increasingly realistic action-controllable futures, yet they frequently hallucinate: rollouts remain visually fluent while drifting from the ground-truth dynamics. We hypothesize that hallucination concentrates in low-coverage regions of the state-action space, where lightweight data-centric signals can both detect it and guide mitigation. To test this, we introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, rewards, and live simulators, and train a 350M-parameter world model on it. We identify three distinct hallucination modes (perceptual, action-marginalized, and scene-diverging), each anchored to a different stage of the pipeline, and develop three signals that accurately predict where the model will fail. To close coverage gaps at training time, we develop a coverage-aware sampling technique; to close them online, our hallucination predictors serve as curiosity rewards for targeted data collection, yielding a data-efficient finetuning recipe that adapts the pretrained world model to entirely unseen environments with as few as 50 real environment trajectories. Overall, our findings reveal that hallucination in world models is inherently a data coverage issue, and that the same signals used to detect it can also be used for mitigation. An interactive web version of our paper is available at https://www.nicklashansen.com/mmbench2

2606.27325 2026-06-26 cs.CV 新提交

Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model

并非所有动作都同等重要:重新思考灵巧世界模型的条件化

Zizhao Yuan, Zhengtu Liang, Taowen Wang, Qiwei Liang, Yichi Wang, Yunheng Wang, Yuetong Fang, Lusong Li, Zecui Zeng, Renjing Xu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Shenzhen University(深圳大学) Beijing University of Technology(北京工业大学) JD Explore Academy(京东探索研究院)

AI总结 针对高自由度灵巧动作中不同分量重要性不均导致优化失衡的问题,提出DexAC-WM,通过动作标记化保留维度级语义,结合局部细化与全局调制对齐动作与视觉动态,并引入语义分支提供物体场景先验,显著提升视频预测的真实感和动作一致性。

详情
AI中文摘要

近期基于动作条件的世界模型在建模复杂交互和预测未来状态方面取得了显著进展。尽管这些模型通常由更强的视觉表示和模型容量驱动,但动作条件化本身仍未得到充分探索。现有方法大多将整个动作序列压缩为单一表示,这适用于低自由度控制,但在高自由度场景中可靠性下降。我们观察到高自由度灵巧动作本质上是异质的,跨越多个数量级,大规模运动与细微但重要的信号共存。当统一聚合时,优化在动作分量间表现出不平衡,阻碍了细粒度效应的建模并影响动作保真度。因此,我们提出DexAC-WM,将动作条件化视为结构化过程而非全局压缩。DexAC通过动作标记化保留维度级语义,并通过局部细化与全局调制将动作信号与视觉动态对齐。为解决现有世界模型中高级语义基础有限的问题,我们进一步引入语义分支,提供丰富的物体场景先验,使世界模型能够捕捉动态视觉细节,同时支持高自由度动作条件视频预测。在EgoDex和EgoVerse上的实验表明,将语义分支与DexAC结合显著改善了FID、FVD和PCK,在视觉时间真实性和动作跟随一致性上均有提升。我们进一步验证了DexAC可扩展到其他骨干网络,展示了结构化动作条件化设计的可扩展性。这些结果表明,将世界模型扩展到高自由度控制需要结构化的动作建模和语义基础。

英文摘要

Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While these models are often driven by stronger visual representations and model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, where large-scale motions coexist with subtle but important signals. When uniformly aggregated, optimization exhibits an imbalance across action components, which hinders the modeling of fine-grained effects and affects action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.

2606.27320 2026-06-26 cs.SD 新提交

Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding

弹性时间:神经音频编码的动态帧率瓶颈

Dimitrios Bralios, Paris Smaragdis, Minje Kim

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出弹性时间方法,通过轻量级潜在预测器实现动态帧率瓶颈,将固定帧率自编码器转换为动态帧率,在推理时高效选择边界,提升效率-质量权衡。

Comments Interspeech 2026

详情
AI中文摘要

神经音频自编码器已成为压缩、特征提取和生成的核心组件。然而,虽然现有系统支持可变比特率,但绝大多数模型仍以固定的潜在帧率运行,为信息密度差异很大的区域分配相等的时间预算,这可能导致不必要的长序列。我们引入弹性时间,一种动态帧率瓶颈,将固定帧率自编码器转换为动态帧率。我们的方法学习一个轻量级潜在预测器,用于决定哪些帧可以被跳过并在之后重建,从而在推理时实现高效的贪婪边界选择。实验表明,我们的方法在部署时实现速率控制,同时相对于基线改善了效率-质量权衡。总体而言,我们提供了一种灵活的机制来调整音频自编码器的时间分辨率,可能有助于更高效的下游建模,用于生成和长上下文任务。

英文摘要

Neural audio autoencoders have become a core component of compression, feature extraction, and generation. However, while existing systems support variable bitrate, the vast majority of models still operate at a fixed latent frame-rate, allocating equal temporal budget to regions with very different information density, which can result in unnecessarily long sequences. We introduce Elastic Time, a dynamic frame-rate bottleneck that converts fixed-frame-rate autoencoders to dynamic ones. Our method learns a lightweight latent predictor used to decide which frames can be skipped and later reconstructed, enabling efficient greedy boundary selection at inference. Experiments show our method enables deployment-time rate control while improving efficiency-quality tradeoffs relative to baselines. Overall, we provide a flexible mechanism for adjusting temporal resolution in audio autoencoders, potentially facilitating more efficient downstream modeling for generation and long-context tasks.

2606.27316 2026-06-26 cs.CL 新提交

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank

基于大语言模型的德国央行证券招股说明书资格标准审查

Serhii Hamotskyi, Akash Kumar Gautam, Christian Hänig

发表机构 * Anhalt University of Applied Sciences(安哈尔特应用科学大学)

AI总结 提出基于大语言模型的生成式信息提取流程,将任务分解为提取、规范化和解释,用于审查证券招股说明书中的资格标准,在文档级资格判断上达到91%的精确率。

Journal ref Proceedings of the 7th Financial Narrative Processing Workshop (FNP 2026) at LREC 2026, pp. 1-10, 2026

详情
AI中文摘要

验证证券作为抵押品的资格是德国中央银行的一项关键职责。然而,在冗长、半结构化且通常双语的招股说明书中,手动根据法律和财务标准验证这些资产是一项资源密集型任务。虽然之前的努力利用传统的命名实体识别(NER)进行信息提取,但这些方法可能难以应对OCR噪声、语言变异和严格的基于跨度的约束,并且需要为每种相关注释类型手动标注训练数据。在本文中,我们提出了第一个将大语言模型(LLM)应用于资格审查过程的案例研究,将范式转向生成式信息提取流程。我们的方法将任务分解为提取、规范化和解释,从而在处理噪声文本和交错德英内容时具有更大的灵活性。我们进一步引入了一种基于价值的评估方法,使用LLM作为评判者,该方法提供了比基于位置的指标更语义化的评估。我们的结果表明,基于LLM的系统在文档级资格判断中实现了高精确率(高达91%),表现出保守的操作模式,最大限度地减少了错误接受。

英文摘要

Verifying the eligibility of securities as collateral is a key responsibility of the German Central Bank. However, manually verifying these assets against legal and financial criteria within lengthy, semi-structured, and often bilingual prospectuses is a resource-intensive task. While previous efforts utilized traditional Named Entity Recognition (NER) for information extraction, these methods can struggle with OCR noise, linguistic variance, and rigid span-based constraints, and they require manually annotated training data for each relevant annotation type. In this paper, we present the first case study applying Large Language Models (LLMs) to the eligibility examination process, shifting the paradigm toward a generative Information Extraction pipeline. Our approach decomposes the task into extraction, normalization, and interpretation, allowing for greater flexibility in handling noisy text and interleaved German-English content. We further introduce a value-based evaluation methodology using LLM-as-a-judge, which offers a more semantic assessment than location-based metrics. Our results demonstrate that LLM-based systems achieve high precision (up to 91%) in document-level eligibility, exhibiting a conservative operating profile that minimizes false acceptance.

2606.27315 2026-06-26 cs.LG 新提交

Blackwell Approachability and Gradient Equilibrium are Equivalent

Blackwell可逼近性与梯度均衡等价

Brian W. Lee, Nika Haghtalab, Michael I. Jordan, Ryan J. Tibshirani

发表机构 * University of California, Berkeley(加州大学伯克利分校) Inria & École Normale Supérieure(法国国家信息与自动化研究所 & 巴黎高等师范学院)

AI总结 本文证明梯度均衡(GEQ)与Blackwell可逼近性在算法上等价,从而将GEQ纳入在线学习主流框架,并建立了与遗憾最小化、校准等框架的等价关系。

Comments 30 pages, 1 figure, accepted for presentation at COLT 2026

详情
AI中文摘要

梯度均衡(GEQ)是最近引入的一种在线优化框架,它从离线优化中推广了一阶平稳性,并抽象了诸如在线共形预测等问题。尽管GEQ与已知的在线学习框架(即遗憾最小化)有相似之处,但先前的工作表明,GEQ误差和遗憾是不可比较的目标,这使得GEQ如何融入更广泛的在线学习格局尚不明确。在这项工作中,我们证明GEQ在算法意义上等价于Blackwell可逼近性。也就是说,Blackwell可逼近问题总是可以通过查询黑箱GEQ预言机来解决,且预言机的错误率没有渐近损失,反之亦然。结合可逼近性、遗憾最小化和校准之间的已知等价关系,这些结果意味着GEQ也与这些框架等价。我们的归约是高效的,可用于将精细保证(如乐观和强自适应性)从遗憾最小化转移到GEQ。在此过程中,我们还识别了GEQ的必要和充分条件,并建立了具有无约束和约束决策集的不同GEQ概念之间的归约。

英文摘要

Gradient equilibrium (GEQ) is a recently introduced online optimization framework that generalizes first-order stationarity from offline optimization and abstracts problems like online conformal prediction. While GEQ has curious similarities with known online learning frameworks, namely regret minimization, prior work has shown that GEQ error and regret are incomparable objectives, leaving open a precise understanding of how GEQ fits into the broader online learning landscape. In this work, we show that GEQ is equivalent to Blackwell approachability in the algorithmic sense. That is, a Blackwell approachability problem can always be solved using queries to a black-box GEQ oracle, with no asymptotic loss in the oracle's error rate, and vice versa. Taken together with known equivalences between approachability, regret minimization, and calibration, these results imply that GEQ is equivalent to these frameworks, as well. Our reductions are efficient and can be used to transfer refined guarantees, such as optimism and strong adaptivity, from regret minimization to GEQ. Along the way, we also identify necessary and sufficient conditions for GEQ, and establish reductions between different notions of GEQ with unconstrained and constrained decision sets.
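
A small numerical illustration of gradient equilibrium may help, assuming the usual definition that the average gradient along the iterates tends to zero (the abstract does not spell this out). The sketch below does not implement the paper's reductions to or from Blackwell approachability. It only shows plain online gradient descent on the pinball (quantile) loss, the setting behind online conformal prediction, where the update rule implies that the average gradient equals (theta_1 - theta_{T+1}) / (eta * T). The function name `online_quantile_tracker`, the step size, and the Gaussian data stream are all assumptions made for illustration.

```python
import random

def online_quantile_tracker(ys, q=0.9, eta=0.05, theta0=0.0):
    """Online gradient descent on the pinball loss at quantile level q.

    Returns the final iterate and the list of subgradients seen.
    """
    theta = theta0
    grads = []
    for y in ys:
        # Subgradient of the pinball loss w.r.t. theta: 1{y <= theta} - q.
        g = (1.0 if y <= theta else 0.0) - q
        grads.append(g)
        theta -= eta * g
    return theta, grads

random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(5000)]  # synthetic stream
theta_T, grads = online_quantile_tracker(ys)

# Average gradient along the iterates: this is the GEQ error in this toy case.
avg_grad = sum(grads) / len(grads)
# Fraction of rounds where y_t <= theta_t, i.e. empirical coverage.
coverage = sum(1 for g in grads if g > 0) / len(grads)
```

Because the iterates stay bounded, `avg_grad` shrinks at rate 1/T, and `coverage` approaches the target level `q`. This holds even though the regret of the same sequence is a separate quantity, consistent with the abstract's remark that GEQ error and regret are incomparable objectives.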

2606.27313 2026-06-26 cs.CV 新提交

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

ViQ: 任意分辨率下的文本对齐视觉量化表示

Xumin Yu, Zuyan Liu, Zhenyu Yang, Yuhao Dong, Shengsheng Qian, Jiwen Lu, Han Hu, Yongming Rao

发表机构 * Tencent HY Vision Team(腾讯HY视觉团队) Tsinghua University(清华大学) Nanyang Technological University(南洋理工大学) Chinese Academy of Sciences(中国科学院)

AI总结 提出ViQ框架,通过文本对齐预训练和特征离散化两阶段量化学习,平衡离散表示中的语义与细节,支持原生分辨率输入,在多模态任务中取得与连续高维特征编码器相当的性能,并实现20%-70%的训练加速。

Comments Accepted to ECCV 2026

详情
AI中文摘要

文本和视觉的统一表示是一个自然的追求,因为它能够简化多模态建模并提高训练效率。然而,以与文本相同的方式将图像表示为离散信号不可避免地会引入严重的信息损失。现有工作在离散表示中难以平衡低层细节和高层语义:面向重建的表示通常缺乏语义信息,而语义更强的特征则常常遭受严重的细节损失。我们提出了ViQ,一个视觉量化表示框架,旨在平衡离散表示中的语义和细节,同时支持原生分辨率输入,从而使其能够作为任意视觉输入的通用离散表示。我们的方法将量化学习分为两个阶段:文本对齐预训练和特征离散化。通过文本对齐预训练,我们增强了视觉编码器从预训练语言模型中获得的语义丰富的监督,并使其能够处理原生分辨率的视觉输入。在离散化过程中,我们提出了一种近端表示学习策略,逐步压缩特征空间,并结合位置感知的头部量化机制,实现对任意分辨率的灵活处理。在多模态任务上的大量实验表明,与具有连续高维视觉特征的最先进多模态视觉编码器相比,ViQ在保持低层重建高精度的同时,取得了具有竞争力的性能。我们还表明,使用视觉量化表示进行多模态训练大大提高了效率,在不同的基础LLM和训练方案下实现了20%-70%的加速。

英文摘要

A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically suffer from severe loss of detail. We present ViQ, a Visual Quantized Representations framework, which is designed to balance semantics and details in discrete representations while supporting inputs at native resolutions, thereby enabling it to serve as a unified and general discrete representation for arbitrary visual inputs. Our approach structures quantization learning into two stages: text-aligned pre-training and feature discretization. With text-aligned pre-training, we enhance the visual encoder with semantic-rich supervision from the pretrained language model and enable it to process native-resolution visual inputs. During discretization, we propose a proximal representation learning strategy to progressively compact the feature space, along with a position-aware head-wise quantization mechanism that enables flexible processing of arbitrary resolutions. Extensive experiments on multimodal tasks demonstrate that ViQ achieves competitive performance compared to state-of-the-art multimodal vision encoders with continuous and high-dimensional visual features, while maintaining high precision in low-level reconstruction. We also show that multimodal training with visual quantized representations substantially improves efficiency, yielding 20%-70% acceleration with different base LLMs and training recipes.

2606.27307 2026-06-26 cs.CV 新提交

See & Sniff: Learning Visuo-Olfactory Representations

See & Sniff: 学习视觉-嗅觉表征

Seongyu Kim, Seungwoo Lee, Hyeonggon Ryu, Joon Son Chung, Arda Senocak

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) Hankuk University of Foreign Studies(韩国外国语大学) Ulsan National Institute of Science and Technology(蔚山科学技术院)

AI总结 提出SmellNet-V数据集和See & Sniff自监督框架,通过密集局部对齐学习联合视觉-嗅觉表征,实现气味分类、跨模态检索和气味定位,性能提升7%。

Comments ECCV 2026. Project Page: https://mm.kaist.ac.kr/projects/SeeandSniff/

详情
AI中文摘要

虽然现代多模态模型将视觉与语言、音频或触觉相结合,但由于缺乏配对的视觉-嗅觉数据,嗅觉仍然很大程度上未被探索。我们引入了SmellNet-V,这是一个可扩展的视觉-嗅觉数据集,基于这样的洞察:在语义类别内,气味身份对视觉变换基本不变。这使我们能够将仅含气味的样本与语义对齐的野外网络图像进行合成配对,从而将单模态嗅觉数据集转换为跨模态基准,而无需昂贵的联合收集。基于该数据集,我们提出了See & Sniff,一个自监督框架,通过密集局部对齐学习联合视觉-嗅觉表征,并自然地生成气味显著性图,用于气味源的空间定位。我们进一步引入了像素级气味定位任务和一个评估基准。我们的方法在仅凭气味进行气味分类方面比仅基于气味的基线提高了7%,并泛化到跨模态检索和气味定位,将视觉-嗅觉学习确立为多模态感知的一个新方向。

英文摘要

While modern multimodal models integrate vision with language, audio, or touch, olfaction remains largely unexplored due to the lack of paired visuo-olfactory data. We introduce SmellNet-V, a scalable visuo-olfactory dataset built on the insight that odor identity is largely invariant to visual transformations within a semantic category. This allows us to synthetically pair smell-only samples with semantically aligned in-the-wild web images, converting a unimodal olfactory dataset into a cross-modal benchmark without costly co-collection. Building on this dataset, we propose See & Sniff, a self-supervised framework that learns joint visuo-olfactory representations via dense local alignment and naturally produces smell saliency maps for spatial grounding of odor sources. We further introduce a pixel-level smell localization task and a benchmark for evaluation. Our method surpasses smell-only baselines by 7% in smell classification from smell alone and generalizes to cross-modal retrieval and smell localization, establishing visuo-olfactory learning as a new direction in multimodal perception.

2606.27305 2026-06-26 cs.CV 新提交

Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN

雕刻NeRF几何:基于人类偏好的3D感知人脸GAN微调

Archer Moore, Mingming Gong, Liam Hodgkinson

发表机构 * School of Mathematics and Statistics, The University of Melbourne(墨尔本大学数学与统计学院)

AI总结 提出直接对辐射场密度值进行人类偏好强化学习微调,无需网格或形状先验,显著改善3D人脸几何,在74.4%的成对比较中用户偏好微调后的几何。

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)用于3D生成已在多项工作中得到应用,但大多数现有流程优化显式表面表示,通常通过将辐射场转换为网格并在大量表面监督数据上训练。我们则直接对预训练的3D感知生成模型进行微调,基于辐射场密度($\sigma$)值的学习奖励,无需外部提供的网格或形状先验。奖励模型无需预训练,易于在小规模偏好样本上训练,并带来3D几何的稳健改进。针对无条件的3D感知人脸GAN(EG3D),我们的奖励直接读取神经辐射场(NeRF)的连续3D密度场,仅提供几何学习信号,无需文本条件、网格提取或多视角渲染。密度一致性约束在几何重塑时保持2D外观质量相似,但带来可测量且有限的分布代价(FID-50k从4.09升至6.66):作为概念验证,微调后的生成器基于单个标注者的偏好进行训练,在74.4%的成对比较中用户更偏好其生成的人脸几何。

英文摘要

Reinforcement learning from human feedback (RLHF) for 3D generation is now established across a number of works, but most existing pipelines optimise explicit surface representations, often by converting radiance fields into meshes and training heavily on surface-supervised data. We instead fine-tune a pretrained 3D-aware generative model directly from a learned reward over radiance-field density ($σ$) values, with no externally supplied mesh or shape prior. The reward model requires no pretraining, trains easily on a small set of preference samples, and yields robust improvement in 3D geometry. Working on an unconditional 3D-aware face GAN (EG3D), our reward reads the continuous 3D density field of the neural radiance field (NeRF) directly and supplies a geometry-only learning signal, requiring neither text conditioning, mesh extraction, nor multi-view rendering. A density-consistency constraint keeps the 2D appearance qualitatively similar while the geometry is reshaped, at a measurable but bounded distributional cost (FID-50k rises from 4.09 to 6.66): the fine-tuned generator, trained from the preferences of a single annotator as a proof of concept, produces face geometries preferred by users in 74.4% of pairwise comparisons.

2606.27304 2026-06-26 cs.LG 新提交

A Multi-Fidelity Convolutional Autoencoder-Transfer Learning Framework for Guided-Wave-Based Damage Diagnosis Using Large Simulated and Limited Experimental Datasets

基于多保真卷积自编码器-迁移学习框架的导波损伤诊断:利用大规模仿真与有限实验数据

Santosh Kapuria, Abhishek

发表机构 * Indian Institute of Technology Delhi(印度理工学院德里分校)

AI总结 提出多保真迁移学习框架,结合轻量物理仿真、卷积自编码器深度特征学习和少量实验数据,实现板结构损伤定位与尺寸估计,R²分别超0.93和0.99。

Comments 19 pages, 24 figures

详情
AI中文摘要

基于导波的结构健康监测(GWSHM)利用机载传感器在工程结构早期损伤诊断中具有巨大潜力。然而,深度学习模型的实际部署常受限于标记实验数据稀缺以及生成大规模高保真仿真数据集的高计算成本。本研究提出一种多保真迁移学习框架,该框架集成了轻量物理仿真、基于卷积自编码器(CAE)的深度特征学习、前馈神经网络以及有限的实验测量,用于对配备压电传感器的板状结构进行精确的损伤定位和尺寸估计。采用计算高效的一维时域谱元模型生成大规模合成数据集用于预训练,而迁移学习仅使用少量标记数据将模型适应到实验域。基于CAE的迁移学习框架在损伤定位精度上显著优于基于CNN的对应框架。该模型实现了优异的预测性能,损伤定位的R²分数超过0.93,损伤尺寸估计的R²分数超过0.99。其泛化能力在未见过的数据上得到验证,对预训练或微调中未出现的损伤场景表现出高预测精度。结果证明,所提出的框架是实际GWSHM应用中准确、计算高效且实用的解决方案。

英文摘要

Guided wave-based structural health monitoring (GWSHM) with onboard transducers offers significant potential for the early diagnosis of damage in engineering structures. However, the practical deployment of deep learning models is often hindered by the limited availability of labelled experimental data and the high computational cost of generating large-scale high-fidelity simulation datasets. This study presents a multifidelity transfer learning framework that integrates lightweight physics-based simulations, convolutional autoencoder (CAE)-based deep feature learning, a feed-forward neural network, and limited experimental measurements for accurate damage localisation and sizing in plate-like structures instrumented with piezoelectric transducers. A computationally efficient one-dimensional time-domain spectral element model is employed to generate a large synthetic dataset for pretraining, while transfer learning adapts the model to experimental domains using only a small amount of labelled data. The CAE-based transfer learning framework significantly outperforms its CNN-based counterpart in damage localisation accuracy. The model achieves excellent predictive performance with $R^2$ scores exceeding 0.93 for damage localisation and 0.99 for damage sizing. Its generalisation capability is demonstrated on previously unseen data, showing high prediction accuracy for damage scenarios not represented during pretraining or fine-tuning. The results establish the proposed framework as an accurate, computationally efficient, and practically viable solution for real-world GWSHM applications.

2606.27295 2026-06-26 cs.RO 新提交

LA4VLA: Learning to Act without Seeing via Language-Action Pretraining

LA4VLA:通过语言-动作预训练实现无视觉行动学习

Tao Lin, Yuxin Du, Yiran Mao, Zewei Ye, Yilei Zhong, Bing Cheng, Yiming Wang, Jiting Liu, Yang Tian, Junchi Yan, Feiran Wu, Zenan Meng, Hu Wei, Yuqian Fu, Gen Li, Bo Zhao

发表机构 * School of AI, Shanghai Jiao Tong University(上海交通大学人工智能学院) Alibaba Group(阿里巴巴集团) Nanyang Technological University(南洋理工大学) KAUST(阿卜杜拉国王科技大学)

AI总结 提出LA4VLA框架,通过语言-动作预训练学习动作先验,减少对视觉线索的依赖,提升VLA策略的鲁棒性,在仿真和真实任务中成功率显著提升。

Comments Github: https://github.com/MINT-SJTU/LA4VLA

详情
AI中文摘要

视觉-语言-动作(VLA)模型通常通过联合映射视觉观察和语言指令到动作来在机器人演示上进行预训练。然而,密集的视觉-动作监督可能主导相对稀疏的语言-动作信号。因此,策略可能依赖视觉捷径而非学习语言如何条件化动作执行,使其对视觉变化敏感。为解决这一限制,我们提出LA4VLA,一种语言-动作预训练框架,使策略能够在没有视觉观察的情况下获取语言条件化的动作先验。这些先验捕获跨任务和场景共享的可复用操作技能,减少对场景特定视觉线索的依赖。具体来说,LA4VLA将专家演示轨迹分解为原子动作片段,并将每个片段与相应的低级动作描述配对。这产生了LA4-33K,一个完全来自现有演示的33K语言-动作(LA)片段数据集,无需额外的机器人数据收集。我们进一步开发了LA4VLA-1B,一个轻量级的1B参数VLA模型,并研究了将语言-动作监督融入VLA学习的三种范式:仅LA预训练、顺序LA到VLA预训练以及混合LA-VLA预训练。在仿真和真实世界任务中,LA预训练的策略始终优于匹配的VLA预训练对应物,而结合LA和VLA监督带来进一步提升。特别是,混合LA-VLA预训练使LA4VLA-1B的平均成功率相对于无预训练基线在仿真和真实世界任务中分别提高了高达17.8和45.0个百分点。这些结果确立了LA4VLA作为构建更强、更鲁棒的VLA策略的有效且互补的预训练策略。

英文摘要

Vision-Language-Action (VLA) models are commonly pretrained on robot demonstrations by jointly mapping visual observations and language instructions to actions. However, dense visual-action supervision can dominate the comparatively sparse language-action signal. As a result, policies may rely on visual shortcuts rather than learn how language conditions action execution, making them sensitive to visual variations. To address this limitation, we propose LA4VLA, a language-action pretraining framework that enables policies to acquire language-conditioned action priors without visual observations. These priors capture reusable manipulation skills shared across tasks and scenes, reducing reliance on scene-specific visual cues. Specifically, LA4VLA decomposes expert demonstration trajectories into atomic action segments and pairs each segment with a corresponding low-level action description. This yields LA4-33K, a dataset of 33K Language-Action (LA) episodes derived entirely from existing demonstrations without additional robot data collection. We further develop LA4VLA-1B, a lightweight 1B-parameter VLA model, and investigate three paradigms for incorporating language-action supervision into VLA learning: LA-only pretraining, sequential LA-to-VLA pretraining, and mixed LA-VLA pretraining. Across simulation and real-world tasks, LA-pretrained policies consistently outperform matched VLA-pretrained counterparts, while combining LA and VLA supervision leads to further gains. In particular, mixed LA-VLA pretraining improves the average success rate of LA4VLA-1B over the no-pretraining baseline by up to 17.8 and 45.0 percentage points in simulation and real-world tasks, respectively. These results establish LA4VLA as an effective and complementary pretraining strategy for building stronger and more robust VLA policies.

2606.27292 2026-06-26 cs.RO 新提交

BOWConnect: Parallel Bayesian Optimization over Windows with Learned Local Cost Maps for Sample-Efficient Kinodynamic Motion Planning

BOWConnect: 基于学习局部代价图窗口的并行贝叶斯优化用于样本高效的运动动力学运动规划

Sourav Raxit, Abdullah Al Redwan Newaz, Jose Fuentes, Leonardo Bobadilla

发表机构 * Louisiana State University New Orleans(路易斯安那州立大学新奥尔良分校) Florida International University(佛罗里达国际大学)

AI总结 提出BOWConnect双向并行运动动力学规划器,通过窗口贝叶斯优化学习局部代价图引导采样,解决高维空间样本效率低、动态约束下代价启发式不可靠及窄通道规划难题,在基准测试中实现100%成功率。

Comments Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

本文提出BOWConnect,一种双向并行运动动力学运动规划器,解决了现有基于采样的方法的三个基本局限性:高维状态空间中的样本效率低、动态约束下不可靠的代价启发式以及窄通道环境中的性能差。与依赖随机控制采样和几何距离启发式的经典规划器不同,BOWConnect将窗口贝叶斯优化(BOW)作为基于学习的转向函数集成到并行树探索框架中,使每个工作节点能够学习局部代价图和约束,以引导采样朝向动态可行且无碰撞的控制。双向架构在并行线程中从起始区域和目标区域同时生长前向和后向树,空间哈希机制实现快速连接查询,边界值问题求解器生成运动动力学一致的桥接轨迹。在十个基准环境中的广泛评估表明,BOWConnect在复杂场景(包括窄通道和非凸空间)中实现了100%的成功率,同时提供了最快或接近最快的规划时间,而现有最先进的规划器在这些场景中失败或性能大幅下降。在地面车辆和四旋翼飞行器上的实际部署证实了实时规划且无碰撞。真实世界和模拟实验的视频、高分辨率版本的图片以及开源代码可在以下网址获取:https://bow-connect.github.io/

英文摘要

This paper presents BOWConnect, a bidirectional parallel kinodynamic motion planner that addresses three fundamental limitations of existing sampling-based methods: sample inefficiency in high-dimensional state spaces, unreliable cost heuristics under dynamic constraints, and poor performance in narrow passage environments. Unlike classical planners that rely on random control sampling and geometric distance heuristics, BOWConnect integrates Bayesian Optimization over Windows (BOW) as a learning-based steering function within a parallel tree-based exploration framework, enabling each worker to learn local cost maps and constraints to guide sampling toward dynamically feasible and collision-free controls. A bidirectional architecture simultaneously grows forward and backward trees from the start and goal regions in parallel threads, with a spatial hashing mechanism enabling fast connection queries and a boundary value problem solver generating kinodynamically consistent bridge trajectories. Extensive evaluations across ten benchmark environments demonstrate that BOWConnect achieves 100% success while delivering the fastest or near-fastest planning time in complex scenarios, including narrow passages and non-convex spaces where state-of-the-art planners fail or degrade substantially. Real-world deployment on a ground vehicle and a quadrotor confirms real-time planning with no collisions. Videos of real-world and simulated experiments, high-resolution versions of the figures, and the open-source code are available at https://bow-connect.github.io/.

2606.27291 2026-06-26 cs.LG 新提交

Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

设计便携查询生成的奖励信号:工业语义职位搜索案例研究

Ping Liu, Qianqi Shen, Jianqiang Shen, Wenqiong Liu, Rajat Arora, Yunxiang Ren, Chunnan Yao, Dan Xu, Baofen Zheng, Wanjun Jiang, Andrii Soviak, Kevin Kao, Jingwei Wu, Wenjing Zhang

发表机构 * LinkedIn Corporation(领英公司)

AI总结 提出RLAIF框架生成便携职位搜索查询,通过奖励塑造解决策略优化中的奖励黑客问题,实验表明稳健奖励设计比优化器选择更关键。

Comments Accepted to KDD 2026 Workshop on AI Agent for Information Retrieval (Agent4IR)

详情
AI中文摘要

职位搜索平台依赖低带宽查询接口,通常无法捕捉候选档案的高维复杂性。我们提出一个端到端的RLAIF(从AI反馈中强化学习)框架来生成"便携"职位搜索查询,即抽象掉求职者特定标识符同时保留可泛化资质的术语。该任务引入了一个高度对抗性的奖励面,其中策略优化经常利用LLM作为评判者的评分标准中的缺陷,导致退化的逐字复制行为。我们进行了全面的实证实验,以隔离优化机制与结构化奖励工程的影响。我们的结果表明,对于无评论家的优化器,性能主要取决于稳健的奖励塑造,使得具体算法选择基本无关紧要。虽然无评论家的每次 rollout 基线方法(RLOO和REINFORCE++)天然抵抗奖励黑客,但GRPO中的组相对优势归一化对虚假奖励信号特别敏感,使其极易被利用。我们表明,引入一个确定性的、基于规则的奖励下限来纠正分配给逐字复制的奖励,可以缓解这种失败模式,导致在跨族评估评判器上质量提升$+0.147$。最终,我们表明训练时的奖励模型将性能增益夸大了$2.4$倍,确认训练成功根本上依赖于执行奖励塑造纪律,而非选择替代优化器。

英文摘要

Job-search platforms rely on low-bandwidth query interfaces that often fail to capture the high-dimensional complexity of candidate profiles. We present an end-to-end RLAIF (Reinforcement Learning from AI Feedback) framework to generate "portable" job search queries, i.e., terms that abstract away seeker-specific identifiers while preserving generalizable qualifications. This task introduces a highly adversarial reward surface where policy optimization frequently exploits flaws in LLM-as-judge rubrics, resulting in degenerate verbatim-copying behaviors. We conducted comprehensive empirical experiments to isolate the impact of optimization mechanics from that of structured reward engineering. Our results demonstrate that for critic-free optimizers, performance is overwhelmingly dictated by robust reward shaping, rendering the specific choice of algorithm largely immaterial. While critic-free per-rollout baseline methods (RLOO and REINFORCE++) natively resist reward-hacking, the group-relative advantage normalization in GRPO appears uniquely sensitive to spurious reward signals, making it disproportionately susceptible to exploitation. We show that introducing a deterministic, rule-based reward floor to correct for rewards assigned to verbatim copying mitigates this failure mode, resulting in a substantial $+0.147$ quality improvement on a cross-family evaluation judge. Ultimately, we show that the training-time reward model inflates performance gains by $2.4\times$, confirming that training success is fundamentally dependent on enforcing reward-shaping discipline rather than on selecting alternative optimizers.
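
The contrast between a leave-one-out baseline and group-relative normalization can be made concrete with the textbook forms of the two advantage estimators. The sketch below is an illustration under assumptions, not the paper's code: the function names are hypothetical, the reward values are made up, and `floor_copy_reward` is only one plausible reading of a "deterministic, rule-based reward floor" (the abstract does not give the rule). It shows how dividing by the within-group standard deviation can turn a tiny, possibly spurious, judge-score difference into a unit-scale advantage.

```python
from statistics import mean, pstdev

def rloo_advantages(rewards):
    """RLOO: each rollout is baselined by the mean reward of the others."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style: center by the group mean, scale by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def floor_copy_reward(query, profile, judge_reward, floor=0.0):
    """Hypothetical rule: a query copied verbatim from the profile
    receives the floor reward regardless of the judge's score."""
    if query.strip().lower() in profile.lower():
        return floor
    return judge_reward

# Four rollouts for one prompt; the last gets a marginally higher judge score.
rewards = [0.50, 0.50, 0.50, 0.52]
rloo = rloo_advantages(rewards)   # last entry stays small (about 0.02)
grpo = grpo_advantages(rewards)   # last entry is rescaled to about 1.73
```

If that 0.02 gap came from a rubric flaw such as rewarding copied text, the normalized estimator would push the policy toward it much harder, which is consistent with (though not proof of) the sensitivity the abstract reports.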

2606.27288 2026-06-26 cs.AI cs.LG 新提交

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

语言模型组合何时有效?基于67个前沿模型的协同失败上限:路由、投票与多智能体混合

Josef Chen

发表机构 * KAIKAKU

AI总结 研究多模型系统(路由、投票、级联等)的准确率上限,发现其增益受限于所有模型同时出错的比率β,而非常用的平均成对误差相关性ρ,并通过67个模型实验验证了β的显著低估。

详情
English abstract

Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model's answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample certificate on the largest gain any router, vote, or cascade could deliver before training a router. Across 67 models from 21 providers, a tetrachoric-calibrated single-factor model still underprices the all-wrong tail: on open-ended mathematics, observed beta is 0.052 versus 0.023 under the full 67-model Gaussian copula, an underpricing of about 2.5 times (90 percent CI 1.7 to 3.4, k = 17). The effect recurs on execution-graded code, where beta is 0.079. Re-asking the same GPQA-Diamond questions in free-response rather than multiple-choice form reopens the tail, with beta 0.127 and a five-judge panel with kappa 0.73 to 0.92, locating co-failure in answer format rather than subject. At matched quality, low-rho heterogeneous ensembles beat high-rho Self-MoA, but on checkable tasks in our pool, combining models rarely beats the single best model without a strong query-level routing signal. Gains come from models failing on different questions, not from adding more models.
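
The two central claims, that accuracy is capped at one minus beta and that pairwise correlation cannot identify beta, can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the paper's code: the two error tables are hand-built parity patterns for three hypothetical models, the function names are invented here, and the counts passed to `clopper_pearson` are made up. The interval is computed by bisection on the exact binomial tail, which is practical only for moderate sample sizes.

```python
from itertools import combinations
from math import comb

def all_wrong_rate(errors):
    """errors[q][m] is 1 when model m is wrong on query q; returns beta."""
    return sum(all(row) for row in errors) / len(errors)

def mean_pairwise_correlation(errors):
    """Average Pearson correlation of error indicators over all model pairs."""
    n = len(errors)
    cols = list(zip(*errors))

    def corr(a, b):
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
        var_a = sum((x - ma) ** 2 for x in a) / n
        var_b = sum((y - mb) ** 2 for y in b) / n
        return cov / (var_a * var_b) ** 0.5

    pairs = list(combinations(cols, 2))
    return sum(corr(a, b) for a, b in pairs) / len(pairs)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a decreasing function f on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact interval for a binomial rate with k events in n trials."""
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper

# Two error laws over four equally likely queries and three models.
# Both have error rate 0.5 per model and zero pairwise correlation.
even_parity = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
odd_parity = [(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

for name, table in [("even", even_parity), ("odd", odd_parity)]:
    beta = all_wrong_rate(table)
    rho = mean_pairwise_correlation(table)
    print(name, "rho =", rho, "beta =", beta, "ceiling =", 1 - beta)

# Made-up counts: 26 all-wrong queries out of 500.
beta_lo, beta_hi = clopper_pearson(26, 500)
print("beta interval:", beta_lo, beta_hi)
```

In the even-parity table no query defeats all three models, so a perfect router could reach accuracy 1. In the odd-parity table one query in four defeats every model, so no selection policy can exceed 0.75, even though rho is identical in both cases.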

2606.27286 2026-06-26 cs.AI New submission

Simulation-based inference for rapid Bayesian parameter estimation in epidemiological models: a comparison with MCMC

Alina Bazarova, Johann Fredrik Jadebeck, Henrik Zunker, Carolina J. Klett-Tammen, Torben Heinsohn, Wolfgang Wiechert, Katharina Noeh, Stefan Kesselheim

Affiliations Forschungszentrum Jülich, Jülich Supercomputing Centre; Helmholtz AI; Forschungszentrum Jülich, Institute of Bio- and Geosciences; German Aerospace Center, Institute of Software Technology, Department High-Performance Computing; Helmholtz Centre for Infection Research; RWTH Aachen University, Computational Systems Biology

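AI summary Studies simulation-based inference (SBI) with neural posterior estimation as an alternative to MCMC for Bayesian calibration of epidemiological models, and evaluates its computational efficiency and posterior accuracy on COVID-19 ICU data.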

English abstract

Mechanistic epidemiological models are widely used to support infectious disease forecasting and public-health decision making. Bayesian calibration of such models is commonly performed using Markov chain Monte Carlo (MCMC), which can become computationally expensive for high-dimensional nonlinear systems and repeated near-real-time analyses. Here, we investigate simulation-based inference (SBI) using neural posterior estimation as a scalable alternative for Bayesian calibration of a mechanistic SECIR epidemiological model using COVID-19 intensive care unit (ICU) occupancy data from Germany during 2020. We compared SBI and MCMC across multiple epidemic phases using both 31-day inference windows and a substantially more challenging 201-day reconstruction problem involving multiple transmission change points. Posterior agreement was evaluated quantitatively using Wasserstein distances and Kullback-Leibler divergences together with posterior predictive checks. Across the 31-day windows, SBI recovered posterior distributions in strong agreement with MCMC while accurately reproducing observed ICU trajectories. In the 201-day setting, SBI preserved the dominant posterior structure despite increased uncertainty. SBI, by combining CPU and GPU resources, substantially reduced computational runtime compared with MCMC, which was restricted to running on CPUs. Whereas MCMC required approximately 1000 seconds for the 31-day inference problems, SBI achieved comparable posterior and predictive performance in approximately 60-70 seconds on a single GPU. For the 201-day inference problem, SBI required an average of 157 seconds, while the MCMC runs took over 19,000 seconds. Our results demonstrate that SBI provides a rapid and computationally efficient framework for Bayesian calibration of mechanistic epidemiological models, supporting repeated near-real-time inference and rapid outbreak analysis.
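
One of the agreement metrics named in the abstract, the Wasserstein distance, has a simple form for two equal-size one-dimensional samples: sort both and average the absolute gaps. The sketch below assumes posteriors are compared one parameter at a time, which the abstract does not state, and uses synthetic Gaussian draws in place of real MCMC and SBI output. The final lines only restate the runtime ratios implied by the quoted timings.

```python
import random

def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

rng = random.Random(0)
# Synthetic stand-ins for posterior draws of one parameter (not the paper's data).
mcmc_draws = [rng.gauss(0.30, 0.05) for _ in range(2000)]
sbi_draws = [rng.gauss(0.30, 0.05) for _ in range(2000)]
shifted_draws = [x + 0.10 for x in mcmc_draws]

print(wasserstein_1d(mcmc_draws, sbi_draws))      # small: same distribution
print(wasserstein_1d(mcmc_draws, shifted_draws))  # close to the 0.10 shift

# Runtime ratios implied by the figures quoted in the abstract.
speedup_31d = (1000 / 70, 1000 / 60)   # roughly 14 to 17 times
speedup_201d = 19000 / 157             # at least about 121 times
print(speedup_31d, speedup_201d)
```

A distance near zero indicates that the two sets of draws are hard to tell apart on that parameter, while a pure shift of the posterior shows up as a distance equal to the shift.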

2606.27282 2026-06-26 cs.LG New submission

How Good Can Linear Models Be for Time-Series Forecasting?

Lang Huang, Jinglue Xu, Luke Darlow

Affiliations Sakana AI, Tokyo, Japan; National Institute of Informatics, Japan

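AI summary By tuning preprocessing (context length, normalization, regularization, data augmentation) rather than scaling up the model, Ridge regression exceeds Transformer, MLP, and CNN baselines on six of eight benchmarks, challenging the assumption that more capacity yields higher accuracy.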

Comments 17 pages, 10 figures, and 5 tables

English abstract

Time-series forecasting research has been moving steadily toward larger architectures, from specialized transformers to general-purpose foundation models, on the assumption that capacity is what unlocks accuracy. We take the opposite position: most of the gap can be closed at far lower cost by tuning preprocessing rather than scaling models. We use Ridge regression as the testbed, since it has a closed-form solution and interpretable weights, which let the optimal hyperparameters be read off the search directly. We search over context length, local normalization, regularization, and augmentation on eight standard benchmarks and find three patterns. (1) Optimal lookback is strongly series-specific and often non-monotonic in forecast horizon, with fitted power-law exponents ranging from $+0.46$ on ETTm2 to $-0.19$ on Exchange and Traffic, challenging the convention that longer horizons need longer history. (2) Normalizing over a learned trailing fraction of the context, rather than its entirety, is almost universally preferred. (3) Series within the same dataset often disagree on hyperparameters; the optimal degree of cross-series sharing varies from fully shared to fully per-series. The resulting models beat prior linear forecasters on most dataset-horizon entries and exceed Transformer, MLP, and CNN baselines on six of eight benchmarks. The optimized hyperparameters also serve as a diagnostic on the data itself, revealing structures that larger models absorb silently into their learned parameters.
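
The ingredients listed in the abstract (a lookback window, normalization over a trailing fraction of the context, and closed-form Ridge regression) fit in a short pure-Python sketch. This is a minimal illustration under stated assumptions, not the authors' pipeline: all function names are invented here, the trailing fraction is a fixed hyperparameter where the paper describes a learned one, augmentation and cross-series sharing are omitted, and the series is a synthetic sinusoid where the paper uses standard benchmarks.

```python
import math

def make_windows(series, lookback, horizon):
    """Slice a series into (context, target) pairs."""
    pairs = []
    for t in range(lookback, len(series) - horizon + 1):
        pairs.append((series[t - lookback:t], series[t:t + horizon]))
    return pairs

def trailing_stats(context, fraction):
    """Mean and std over the last `fraction` of the context window."""
    k = max(1, round(len(context) * fraction))
    tail = context[-k:]
    mean = sum(tail) / k
    std = math.sqrt(sum((v - mean) ** 2 for v in tail) / k)
    return mean, (std if std > 1e-8 else 1.0)

def ridge_fit(X, Y, lam):
    """Closed-form ridge: solve (X^T X + lam I) W = X^T Y by Gauss-Jordan."""
    d, h = len(X[0]), len(Y[0])
    # Augmented matrix [X^T X + lam I | X^T Y]
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0) for j in range(d)]
         + [sum(x[i] * y[k] for x, y in zip(X, Y)) for k in range(h)]
         for i in range(d)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(d):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[d:] for row in A]  # d x h weight matrix

def fit_forecaster(series, lookback, horizon, lam, fraction):
    """Normalize every window by its own trailing statistics, then fit ridge."""
    X, Y = [], []
    for context, target in make_windows(series, lookback, horizon):
        mean, std = trailing_stats(context, fraction)
        X.append([(v - mean) / std for v in context])
        Y.append([(v - mean) / std for v in target])
    return ridge_fit(X, Y, lam)

def forecast(W, context, fraction):
    """Predict in normalized space, then map back to the original scale."""
    mean, std = trailing_stats(context, fraction)
    z = [(v - mean) / std for v in context]
    return [mean + std * sum(z[i] * W[i][k] for i in range(len(z)))
            for k in range(len(W[0]))]

# Synthetic series (offset sinusoid, period 12); not one of the paper's benchmarks.
series = [10.0 + 3.0 * math.sin(2 * math.pi * t / 12) for t in range(120)]
train, truth = series[:100], series[100:103]
W = fit_forecaster(train, lookback=12, horizon=3, lam=1e-6, fraction=0.5)
pred = forecast(W, train[-12:], fraction=0.5)
max_err = max(abs(p - y) for p, y in zip(pred, truth))
print(pred, truth, max_err)
```

Because the fit is closed-form, a search over `lookback`, `lam`, and `fraction` amounts to repeating `fit_forecaster` on a grid and comparing held-out error, which is the kind of search the abstract describes.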