arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
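2606.11190 2026-06-10 cs.LG New submission

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero

Affiliations: Technion; Genentech; Brown University; Meta AI, FAIR

AI summary: Proposes a unified linear framework that uses a signal-to-noise model to expose the complementary failure modes of cross-modal alignment and cross-modal prediction, builds a four-regime phase diagram to guide the choice of multimodal learning objective, and validates it in nonlinear experiments.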

Abstract

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at https://github.com/IlayMalinyak/mm_align_vs_pred.

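2606.11189 2026-06-10 cs.LG cs.AI cs.CL New submission

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh

Affiliations: University of California, Los Angeles (UCLA); Arena

AI summary: Reinterprets supervised fine-tuning as target distribution design and proposes the Q-target framework, which decomposes supervision into how strongly to rely on the observed token and how to allocate probability over alternative tokens. On that basis it proposes Target-SFT, which outperforms existing methods on multiple reasoning tasks.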

Abstract

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT, which constructs the training objective directly from the desired target distribution. This method consistently outperforms existing methods across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
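The two choices in the Q-target decomposition can be illustrated with a small sketch. This is not the paper's Target-SFT objective: the function names, the weight `alpha`, and the decision to spread the remaining mass in proportion to the model prior are all assumptions made here for illustration.

```python
from math import log


def build_q_target(prior, observed, alpha):
    """Build a token-level target distribution Q (illustrative sketch).

    prior:    model probabilities over the vocabulary (assumed to sum to 1)
    observed: index of the demonstrated token
    alpha:    choice (1), the mass placed on the observed token
    Choice (2) here is an assumption: the remaining 1 - alpha mass is spread
    over the alternatives in proportion to the prior.
    """
    rest = sum(p for i, p in enumerate(prior) if i != observed)
    if rest <= 0.0:
        # No alternatives carry mass, so fall back to the one-hot target.
        return [1.0 if i == observed else 0.0 for i in range(len(prior))]
    return [
        alpha if i == observed else (1.0 - alpha) * p / rest
        for i, p in enumerate(prior)
    ]


def cross_entropy(q, model_probs):
    """Cross-entropy of the model against the target Q, skipping zero-mass entries."""
    return -sum(qi * log(pi) for qi, pi in zip(q, model_probs) if qi > 0)
```

With `alpha = 1.0` the target collapses to the usual one-hot SFT target, while smaller values keep part of the model prior over alternatives.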

2606.11188 2026-06-10 cs.CV New submission

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang

Affiliations: Shanghai Key Lab of Intelligent Information Processing, Fudan University; School of Computer Science, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Youtu Lab, Tencent; Meta AI; Shanghai AI Laboratory

AI summary: Proposes ARM, which maps images into compact token sequences with a discrete semantic visual tokenizer and combines autoregressive modeling with reinforcement learning to unify image understanding, generation, and editing, improving both task performance and cross-task synergy.

Comments
technical report

Abstract

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.

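2606.11187 2026-06-10 cs.CV New submission

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu

Affiliations: Robbyant; HUST (Huazhong University of Science and Technology); HKUST (Hong Kong University of Science and Technology); HKUST (GZ) (Hong Kong University of Science and Technology, Guangzhou)

AI summary: Proposes the Next Forcing framework, whose multi-chunk prediction training objective speeds up the convergence of video generation models, improves accuracy, and accelerates inference, achieving state-of-the-art results on multiple benchmarks.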

Comments
Project page: https://gangweix.github.io/next-forcing/

Abstract

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.

2606.11184 2026-06-10 cs.RO New submission

TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation

Yujie Zang, Yuhang Zheng, Xian Nie, Yupeng Zheng, Shuai Tian, Songen Gu, Chen Gao, Zining Wang, Shuicheng Yan, Wenchao Ding

Affiliations: TARS Robotics; National University of Singapore; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences; Fudan University

AI summary: Proposes the TacForeSight framework, which predicts tactile latent dynamics with a force-conditioned tactile world model and pairs it with a predictive tactile-conditioned policy for proactive contact reasoning in high-frequency manipulation, outperforming existing methods under dynamic contact disturbances.

Abstract

Contact-rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact-aware control by incorporating tactile or force feedback, but they rarely model the asymmetric spatiotemporal roles of global force and local tactile sensing. To address this, we propose TacForeSight, a lightweight force-conditioned tactile foresight framework for real-time manipulation. The core component is TacForceWM, a tactile world model that predicts short-horizon tactile latent dynamics from dual-finger tactile observations conditioned on high-frequency wrist force and torque signals. Another key component, the Predictive Tactile-Conditioned Policy, leverages the predicted latents as anticipatory contact priors, models the current-to-future tactile evolution via cross-attention, and adaptively fuses visuo-tactile features through a tactile-guided gating module. By forecasting purely within a compact latent space, TacForeSight enables proactive contact reasoning with efficient real-time inference suitable for high-frequency manipulation control. Real-robot experiments on five representative tasks and three in-process perturbation settings show that TacForeSight consistently outperforms existing baselines, particularly under dynamic contact disturbances. All models and datasets will be made publicly available on the project website at https://tacforesight.github.io/ProjectPage.

2606.11182 2026-06-10 cs.LG cs.AI New submission

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Weixian Xu, Shilong Liu, Mengdi Wang

Affiliations: Shanghai Jiao Tong University; Princeton University

AI summary: Proposes EEVEE, the first multi-dataset test-time prompt learning framework, which resolves cross-dataset interference through a router-prompt co-evolution strategy and improves robustness under heterogeneous data streams.

Comments
19 pages, 6 figures

Abstract

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability, because real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.

2606.11180 2026-06-10 cs.CV New submission

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee, Siyoon Jin, Heeseong Shin, Jung Yi, Yunjin Park, Chulmin Park, Seungryong Kim

Affiliations: KAIST AI; AIPARK

AI summary: Proposes Lip Forcing, the first autoregressive diffusion method for video-to-video lip synchronization, which distills a 14B teacher into causal students that reach real-time synchronization with only two denoising steps, and introduces Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward.

Comments
Project Page: https://cvlab-kaist.github.io/LipForcing/

Abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, $17.6\times$ faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs $39.8\times$ faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.

2606.11176 2026-06-10 cs.CV cs.CL cs.CY cs.HC New submission

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu, Philip Torr, James Zou

Affiliations: University of Oxford; Stanford University

AI summary: Proposes the multi-agent framework Data2Story, which verifies claims through evidence chains and automatically generates multimodal articles. Evaluated on 18 articles, it is shown to approach human journalists in transparency and auditability.

Comments
Project page: https://data2story.github.io GitHub: https://github.com/QinghongLin/data2story-skill

Abstract

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.

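2606.11173 2026-06-10 cs.AI cs.LG New submission

The Role of Feedback Alignment in Self-Distillation

Semih Kara, Oğuzhan Ersoy

Affiliations: Gensyn

AI summary: Studies self-distillation for improving language model performance and finds that structural alignment between the feedback and the model's reasoning is the key factor: step-aligned critique is more effective than a binary reward or a reference solution.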

Comments
Accepted to the ICML 2026 Workshop on RL from World Feedback (RLxF)

Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
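The distribution-matching step described above can be sketched as a per-token divergence between the self-teacher (which sees the context) and the student (which does not). The function names and the choice of KL(teacher || student) are assumptions of this sketch; the abstract does not state which divergence or direction the authors use.

```python
from math import log


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists (q must be > 0 where p > 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(teacher_dists, student_dists):
    """Return the mean per-token KL(teacher || student) and the per-token values.

    teacher_dists: next-token distributions from the model conditioned on
                   question + context (for example, a critique)
    student_dists: next-token distributions from the same model conditioned
                   on the question only
    """
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token), per_token
```

Tokens where the context leaves the teacher's distribution unchanged contribute zero loss, which is consistent with the abstract's observation that step-aligned feedback only pressures the tokens where the reasoning fails.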

2606.11172 2026-06-10 cs.LG New submission

Predicting Future Behaviors in Reasoning Models Enables Better Steering

Evgenii Kortukov, Piotr Komorowski, Florian Klein, Paula Engl, Gabriele Sarti, Seong Joon Oh, Sebastian Lapuschkin, Wojciech Samek

Affiliations: Fraunhofer HHI (Fraunhofer Heinrich Hertz Institute); Northeastern University; KAIST

AI summary: Trains activation probes to predict the future behavior of reasoning models and proposes Future Probe Controlled Generation (FPCG), which achieves steering with almost no quality degradation across multiple evaluations.

Abstract

Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation (FPCG). FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.
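The sample-then-select loop can be sketched as a best-of-N choice at the sentence level. Everything below is a hypothetical stand-in: in the paper the probe reads model activations, whereas here `probe_score` is any callable that scores a candidate sentence, and `make_cycling_sampler` is a deterministic toy that replaces sampling from an LRM.

```python
import itertools


def fpcg_step(sample_sentence, probe_score, context, num_candidates=4):
    """Sample candidate next sentences and keep the one the probe scores highest."""
    candidates = [sample_sentence(context) for _ in range(num_candidates)]
    return max(candidates, key=lambda sentence: probe_score(context, sentence))


def steer_generation(sample_sentence, probe_score, prompt,
                     num_sentences=3, num_candidates=4):
    """Build a response sentence by sentence, steering each step with the probe."""
    context = prompt
    for _ in range(num_sentences):
        best = fpcg_step(sample_sentence, probe_score, context, num_candidates)
        context = context + " " + best
    return context


def make_cycling_sampler(sentences):
    """Toy sampler (hypothetical): cycles through fixed sentences, ignoring the context."""
    iterator = itertools.cycle(sentences)
    return lambda context: next(iterator)
```

Because selection happens over whole sentences that the model itself produced, the text stays on the model's own distribution, which is one plausible reading of why the abstract reports almost no quality degradation.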

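2606.11167 2026-06-10 cs.CL eess.AS New submission

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov

Affiliations: Kyutai; Gradium

AI summary: To address interactivity problems in full-duplex dialogue models, proposes a reinforcement-learning-based post-training alignment method that optimizes four axes (pause handling, turn-taking, backchanneling, and user interruption) and adds an LLM-based reward to prevent semantic degradation, yielding consistent improvements on Moshi and PersonaPlex.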

Abstract

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

2606.11164 2026-06-10 cs.AI New submission

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu

Affiliations: Tsinghua University; City University of Hong Kong; Peking University; Shenzhen University of Advanced Technology; Institute of Automation, Chinese Academy of Sciences

AI summary: To address the inference bottleneck caused by rapid KV cache growth in long chain-of-thought reasoning, proposes the ReasonAlloc framework, whose hierarchical budget allocation combines offline layer-wise preallocation with online head-wise reallocation and markedly improves reasoning performance at small budgets without added training overhead.

Abstract

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.
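The two-level allocation can be sketched as a proportional split applied twice: once across layers, then across the heads of each layer. This is only an illustration of hierarchical budgeting under assumed inputs. The names `layer_weights` and `head_utilities` are hypothetical, and the abstract does not specify how ReasonAlloc derives its layer profile or its head utility scores.

```python
def proportional_split(total, weights):
    """Split an integer budget in proportion to non-negative weights, preserving the sum."""
    weight_sum = sum(weights)
    raw = [total * w / weight_sum for w in weights]
    alloc = [int(r) for r in raw]
    # Hand the leftover units to the entries with the largest fractional remainders.
    leftover = total - sum(alloc)
    order = sorted(range(len(weights)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def allocate_kv_budget(total_budget, layer_weights, head_utilities):
    """Return per-layer, per-head token budgets.

    layer_weights:  one weight per layer, fixed offline (assumed given)
    head_utilities: for each layer, one utility score per head, refreshed
                    during decoding (assumed given)
    """
    layer_budgets = proportional_split(total_budget, layer_weights)
    return [
        proportional_split(budget, utilities)
        for budget, utilities in zip(layer_budgets, head_utilities)
    ]
```

A uniform baseline corresponds to equal weights at both levels, so the sketch reduces to the setting the abstract contrasts against when every weight is 1.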

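2606.11162 2026-06-10 cs.LG New submission

COGENT: Continuous Graph Emulators with Neural Ordinary Differential Equations for Long-Term Physical Forecasting

Zesheng Liu, Maryam Rahnemoonfar

Affiliations: Lehigh University

AI summary: Proposes COGENT, a continuous graph emulator that combines a graph encoder with Neural Ordinary Differential Equations for long-term physical forecasting on irregular geospatial meshes. Continuous latent dynamics allow predictions at arbitrary times, rollout-horizon sampling and progressive scheduling stabilize training, and ice-sheet simulations show better long-range stability than autoregressive graph baselines.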

Abstract

In this work, we present COGENT, a continuous graph emulator with Neural Ordinary Differential Equations for long-term physical forecasting on irregular geospatial meshes. COGENT encodes a finite history of system states and associated forcing fields and external forcings with a graph-based history encoder, producing node-wise context vectors that capture both local spatial interactions and temporal evolution. These context vectors initialize and condition a latent Neural Ordinary Differential Equation whose dynamics are driven by interpolated future forcings and explicit relative rollout time. By modeling the forecast trajectory as a continuous latent dynamical system, COGENT can generate predictions at arbitrary future times rather than being restricted to a fixed temporal discretization. A residual decoder maps the resulting latent trajectories back to future physical states, enabling direct multi-step forecasting without repeatedly feeding predicted states back into the model. This formulation combines graph-based spatial representation, history-conditioned latent dynamics, and continuous-time rollout in a unified framework for mesh-based physical simulation emulation. In order to stabilize training with long-horizon supervision, we also propose effective rollout-horizon sampling and a progressive rollout-horizon scheduling strategy. We evaluate COGENT on transient ice-sheet simulations generated by the Ice-sheet and Sea-level System Model, demonstrating improved long-range stability over autoregressive graph baselines. These results suggest that continuous graph Neural ODEs provide a promising methodology for scalable physical forecasting on irregular geospatial meshes, particularly in applications that require stable long-horizon predictions and the ability to query system states at arbitrary times.
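The idea of querying a latent trajectory at an arbitrary time can be sketched with a scalar latent state and fixed-step Euler integration. This is a toy under stated assumptions: `dynamics` stands in for the learned latent ODE, the forcing is linearly interpolated, and the solver, step size, and function names are choices made here, not details taken from the paper.

```python
import math


def interpolate(times, values, t):
    """Linearly interpolate a forcing series at time t, clamping outside the range."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for k in range(len(times) - 1):
        t0, t1 = times[k], times[k + 1]
        if t0 <= t <= t1:
            return values[k] + (values[k + 1] - values[k]) * (t - t0) / (t1 - t0)
    return values[-1]


def rollout(z0, dynamics, forcing_times, forcing_values, query_time, dt=0.01):
    """Integrate dz/dt = dynamics(z, forcing(t), t) from t = 0 to query_time with Euler steps."""
    steps = max(1, math.ceil(query_time / dt))
    h = query_time / steps
    z = z0
    for k in range(steps):
        t = k * h
        forcing = interpolate(forcing_times, forcing_values, t)
        z = z + h * dynamics(z, forcing, t)
    return z
```

Because the state is advanced in continuous time from the encoded initial condition, any `query_time` can be requested directly, in contrast to an autoregressive model that must step through a fixed temporal grid.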

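2606.11151 2026-06-10 cs.RO New submission

JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation

Drake Moore, Matt Cheng, Xiang Zhi Tan, Taşkın Padır

Affiliations: Northeastern University

AI summary: Proposes JOIN, a heterogeneous on-demand bimanual system that conditionally joins an anchor arm with a mobile complement arm and combines a vision-language model with geometric tools to solve representative bimanual activities of daily living, achieving a higher success rate with fewer operator corrections in experiments.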

详情
Comments
Xiang Zhi Tan and Taşkın Padır share equal advising
AI中文摘要

辅助移动和操作平台作为恢复残疾人独立性的手段已受到越来越多的关注。虽然对于许多基本的日常生活活动(ADL)有效,但诸如开罐、倒液体、端托盘或基本餐食准备等大量日常任务本质上是双臂的,任何单臂系统都无法完成。由于额外的功耗、成本以及转移和移动所需空间的损失,在轮椅上增加第二只手臂是不切实际的。我们提出了一种异构的按需双臂系统,其中安装在轮椅上的锚点臂在需要时与一个被召唤的移动操作器(作为补臂)连接。我们称之为双臂连接的核心技术问题是有条件的:锚点臂已经确定了抓取,补臂必须选择站立位置和抓取对象以完成任务。我们将双臂连接分解为三个阶段(规划、驱动、抓取),并表明视觉语言模型(VLM)结合标准几何工具,提供了足以解决代表性双臂ADL类别的任务级知识。我们的系统JOIN贡献了(i)一个轮椅参考的对抗分数,以及(ii)任务条件方向可操作性。我们在Kinova Gen3锚点臂和Hello Robot Stretch~3补臂上对代表性的同对象和异对象任务进行了评估。JOIN完成了更多尝试(19/20),优于最先进的方法(14/20),并且需要的操作员修正明显更少。

English abstract

Assistive mobility and manipulation platforms have received increasing attention as a means of restoring independence to individuals with disabilities. While such platforms are effective for many basic activities of daily living (ADLs), a significant percentage of everyday tasks, such as opening a jar, pouring a liquid, lifting a tray, or basic meal preparation, is fundamentally bimanual and remains out of reach for any single-arm system. Adding a second arm to a wheelchair is impractical because of the additional power draw, the cost, and the loss of space required for transfers and mobility. We instead propose a heterogeneous, on-demand bimanual system, in which a wheelchair-mounted anchor arm is joined when needed by a summoned mobile manipulator that serves as a complement arm. The central technical problem, which we call bimanual joining, is conditional: the anchor has already committed to a grasp, and the complement arm must choose where to stand and what to grasp to complete the task. We formulate bimanual joining as a three-phase decomposition (plan, drive, grasp) and show that a vision-language model (VLM), coupled with standard geometric tools, provides task-level knowledge sufficient to solve a representative class of bimanual ADLs. Our system, JOIN, contributes (i) a wheelchair-referenced opposition score and (ii) task-conditioned directional manipulability. We evaluate JOIN on a Kinova Gen3 anchor and a Hello Robot Stretch 3 complement on representative same-object and different-object tasks. JOIN completed more attempts (19/20) than state-of-the-art methods (14/20) and required markedly less correction by the operator.

2606.11149 2026-06-10 cs.LG new submission

Efficiently Learning Drifting Halfspaces with Massart Noise

高效学习带有Massart噪声的漂移半空间

Mingchen Ma, Guyang Cao, Jelena Diakonikolas, Ilias Diakonikolas

Affiliations: University of Wisconsin-Madison

AI summary: For learning drifting concepts under Massart noise, proposes a computationally efficient learner achieving error η + Õ(Δ^{1/3}/γ) and shows that this error is optimal for low-degree polynomial tests.

Comments
To appear at ICML 2026
AI Chinese abstract

我们研究了在Massart噪声存在下学习漂移概念的问题。在该框架中,在线学习者可以访问独立样本的历史记录,这些样本的标签是目标概念的带噪版本,且目标概念可能每轮发生变化。目标是每轮输出一个具有较小预测误差的假设。我们研究了基本类别——边缘可分离线性分类器(半空间)——的学习复杂性。在正面结果方面,我们给出了一种计算高效的学习器,其误差达到η + Õ(Δ^{1/3}/γ),其中η是Massart噪声率的上界,Δ是漂移率,γ是边缘。有趣的是,在可实现设置中,我们技术的改编产生了一个高效学习器,其误差率优于先前工作。在下界方面,我们提供了信息-计算权衡的形式化证据,强烈表明我们算法的性能本质上是最优的。具体来说,虽然信息论最优误差随Δ^{1/2}缩放,但我们证明即使在随机分类噪声的特殊情况下,Δ^{1/3}缩放对于低度多项式测试也是不可避免的。

English abstract

We study the problem of learning a drifting concept in the presence of Massart noise. In this framework, an online learner has access to a history of independent samples whose labels are noisy versions of a target concept that may change from round to round. The goal is to output, in each round, a hypothesis with small prediction error. We study the complexity of this learning problem for the fundamental class of margin-separable linear classifiers (halfspaces). On the positive side, we give a computationally efficient learner achieving error $\eta + \tilde{O}(\Delta^{1/3}/\gamma)$, where $\eta$ upper bounds the Massart noise rate, $\Delta$ is the drift rate, and $\gamma$ is the margin. Interestingly, in the realizable setting, an adaptation of our techniques yields an efficient learner with an improved error rate over prior work. On the lower-bound side, we provide formal evidence of an information-computation tradeoff, strongly suggesting that our algorithm's performance is essentially optimal. Specifically, while the information-theoretically optimal error scales with $\Delta^{1/2}$, we prove that $\Delta^{1/3}$-scaling is unavoidable for low-degree polynomial tests, even in the special case of random classification noise.
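
To make the gap between the two drift scalings concrete, the following minimal sketch evaluates both terms for a single drift rate. It drops constants, logarithmic factors, and the dependence on the noise rate and margin, so it only illustrates the exponents stated in the abstract. The function name `drift_terms` is hypothetical.

```python
def drift_terms(delta: float) -> tuple[float, float]:
    """Return (delta^(1/2), delta^(1/3)), ignoring constants and log factors."""
    return delta ** 0.5, delta ** (1.0 / 3.0)

# Example drift rate (an arbitrary illustrative value, not from the paper).
info_theoretic, low_degree = drift_terms(1e-6)

# Ratio between the two terms; algebraically this is delta^(-1/6).
gap = low_degree / info_theoretic
```

For a drift rate of 10^-6 the two terms are roughly 10^-3 and 10^-2, so the efficiently achievable term is about ten times larger than the information-theoretic one at this drift rate.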

2606.11148 2026-06-10 cs.CV new submission

MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

MOFA-VTON: 虚拟试衣中细粒度调整带来的更多时尚可能性

Xiaoyu Han, Chenyang Wang, Jing Wang, Shunyuan Zheng, Quanling Meng, Shengping Zhang

Affiliations: Harbin Institute of Technology; HiDream.ai Inc.; Harbin Institute of Technology (Weihai) Qingdao Research Institute

AI summary: Proposes MOFA-VTON, which lets users make fine-grained adjustments to garment layout in virtual try-on by drawing simple sketches. Using a mask construction strategy and layout adjustment blocks, it outperforms existing methods on the VITON-HD and DressCode datasets.

Comments
Accepted to CVPR 2026 (Highlight)
AI Chinese abstract

虚拟试衣旨在将店内服装图像贴合到特定人体上。理想的虚拟试衣方法应提供多样且灵活的着装选择,准确反映现实场景中不同的穿着风格,并根据个人偏好和时尚追求进行定制。然而,当前方法主要按照相同的穿着模式直接替换原始服装为目标服装。这种对服装适配的有限控制可能导致固定且单调的试衣输出。为了探索虚拟试衣中细粒度调整带来的更多时尚可能性,我们提出了一种新颖的虚拟试衣方法,称为MOFA-VTON,它允许用户通过简单草图调整试衣结果中的服装适配。具体来说,我们首先设计了一种掩码构建策略,将用户绘制的曲线草图转换为双区域掩码,取代传统的与服装无关的掩码,并为后续生成过程提供细粒度的布局指导。此外,我们提出了布局调整块,利用交叉注意力机制独立学习人体上半身和下半身区域的布局对应关系,细化这两个区域的空间排列。通过这些实现,我们的方法能够灵活地对目标服装进行细粒度调整,克服了固定布局的限制。在VITON-HD和DressCode数据集上的大量实验表明,我们提出的MOFA-VTON优于先前的最先进方法,并为虚拟试衣提供了更多时尚可能性。

English abstract

Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.

2606.11144 2026-06-10 cs.LG q-bio.GN q-bio.QM stat.AP new submission

OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

OncoTraj:EGFR突变非小细胞肺癌奥希替尼耐药纵向预测的公共基准

Abhijoy Sarkar, Aarchi Singh Thakur

Affiliations: Span AI

AI summary: Addresses the lack of a public benchmark for predicting resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer by proposing OncoTraj, which integrates data from 813 patients and defines three tasks. It finds that single-timepoint tissue NGS features leave every model near chance, while TP53 co-mutation is associated with a higher progression rate.

Comments
24 pages, 7 figures, 4 tables. Code, data, and trained model weights: https://github.com/span-ai-labs/oncotraj. Python package: pip install oncotraj. Dataset: https://huggingface.co/datasets/span-ai-labs/oncotraj-v1
AI Chinese abstract

EGFR突变非小细胞肺癌(NSCLC)对一线奥希替尼的耐药是治疗压力下可预测克隆演化的典型例子,但目前尚无用于训练或评估相应纵向患者轨迹计算模型的公共基准。我们推出OncoTraj,这是一个来自三个真实世界临床基因组数据源(MSK-CHORD(672名患者)、AACR Project GENIE BPC NSCLC(34名患者)和FLAURA分子耐药补充(107名患者))的813名接受一线奥希替尼治疗的EGFR突变NSCLC患者的公共基准。OncoTraj定义了三个锁定任务:(A)固定12个月标志点的进展二元分类,(B)首次进展时间(天)的回归,以及(C)主要耐药机制的六类分类。我们发布了统一的数据集、经过审计的无泄漏保证的患者级训练/验证/测试划分、一个开源评估框架,以及六个参考基线,涵盖多数类预测器、逻辑回归、随机森林、XGBoost、LSTM和多任务Transformer。使用v1的单时间点快照特征,所有模型在干净的源内评估中均未超过随机水平:这种天花板在不同模型类别中的一致性表明限制在于输入模态(单快照组织NGS而非连续ctDNA),而非算法。该基准确实恢复了可重复的、与文献一致的关联:TP53共突变使整个队列的12个月进展率从29%提高到59%。OncoTraj建立了一个可重复、经泄漏审计的基线,并将模态限制转化为针对富集连续ctDNA的v2的具体设计要求。

English abstract

Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
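
The idea of patient-level splits with an audited no-leakage guarantee can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hash-based assignment, the 70/15/15 fractions, the patient IDs, and the names `assign_split` and `audit_no_leakage` are hypothetical and are not taken from the OncoTraj code.

```python
import hashlib

def assign_split(patient_id: str, val_frac: float = 0.15, test_frac: float = 0.15) -> str:
    """Deterministically map a patient ID to a split (hypothetical scheme)."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # value in [0, 1) derived from the hash
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validation"
    return "train"

def audit_no_leakage(records) -> bool:
    """Return True if no patient ID is assigned to more than one split."""
    seen = {}
    for patient_id, split in records:
        if seen.setdefault(patient_id, split) != split:
            return False
    return True

# Hypothetical sample-level records: one patient may contribute several samples.
samples = ["P001", "P002", "P001", "P003", "P002"]
records = [(pid, assign_split(pid)) for pid in samples]
```

Because the split depends only on the patient ID, every sample from the same patient lands in the same split, which is the property the audit function checks.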

2606.11130 2026-06-10 cs.LG new submission

Robust Regression of General ReLUs with Queries

一般ReLU的鲁棒回归与查询

Ilias Diakonikolas, Daniel M. Kane, Mingchen Ma

Affiliations: University of Wisconsin-Madison; University of California, San Diego

AI summary: For robust squared-loss regression of general ReLUs under the Gaussian distribution, proposes the first efficient query algorithm, which reaches error O(opt)+ε with d polylog(1/ε)+Õ(min{1/p,1/ε}) label queries, and proves that the query complexity is near-optimal.

Comments
Appeared at NeurIPS 2025
AI Chinese abstract

我们研究在高斯分布下、以平方损失为度量,对一般(而非齐次)ReLU进行不可知学习的任务。在被动学习设置中,最近的工作给出了一个计算高效的算法,使用$poly(d,1/\epsilon)$个标记样本,输出误差为$O(opt)+\epsilon$的假设,其中$opt$是最佳拟合ReLU的平方损失。这里我们关注交互式设置,其中学习器对未标记样本的标签具有某种形式的查询访问。我们的主要结果是第一个计算高效的学习器,使用$d \operatorname{polylog}(1/\epsilon)+\tilde{O}(\min\{1/p, 1/\epsilon\})$个黑盒标签查询,其中$p$是目标函数的偏置,并达到误差$O(opt)+\epsilon$。我们通过证明其查询复杂度界在性质上接近最优来补充我们的算法结果,即使忽略计算约束。最后,我们确定查询访问对于改进被动学习的标签复杂度本质上是必要的。具体而言,对于基于池的主动学习,任何主动学习器都需要$\tilde{\Omega}(d/\epsilon)$个标签,除非它抽取了超多项式数量的未标记样本。

English abstract

We study the task of agnostically learning general (as opposed to homogeneous) ReLUs under the Gaussian distribution with respect to the squared loss. In the passive learning setting, recent work gave a computationally efficient algorithm that uses $\mathrm{poly}(d, 1/\epsilon)$ labeled examples and outputs a hypothesis with error $O(\mathrm{opt})+\epsilon$, where $\mathrm{opt}$ is the squared loss of the best-fit ReLU. Here we focus on the interactive setting, where the learner has some form of query access to the labels of unlabeled examples. Our main result is the first computationally efficient learner that uses $d \operatorname{polylog}(1/\epsilon)+\tilde{O}(\min\{1/p, 1/\epsilon\})$ black-box label queries, where $p$ is the bias of the target function, and achieves error $O(\mathrm{opt})+\epsilon$. We complement our algorithmic result by showing that its query complexity bound is qualitatively near-optimal, even ignoring computational constraints. Finally, we establish that query access is essentially necessary to improve on the label complexity of passive learning. Specifically, for pool-based active learning, any active learner requires $\tilde{\Omega}(d/\epsilon)$ labels, unless it draws a super-polynomial number of unlabeled examples.

2606.11129 2026-06-10 cs.CV new submission

WorldOlympiad: Can Your World Model Survive a Triathlon?

WorldOlympiad:你的世界模型能经受铁人三项考验吗?

Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang

Affiliations: Zhejiang University; DAMO Academy, Alibaba Group; The Hong Kong University of Science and Technology; Monash University; TRE, Alibaba Group

AI summary: Proposes the WorldOlympiad benchmark, which diagnoses video world models along three dimensions (physical faithfulness, geometric consistency, and interaction fidelity) and reveals significant shortcomings of existing models in physical reasoning, 3D consistency, and long-horizon interaction.

Comments
Project Page: https://alibaba-damo-academy.github.io/WorldOlympiad/, Code: https://github.com/alibaba-damo-academy/WorldOlympiad
AI Chinese abstract

我们介绍WorldOlympiad,一个用于诊断基于视频的世界模型在物理忠实性、几何一致性和交互保真度方面的基准。现有基准通常关注视觉质量、语义对齐或短期时间一致性,但很少能洞察生成视频是否遵循物理规则、保持连贯的3D结构以及支持长程可控交互。为弥补这一空白,WorldOlympiad将世界模型评估分解为三个互补维度。物理轨迹使用对象分割和MLLM作为评判者,评估生成视频是否遵循力学、热现象和材料属性中的可解释规则。几何轨迹通过高斯泼溅重建生成视频,评估结构一致性、跨视角连贯性和相机轨迹对齐。交互轨迹评估生成序列是否遵循复杂动作提示并在连续视频块间保持平滑连贯的过渡。WorldOlympiad进一步涵盖三个主要下游场景,包括游戏、机器人和通用真实世界视频,捕捉从交互控制、具身操作到开放域运动和相机动态的多样化挑战。这些轨迹和场景共同构成了一个可扩展且可解释的评估套件,揭示了超越通用视频质量的失败模式。对最先进模型的实验揭示了物理推理、3D一致性和长程交互方面的显著差距,强调了为生成式世界模型制定更结构化评估协议的必要性。

English abstract

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

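2606.11127 2026-06-10 cs.CL cs.AI new submission

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

基于来源的门控与自适应恢复在合成后训练数据筛选中的应用

Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth

Affiliations: Lexsi Labs

AI summary: Studies source-evidence gating and adaptive sample recovery in synthetic post-training data curation, proposing an adaptive recovery pipeline that combines failure diagnosis with targeted regeneration and improves yield, recovery rate, and injection recall.

AI Chinese abstract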

合成后训练流水线通常使用奖励模型或整体LLM评判器对生成的样本进行过滤,但两个实践很少被一起检验:过滤信号是否基于引发每个生成的来源证据,以及被拒绝的样本是否可以系统性地恢复而非永久丢弃。我们通过对抗性注入语料库提供真实故障标签,在门控配置、恢复策略和生成器规模上对这两个问题进行了受控研究。我们发现,精确的来源出处改善了更强评判器的忠实度门控;幻觉门控和奖励门控拒绝的样本群体大多不重叠,因此两者都是必要的;结合故障诊断与定向再生成的自适应恢复流水线比简单重采样实现了更高的产量、恢复率和注入召回率。下游微调质量主要由生成器规模驱动,过滤和恢复条件虽有重要贡献但处于次要地位。

English abstract

Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges; that hallucination and reward gates reject largely disjoint sample populations, making both necessary; and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

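2606.11123 2026-06-10 cs.LG new submission

Overcoming Rank Collapse in Feedback Alignment

克服反馈对齐中的秩坍缩

Gauthier Boeshertz, Razvan Pascanu, Claudia Clopath

Affiliations: Imperial College London; Mila

AI summary: Finds that Feedback Alignment (FA) fails in deep networks because its error signal has low rank, and proposes raising the signal's dimensionality with the Muon optimiser and hidden activity normalisation, improving ResNet-18 accuracy on CIFAR100 by 9 percentage points.

Comments
9 pages and 4 figures, 1 table for main text. Total of 28 pages and 13 figures with appendix
AI Chinese abstract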

反向传播(BP)被广泛认为在生物学上不可行,部分原因在于它要求反馈权重是前向权重的转置以进行误差传播。有趣的是,当使用固定的随机反馈权重训练网络以规避此问题时,学习过程会将前向权重与反馈权重对齐,导致反向传播的误差信号成为BP使用的标准梯度的近似。这一过程称为反馈对齐(FA),在MLP和非常浅的CNN中有效,但难以扩展到更深层的架构。在这项工作中,我们首先研究了在CIFAR10上训练的BP和FA模型之间的差异,特别关注信号的有效秩。我们发现FA误差的秩显著较低,因此被限制在比BP更低维的子空间中,限制了参数空间的探索。受此观察启发,我们评估了两种增加FA有效维度的机制:Muon,一种使权重更新正交化的优化器;以及隐藏活动归一化,促进激活正交性。在更大的架构和基准测试中,我们发现这些方法一致地优于FA基线,例如,在CIFAR100上使用ResNet-18,准确率提高了9个百分点。我们的结果将低维梯度动力学确定为扩展FA的关键障碍,并表明诱导更高维的更新几何是扩展反向传播替代方法的有前途的途径。

English abstract

Backpropagation (BP) is widely viewed as biologically implausible, in part because it requires feedback weights to be the transpose of forward weights for error propagation. Interestingly, when training a network with fixed random feedback weights to circumvent this issue, learning aligns the forward weights with the feedback weights, leading the backpropagated error signal to become an approximation of the standard gradient used by BP. This process, called Feedback Alignment (FA), occurs in MLPs and very shallow CNNs but does not scale well to deeper architectures. In this work, we first investigate differences between BP and FA models trained on CIFAR10, focusing on the effective rank of the signal. We find that the FA error has a considerably lower rank than the BP error and is hence constrained to a lower-dimensional subspace, limiting exploration of the parameter space. Motivated by this observation, we evaluate two mechanisms for increasing the effective dimensionality of FA: Muon, an optimiser that orthogonalises weight updates; and hidden activity normalisation, which promotes activation orthogonality. Across larger architectures and benchmarks, these methods consistently improve over FA baselines; for example, on CIFAR100 with a ResNet-18, accuracy increases by 9 percentage points. Our results identify low-dimensional gradient dynamics as a key obstacle to scaling FA and suggest that inducing higher-dimensional update geometry is a promising route toward scaling alternatives to backpropagation.
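
The abstract refers to the effective rank of a signal without stating a formula in this listing. One common definition, assumed here and possibly different from the one used in the paper, is the exponential of the Shannon entropy of the normalised singular values. The sketch below takes the singular values as given (for example, from an SVD of an error-signal matrix) and the function name `effective_rank` is hypothetical.

```python
import math

def effective_rank(singular_values) -> float:
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s)."""
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0  # no non-zero singular values
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)

# A flat spectrum uses all four directions equally.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
# A peaked spectrum is dominated by a single direction.
peaked = effective_rank([1.0, 0.01, 0.01, 0.01])
```

Under this definition a flat spectrum of four equal singular values has effective rank 4, while a spectrum dominated by one direction has an effective rank only slightly above 1, which is the kind of collapse the abstract describes for FA.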

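2606.11120 2026-06-10 cs.AI cs.CV new submission

Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

蒙特卡洛传球搜索:利用轨迹生成进行足球3D反事实传球评估

Andrew Kang, Priya Narasimhan

Affiliations: Carnegie Mellon University

AI summary: Proposes Monte Carlo Pass Search (MCPS), which combines a value model, a world model, and a counterfactual policy to evaluate football passes from 3D trajectory data, with two execution-surplus scores enabling distribution-aware pass analysis.

Comments
CVPR 2026, CVSports Workshop
AI Chinese abstract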

我们将足球传球评估重新定义为类似蒙特卡洛树搜索(MCTS)的评估问题,其组成部分大多以不同名称存在于文献中:价值模型(控球价值)、世界模型(带球交互的多智能体轨迹)以及反事实动作策略(带噪声的传球变体采样)。基于德甲联赛首个公开的高保真3D球轨迹跟踪数据集,我们引入了蒙特卡洛传球搜索(MCPS),该方法推断每个观察到的传球的踢球参数,采样执行变体和选项变体,使用球条件世界模型将每个候选向前滚动直到下一次球交互,并通过学习到的价值模型对结果进行评分,以获得所获价值的分布。该分布通过两种互补的执行盈余分数(基于均值和基于百分位的分数)实现分布感知的归因,用于分析和排名。为了使世界模型在有限的公开数据下具有样本效率,我们改编了来自自动驾驶的离散令牌自回归轨迹生成器(SMART),并表明与基线相比,它在最佳20次预测准确性上表现强劲,同时支持完全假设的展开以进行下游评估。我们已发布模型检查点和代码。

English abstract

We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high-fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball-conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution-aware attribution with two complementary execution-surplus scores used for analysis and ranking: mean-based and percentile-based scores. To make the world model sample-efficient under limited public data, we adapt a discrete-token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best-of-20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.
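
The abstract names mean-based and percentile-based execution-surplus scores but does not define them in this listing. The sketch below shows one plausible reading, stated purely as an assumption: the mean-based score is the observed pass value minus the mean value of the sampled variants, and the percentile-based score is the share of sampled variants that the observed pass matches or beats. The function name and the numbers are hypothetical.

```python
from statistics import mean

def execution_surplus(observed_value: float, sampled_values):
    """Compare one observed pass value with the values of sampled variants."""
    if not sampled_values:
        raise ValueError("need at least one sampled variant")
    # Assumed mean-based score: observed value relative to the average variant.
    mean_surplus = observed_value - mean(sampled_values)
    # Assumed percentile-based score: fraction of variants not exceeding the observed value.
    percentile = sum(1 for v in sampled_values if v <= observed_value) / len(sampled_values)
    return mean_surplus, percentile

# Hypothetical value gains for five sampled execution variants of one pass.
variants = [0.02, 0.05, 0.01, 0.08, 0.04]
surplus, pct = execution_surplus(0.06, variants)
```

The two scores answer different questions: the mean-based one is sensitive to the size of the gain, while the percentile-based one only ranks the observed pass within the sampled distribution.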

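2606.11119 2026-06-10 cs.LG cs.AI cs.CL new submission

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

TRACE:一种用于高效智能体强化学习的统一展开预算分配框架

Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji

Affiliations: Tsinghua University; Tencent

AI summary: To address insufficient reward contrast in multi-turn agentic reinforcement learning, proposes the TRACE framework, which models each ReAct-style thought-action-observation step as a semantic node and, within a fixed sampling budget, allocates rollouts to prompt roots and intermediate prefixes, strengthening reward contrast and the policy-update signal.

Comments
32 pages, 12 figures, 6 tables
AI Chinese abstract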

具有可验证奖励的强化学习(RLVR)是增强大型语言模型推理和智能体行为的一种有前景的方法。然而,展开密集的策略优化常常受到奖励对比度不足的限制,当过于简单或复杂的提示产生低方差反馈,以及当仅结果奖励对多轮展开中的每个决策赋予相同的终端评估时,就会出现这种情况。过去的努力集中在将可用的展开资源分配给有希望的提示,但它们仅利用提示级别的样本信息性,而忽略了同一展开中不同轮次之间前缀级别信息性的变化。本工作针对多轮智能体强化学习,将每个ReAct风格的思考-行动-观察步骤建模为语义上不同的节点,使得预算分配从提示根扩展到具有进一步延续的轮次级前缀,这自然形成了树状结构的展开。我们引入了树状展开分配用于对比探索(TRACE),这是一个统一的展开分配框架,在固定采样预算内增强奖励对比。在技术上,TRACE将展开预算分配给最可能产生混合终端奖励的提示根和中间前缀。一个共享的通用预测器根据前缀历史估计这些锚点处的条件成功概率,以指导这种分配。由此产生的自适应树状结构丰富了仅结果反馈,并放大了策略更新信号。实验上,TRACE在典型的智能体基准测试中取得了有竞争力的性能和效率提升,例如,在相同采样成本下,Qwen3-14B多跳问答的平均准确率比竞争基线提高了2.8个百分点。

English abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
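
The allocation principle described above, sending rollouts to anchors most likely to yield mixed terminal rewards, can be sketched in a few lines. This is an illustration under stated assumptions, not the TRACE algorithm: it scores each anchor by the Bernoulli variance p(1 - p) of its predicted success probability and splits an integer budget proportionally with largest-remainder rounding. The anchor names, probabilities, and the function `allocate_rollouts` are hypothetical.

```python
def allocate_rollouts(success_probs: dict, budget: int) -> dict:
    """Split an integer rollout budget across anchors in proportion to p * (1 - p)."""
    scores = {a: p * (1.0 - p) for a, p in success_probs.items()}
    total = sum(scores.values())
    if total == 0:
        # Every anchor is predicted to always succeed or always fail: nothing to contrast.
        return {a: 0 for a in success_probs}
    raw = {a: budget * s / total for a, s in scores.items()}
    alloc = {a: int(r) for a, r in raw.items()}
    # Largest-remainder rounding so the allocations sum exactly to the budget.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda a: raw[a] - alloc[a], reverse=True)
    for a in by_remainder[:leftover]:
        alloc[a] += 1
    return alloc

# Hypothetical predicted success probabilities at a prompt root and two turn-level prefixes.
predicted = {"prompt_root": 0.5, "prefix_turn_2": 0.9, "prefix_turn_4": 1.0}
plan = allocate_rollouts(predicted, budget=16)
```

An anchor with predicted success probability near 0 or 1 receives little or no budget because its rollouts would almost all share the same reward, whereas an anchor near 0.5 is the most likely to produce the mixed outcomes that give a useful contrast.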

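2606.11109 2026-06-10 cs.RO new submission

EM-Fall: Embodied mmWave Sensing for Day-and-Night Fall Detection on Humanoid Robots

EM-Fall: 用于人形机器人昼夜跌倒检测的具身毫米波感知

Yanshuo Lu, Yuxuan Hu, Shenghai Yuan, Xinyu Zhou, Kuangji Zuo, Bofan Lyu, XiChen Yuan, Jianfei Yang

Affiliations: MARS Lab; NTU; IOT Lab

AI summary: Proposes the EM-Fall framework, which combines mmWave sensing with a mobile humanoid robot that actively adjusts its viewpoint to detect falls across rooms and under occlusion, and designs a lightweight temporal model to handle pet interference and multipath effects. Robustness is validated in eight real environments.

AI Chinese abstract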

跌倒是老年人受伤和住院的主要原因之一,因此可靠的跌倒感知成为住宅环境中安全监测的重要能力。然而,现有的跌倒检测系统通常依赖于可穿戴设备或固定传感装置,可能存在用户依从性低、空间覆盖有限或在遮挡和光照不良条件下性能下降的问题。在这项工作中,我们提出了EM-Fall,一种部署在移动人形机器人上的具身跌倒检测框架。该系统将毫米波(mmWave)感知与机器人移动性相结合,使机器人能够主动调整其传感视角,并在跨房间和遮挡情况下保持目标可观测性。为了解决复杂住宅环境中的干扰,包括宠物运动和多径伪影,我们设计了一个以人为中心的感知流水线,结合轻量级时序建模,以捕捉跌倒事件前、中、后的运动演变。我们在八个真实室内环境中对四位参与者进行了系统评估,并构建了一个家庭毫米波跌倒检测数据集。实验结果表明,具身移动感知范式提高了监测连续性,并在多种环境条件下保持了鲁棒的跌倒检测性能。所提出的框架为家庭环境中的机器人辅助安全监测提供了一种实用解决方案。

English abstract

Falls are one of the leading causes of injury and hospitalization among elderly individuals, making reliable fall awareness an essential capability for safety monitoring in residential environments. However, existing fall detection systems often rely on wearable devices or fixed sensing installations, which may suffer from low user compliance, limited spatial coverage, or degraded performance under occlusion and poor lighting conditions. In this work, we propose EM-Fall, an embodied fall detection framework deployed on a mobile humanoid robot. The system integrates millimeter-wave (mmWave) sensing with robotic mobility, allowing the robot to actively adjust its sensing viewpoint and maintain target observability across rooms and under occlusion. To address interference in complex residential environments, including pet motion and multipath artifacts, we design a human-centered perception pipeline combined with lightweight temporal modeling to capture motion evolution before, during, and after fall events. We evaluate the proposed system across eight real indoor environments with four participants and construct an in-home mmWave fall detection dataset. Experimental results show that the embodied mobile sensing paradigm improves monitoring continuity and maintains robust fall detection performance under diverse environmental conditions. The proposed framework provides a practical solution for robot-assisted safety monitoring in home environments.

2606.11106 2026-06-10 cs.CV cs.AI new submission

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

FADA: 可访问的胎儿超声解读与标注——基于选择性蒸馏的统一视觉-语言模型

Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

Affiliations: Hamad Bin Khalifa University; HMC; Advanced AlRazi Diagnostic Center; Sidra Medicine

AI summary: Proposes FADA, a unified vision-language model that uses selective distillation to extract knowledge from four domain-specific foundation models and performs interpretation, classification, detection, and segmentation of fetal ultrasound. It trains on a single consumer GPU, needs no external labels, and runs offline on a smartphone.

AI Chinese abstract

全球范围内受过训练的超声技师短缺限制了低收入和中等收入国家的产前超声筛查,这些国家超过一半的孕妇未接受专业超声检查。当前的深度学习方法分别处理检测、分割或分类,每个任务都需要单独的模型和推理时的专家指定标签。我们提出FADA,一个基于Qwen3.5-VL构建的统一视觉-语言模型,通过单一解读优先的流程执行临床解读、分类、检测和分割,无需外部标签。FADA通过离线预计算特征缓存,从四个领域基础模型(FetalCLIP、UltraSAM、USF-MAE、UltraFedFM)中蒸馏知识。选择性蒸馏仅对标注任务应用特征对齐,而解读任务依赖标准微调,在大多数评估指标上持续优于完全蒸馏。推荐变体FADA-SKD在分割上达到0.8820平均Dice,检测上达到0.7671 mAP@0.50,结构化解读合规性达到100%。专家超声技师对237张图像的验证确认了在自主和人机协同模式下输出临床可接受,其中73.5%的解读在临床医生指导下获得完美评分。该系统可在单个消费级GPU上训练,无需云连接即可部署。我们通过在商用智能手机(高通骁龙7 Gen 1,12 GB RAM)上使用带GGUF量化的llama.cpp运行压缩的0.8B模型,验证了边缘部署,完全离线完成全部5阶段流程约需60秒。这为将AI辅助胎儿评估与便携式超声设备集成提供了实用途径,直接解决了资源受限环境中的诊断可及性差距。代码、模型和数据可在https://github.com/mahmoodphd/FADA获取。

English abstract

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.
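
The segmentation result above is reported as mean Dice. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on binary masks stored as nested lists. The example masks and the convention of returning 1.0 for two empty masks are assumptions for illustration and are not taken from the FADA evaluation code.

```python
def dice(pred, target) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks of equal shape."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    size_sum = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum

# Hypothetical 2x3 masks: each has three foreground pixels, two of which overlap.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
score = dice(pred_mask, true_mask)
```

Here the score is 2·2 / (3 + 3) = 2/3. A mean Dice figure is then the average of such per-image or per-structure scores.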

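2606.11105 2026-06-10 cs.CL cs.AI new submission

PhantomBench: Benchmarking the Non-existential Threat of Language Models

PhantomBench: 对语言模型非存在性威胁的基准测试

Haeji Jung, Hila Gonen

Affiliations: University of British Columbia; Canada CIFAR AI Chair, Amii

AI summary: Proposes PhantomBench, the first large-scale benchmark of non-existent concepts, with more than 60K fictitious terms and entities. Evaluating 21 models, it finds average hallucination rates as high as 86.7% in some cases, and even frontier models fail to abstain.

AI Chinese abstract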

幻觉,即语言模型生成事实无依据的响应,会带来严重风险,因为用户倾向于盲目依赖它们。在高风险领域,这种模型行为的后果可能导致重大伤害。尽管在理解幻觉方面取得了显著进展,但这些模型如何可靠地识别其知识边界仍不清楚。我们引入了PhantomBench,这是首个此类大规模基准,包含来自不同领域真实概念的6万多个不存在的术语和实体。使用我们的基准,我们评估了各种类型和大小的共21个模型。我们展示了令人震惊的幻觉率(在某些情况下平均高达86.7%),并注意到即使是前沿模型也令人惊讶地无法在不存在的概念上弃权,特别是当输入预设它们存在时。然后,我们展示了PhantomBench可以作为研究模型在罕见概念上行为的代理,这些概念更容易产生幻觉。我们还提供了一个构建PhantomBench的流程,使得能够根据研究人员和实践者的特定需求可扩展地生成不存在的概念。

English abstract

Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

2606.11096 2026-06-10 cs.CV new submission

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

IDEAL: 深度对齐实现离散表示自编码器

Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong, Yixuan Ren, Bo He, Yu-Gang Jiang, Zuxuan Wu

Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; University of Maryland, College Park

AI summary: Proposes the IDEAL framework, which jointly aligns quantized tokens with shallow and deep VFM features to improve the reconstruction quality of discrete representation autoencoders, achieving 0.61 rFID on ImageNet and a new state of the art for autoregressive image generation (gFID 1.89).

Comments
Code is available at https://github.com/Row11n/IDEAL
AI Chinese abstract

基于预训练视觉基础模型(VFM),表示自编码器(RAE)最近成为构建用于图像生成的语义丰富潜在空间的有前途方法。然而,它们的重建质量通常仍然次优,很大程度上是因为深层VFM表示没有保留足够的细粒度视觉细节。这种限制在离散化后变得更加严重,缺失的低级信息难以恢复。事实上,我们观察到浅层VFM特征保留了更丰富的局部外观和结构细节,这补充了现有RAE中使用的深层特征所携带的高级语义。受这种互补特性的启发,我们提出了Ideal,一种用于离散表示自编码的深度对齐框架。通过联合对齐量化令牌与浅层和深层VFM特征,Ideal使得生成的离散视觉令牌能够同时保留视觉保真度和丰富语义。大量实验表明,Ideal实现了卓越的重建性能,在ImageNet上达到0.61 rFID,比之前最佳方法高出0.28。当用于自回归图像生成时,Ideal进一步产生了1.89的gFID,为自回归图像生成建立了新的最先进水平。

English abstract

Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose IDEAL, an In-DEpth ALignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, IDEAL enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that IDEAL yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, IDEAL further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.

2606.11088 2026-06-10 cs.RO New submission

A Distributed Multi-UGV Exploration Framework With Loop-Aware Planning and Descriptor-Aided Localization in Resource-Limited Environments

Zhiwei Li, Haiou Liu, Xijun Zhao, Ji Li, Yingze Wang, Boyang Wang

Affiliations * School of Mechanical Engineering, Beijing Institute of Technology; China North Artificial Intelligence & Innovation Research Institute, Collective Intelligence & Collaboration Laboratory (CIC); Zhengzhou Intelligent Technology Research Institute, Beijing Institute of Technology

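AI summary: Proposes a fully distributed multi-UGV (unmanned ground vehicle) exploration framework that uses a lightweight LiDAR global descriptor for cross-UGV loop-closure detection, combined with loop-aware hierarchical planning, to reduce exploration time and travel distance in resource-limited environments.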

Journal ref
IEEE Transactions on Industrial Electronics, 2026
English abstract

Robust and efficient cooperative exploration with multiple unmanned ground vehicles (UGVs) in unknown, GPS-denied, and bandwidth-limited environments without prior maps remains challenging, as localization drift degrades map consistency and induces redundant coverage. This paper presents a fully distributed exploration framework that couples descriptor-aided inter-UGV loop closure with loop-aware hierarchical planning while enabling autonomous localization and exploration. We develop a lightweight LiDAR global descriptor with range-image prealignment to enable robust cross-UGV place recognition under large yaw and lateral variations, and use verified loop closures to maintain globally consistent trajectories and a sparse topological representation. We further introduce an uncertainty-aware cross-UGV loop-closure selection module that scores candidate loop closures under pose uncertainty and retains high-utility loop closures as planning anchors for global task allocation and local route refinement. Simulations and real-UGV experiments show that the loop-closure module achieves AR@1/AR@1% of 89.9%/95.5%, distributed optimization reduces absolute trajectory error, the system substantially reduces two-way communication volume, and the overall framework reduces exploration time and travel distance by 15% and 14%, respectively, compared with an mTSP baseline.

2606.11087 2026-06-10 cs.LG cs.AI New submission

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine

Affiliations * UC Berkeley; Physical Intelligence

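AI summary: Proposes the QGF algorithm, which pretrains a reference flow policy and a value function and, at test time, uses value gradients to guide the policy toward higher-value actions without additional policy learning; it outperforms existing test-time methods on offline RL benchmarks and is competitive with training-time methods.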

English abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.
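
The test-time guidance idea described in this abstract can be illustrated with a toy sketch. This is not the authors' QGF implementation: the abstract does not give the exact guidance rule, so the code below simply assumes that a scaled critic gradient is added to the reference velocity at each Euler step. The 1-D Gaussian "behavior policy", the quadratic critic, and all names (`reference_velocity`, `q_gradient`, `guidance_scale`) are hypothetical stand-ins for learned networks.

```python
import random

MU, SIGMA = 0.0, 0.5      # toy behavior-action distribution N(MU, SIGMA^2)
A_STAR = 1.0              # action that maximizes the toy critic


def reference_velocity(a, t):
    """Exact marginal velocity of the linear-interpolation flow from N(0, 1)
    noise to N(MU, SIGMA^2); a stand-in for a learned flow policy."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    gain = (t * SIGMA ** 2 - (1.0 - t)) / var_t
    return MU + gain * (a - t * MU)


def q_value(a):
    """Toy critic (assumption): a concave quadratic peaked at A_STAR."""
    return -(a - A_STAR) ** 2


def q_gradient(a):
    """Analytic gradient of q_value with respect to the action."""
    return -2.0 * (a - A_STAR)


def sample_action(rng, guidance_scale, num_steps=20):
    """Euler-integrate the flow ODE, adding a scaled critic gradient
    to the reference velocity (assumed guidance rule)."""
    a = rng.gauss(0.0, 1.0)          # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        drift = reference_velocity(a, t) + guidance_scale * q_gradient(a)
        a += dt * drift
    return a


def mean_q(guidance_scale, num_samples=2000, seed=0):
    """Average critic value of sampled actions for a given guidance scale."""
    rng = random.Random(seed)
    actions = [sample_action(rng, guidance_scale) for _ in range(num_samples)]
    return sum(q_value(a) for a in actions) / num_samples


if __name__ == "__main__":
    print("mean Q, reference policy:", mean_q(0.0))
    print("mean Q, guided policy:   ", mean_q(0.5))
```

With `guidance_scale=0.0` the sampler reduces to the unguided reference flow; a positive scale shifts samples toward higher critic values without retraining anything, which is the property the abstract emphasizes.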

2606.11081 2026-06-10 cs.LG cs.AI New submission

Unifying Local Communications and Local Updates for LLM Pretraining

Pietro Cagnasso, Eugene Belilovsky, Edouard Oyallon

Affiliations * Concordia University; Mila; CNRS, Sorbonne University

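AI summary: Proposes the GASLoC algorithm, a decentralized training framework that unifies local communication and local updates, outperforms DiLoCo under heterogeneous bandwidth, and supports adaptive optimizers and multiple local steps.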

Comments
38 pages, 9 figures
English abstract

Communication-efficient pre-training of LLMs is increasingly important as training draws on compute distributed across clusters, data centers, and lower-bandwidth links. Many practical methods reduce communication frequency but still rely on synchronous All-Reduce operations that maintain identical model states and tie progress to global collectives. This can become a bottleneck when bandwidth or worker speed is heterogeneous. We introduce GASLoC, a novel decentralized pre-training algorithm that generalizes the notion of communication acceleration to the recently popular "outer optimizer", yielding a practical gossip-based training framework that is compatible with adaptive optimizers, allows local optimizer steps, and can use sparse randomized peer communication. Empirically, on a number of standard LLM training tasks, GASLoC outperforms state-of-the-art decentralized algorithms in the single-step-per-communication setting across several topologies and, unlike existing decentralized methods in the LLM setting, achieves performance competitive with DiLoCo when using multiple local steps. In the heterogeneous-bandwidth setting, GASLoC can significantly outperform DiLoCo.
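
To make the contrast with synchronous All-Reduce concrete, here is a minimal sketch of plain gossip-based local SGD. It is not GASLoC itself: there is no outer optimizer or communication acceleration, and the scalar quadratic losses, the function name `local_sgd_with_gossip`, and all parameters are assumptions chosen for illustration. Each worker takes a few local gradient steps, then a single randomly chosen pair of workers averages its parameters.

```python
import random


def local_sgd_with_gossip(targets, rounds=200, local_steps=2, lr=0.1, seed=0):
    """Each worker i minimizes 0.5 * (x - targets[i])**2 locally, then one
    random pair of workers averages its parameters (no global All-Reduce)."""
    rng = random.Random(seed)
    params = [0.0 for _ in targets]            # one scalar "model" per worker
    for _ in range(rounds):
        # Local phase: independent gradient steps on each worker's own loss.
        for i, target in enumerate(targets):
            for _ in range(local_steps):
                params[i] -= lr * (params[i] - target)
        # Communication phase: a single randomly chosen pair gossips.
        i, j = rng.sample(range(len(params)), 2)
        params[i] = params[j] = 0.5 * (params[i] + params[j])
    return params


if __name__ == "__main__":
    targets = [-2.0, 0.0, 1.0, 5.0]
    params = local_sgd_with_gossip(targets)
    print("worker parameters:", params)
    print("network average:  ", sum(params) / len(params))
    print("global optimum:   ", sum(targets) / len(targets))
```

Pairwise averaging preserves the sum of the parameters, so in this toy setting the network average converges to the minimizer of the average loss even though no step involves all workers at once. Individual workers generally keep some disagreement, because each is still pulled toward its own target by the constant-step local updates.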