arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06671 2026-06-08 cs.CV New submission

JA-SIREN: Deterministic Initialization for Sinusoidal Networks via Spectral Matching

Mohammed Alsakabi, Kejia Hu, John M. Dolan, Ozan K. Tonguz

Affiliations: Department of Electrical and Computer Engineering, College of Engineering; The Robotics Institute, School of Computer Science; Carnegie Mellon University

AI summary: Proposes JA-SIREN, a deterministic initialization scheme that uses the discrete sine transform and the Jacobi-Anger expansion to analytically match a network's initial spectrum to the target signal, removing randomness. On the Kodak dataset it reaches a mean PSNR of 67.18 dB, 21.30 dB above the best baseline.

Abstract

Existing implicit neural representation (INR) approaches suffer from stochastic initialization that does not guarantee consistent or high-quality performance across runs, with variations reaching more than 2.5 dB (78%) in image regression. This variation is problematic for scientific computing and simulation, where result reproducibility is crucial. To address this problem, we present Jacobi-Anger Sinusoidal Representation Network (JA-SIREN), a deterministic initialization scheme for sinusoidal networks grounded in classical spectral analysis. By computing the Discrete Sine Transform (DST) of the target signal and leveraging the Jacobi-Anger expansion, we derive closed-form weights for a two-layer sinusoidal MLP that analytically match the network's initial spectral response to the target signal, requiring no random seed or additional hyperparameter tuning. On the Kodak dataset, JA-SIREN achieves a mean PSNR of 67.18 dB, a 21.30 dB improvement over the best baseline. This is achieved with zero run-to-run variance, confirming that spectrally-informed initialization is a more effective and reproducible alternative to stochastic initialization for sinusoidal INRs.
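
The abstract names the Discrete Sine Transform as the starting point for the closed-form weights. The sketch below only illustrates that ingredient: it computes a type-I DST of a short signal and turns the coefficients into frequencies and amplitudes of a single sum of sinusoids that interpolates the samples. The helper names, the sample grid, and the single-layer form are assumptions made for illustration; the paper's two-layer construction via the Jacobi-Anger expansion is not reproduced here.

```python
import math

def dst1(x):
    """Type-I DST: X[k] = sum_n x[n] * sin(pi * (n+1) * (k+1) / (N+1))."""
    n_pts = len(x)
    return [
        sum(x[n] * math.sin(math.pi * (n + 1) * (k + 1) / (n_pts + 1))
            for n in range(n_pts))
        for k in range(n_pts)
    ]

def sine_layer_from_dst(x):
    """Hypothetical helper: frequencies and amplitudes of a sum of sinusoids
    that reproduces x at the grid points t_n = (n+1)/(N+1)."""
    n_pts = len(x)
    coeffs = dst1(x)
    freqs = [math.pi * (k + 1) for k in range(n_pts)]
    # The type-I DST is its own inverse up to the factor 2/(N+1).
    amps = [2.0 * c / (n_pts + 1) for c in coeffs]
    return freqs, amps

def evaluate(freqs, amps, t):
    """Evaluate sum_k amps[k] * sin(freqs[k] * t)."""
    return sum(a * math.sin(w * t) for w, a in zip(freqs, amps))

signal = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1]
freqs, amps = sine_layer_from_dst(signal)
grid = [(n + 1) / (len(signal) + 1) for n in range(len(signal))]
recon = [evaluate(freqs, amps, t) for t in grid]
max_err = max(abs(a - b) for a, b in zip(signal, recon))
```

Because no random numbers are involved, repeated runs give identical weights, which is the reproducibility property the abstract emphasizes.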

2606.06667 2026-06-08 cs.CL New submission

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

Jiachen Zhao, Zhengxuan Wu, Aryaman Arora, Yiyou Sun, David Bau, Weiyan Shi

Affiliations: Northeastern University; Stanford University; University of California, Berkeley

AI summary: Proposes the Piggyback Hypothesis, under which chat-template tokens carry finetuned behavior over to unrelated domains, and designs TReFT, which regularizes token representations to mitigate emergent misalignment and is effective across multiple datasets.

Abstract

The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries. We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.
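
The abstract says TReFT regularizes specific token representations during training but does not state the form of the regularizer. As a rough illustration only, the sketch below assumes a squared-distance penalty that keeps the finetuned hidden states at chosen positions (for example, chat-template tokens) close to those of the unfinetuned model. The function names, the penalty form, and the weight `lam` are all hypothetical.

```python
def token_regularizer(hidden, reference, positions):
    """Mean squared distance between finetuned and reference hidden states
    at the selected token positions. Assumed penalty, not the paper's."""
    total, count = 0.0, 0
    for p in positions:
        for h, r in zip(hidden[p], reference[p]):
            total += (h - r) ** 2
            count += 1
    return total / count if count else 0.0

def regularized_loss(task_loss, hidden, reference, positions, lam=0.1):
    """Task loss plus the weighted representation penalty."""
    return task_loss + lam * token_regularizer(hidden, reference, positions)

# Toy hidden states: three token positions, two dimensions each.
reference = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # unfinetuned model
hidden = [[0.0, 3.0], [1.0, 0.0], [2.5, 0.5]]      # finetuned model
template_positions = [0, 1]                         # position 2 is left free

penalty = token_regularizer(hidden, reference, template_positions)
loss = regularized_loss(2.0, hidden, reference, template_positions, lam=0.5)
```

Only the listed positions are constrained, so the remaining tokens stay free to absorb the in-domain task.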

2606.06660 2026-06-08 cs.AI cs.PF cs.RO New submission

AEGIS: A Backup Reflex for Physical AI

Josef Chen

Affiliations: KAIKAKU

AI summary: Proposes AEGIS, which uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps and switches to a stronger policy only when needed. On LIBERO-Spatial it recovers 10.1% of the trajectories the weak policy alone loses.

Abstract

Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
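
The gating rule and the statistics named in the abstract can be sketched in a few lines. The gate below assumes a scalar risk score and a fixed threshold, which the abstract does not specify. The one-sided exact McNemar test and the Holm-Bonferroni step-down adjustment follow their textbook definitions. The counts and p-values in the example are made up and are not the paper's data.

```python
from math import comb

def gated_action(risk_score, threshold, weak_action, strong_action):
    """Hand control to the stronger policy only when the probe flags the step."""
    return strong_action if risk_score >= threshold else weak_action

def mcnemar_exact_one_sided(b, c):
    """P(X >= b) for X ~ Binomial(b + c, 0.5).
    b: pairs where only the treatment arm succeeds; c: only the control arm."""
    n = b + c
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

action = gated_action(0.9, 0.5, "weak", "strong")
p_small = mcnemar_exact_one_sided(5, 0)
p_mixed = mcnemar_exact_one_sided(3, 1)
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

With five discordant pairs all favoring the treatment, the one-sided p-value is 1/32; with three against one it is 5/16.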

2606.06658 2026-06-08 cs.LG cond-mat.stat-mech physics.comp-ph New submission

Capturing non-Markovian dynamics in non-equilibrium stochastic systems using flow matching

Bhargav Sriram Siddani, John B. Bell, Alejandro L. Garcia, Ishan Srivastava

Affiliations: Lawrence Berkeley National Laboratory; San Jose State University

AI summary: Coarse-grained stochastic PDEs fail to capture short-time non-Markovian effects and the non-Gaussian distributions of low-density regimes. The paper proposes a generative flow matching method that directly models the probability distribution of fluxes from particle simulations. On the Kramers first passage time problem it captures short-time behavior accurately and improves predictions of the statistical moments of the number density.

Comments: 5 pages, 1 figure, Accepted to 2026 Conference on Physics and AI (PAI26)

Abstract

Hydrodynamic models of stochastic particle systems represented by coarse-grained stochastic partial differential equations (SPDEs), such as the regularized Dean-Kawasaki (DK) equation, do not accurately capture the short-time system dynamics, which are dominated by non-Markovian effects, or low particle density regimes, where the distributions are highly non-Gaussian. We develop a generative flow matching method that directly models the probability distribution of fluxes from particle simulations and thereby explicitly incorporates non-Markovian and non-Gaussian effects. As a demonstration, we use this method to simulate the Kramers first passage time problem for a system of non-interacting Brownian particles. We show that the model accurately captures the short-time behavior and provides better predictions of the statistical moments of the number density than the solution of the Markovian baseline, the regularized DK equation.
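
For readers unfamiliar with flow matching, the sketch below shows the standard conditional construction with a straight-line path: a point on the path between a noise sample and a data sample, the constant velocity a network would be trained to regress, and Euler integration of that velocity to generate a sample. How the paper conditions the model on flux history is not described in the abstract, so no conditioning appears here. The variable names and the toy two-component "flux" are assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x0, velocity_fn, steps=10):
    """Push x0 from t=0 to t=1 with forward Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 1.0]
flux_sample = [2.0, -1.0]          # stand-in for a flux from a particle simulation
midpoint = interpolate(noise, flux_sample, 0.5)
v_star = target_velocity(noise, flux_sample)
# A trained network would replace this lambda; here the exact target is used.
generated = euler_integrate(noise, lambda x, t: v_star, steps=8)
```

With the exact target velocity, integration lands on the data sample, which is the behavior a trained model approximates in distribution.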

2606.06647 2026-06-08 cs.LG q-bio.NC New submission

The Identity Trap in EEG Foundation Models: A Diagnostic Audit

Jun-You Lin, Ying Choon Wu, Tzyy-Ping Jung

Affiliations: National Yang Ming Chiao Tung University; University of California, San Diego

AI summary: Proposes the FMScope protocol, whose five diagnostics (including variance decomposition and subject-axis erasure) show that EEG foundation models evaluated under subject-disjoint cross-validation may rely on subject-identity features rather than clinical biomarkers. The trap is shown to be widespread and removable.

Comments: 28 pages, 6 figures, 8 tables. Code available at https://github.com/Jimmy110101013/fmscope

Abstract

Objective. EEG foundation models (FMs) report strong accuracy on clinical resting-state EEG. However, high accuracy under subject-disjoint cross-validation remains ambiguous: it can reflect a genuine clinical biomarker, or subject-identity features that correlate with the label. We name this the Identity Trap and ask whether it can be diagnosed at the representation level before fine-tuning. Approach. We propose FMScope, a frozen-representation protocol packaging five diagnostics: variance decomposition, subject-axis erasure, aperiodic 1/f ablation, layer-wise label probing, and within-subject direction consistency. We apply it to three pretrained FMs (LaBraM, CBraMod, REVE) across four datasets in a 2x2 layout: subject relation of label x presence of a consensus cross-subject EEG marker. Main results. (i) The Identity Trap is universal: frozen subject-variance is 13-89x a random null in 12/12 pairs, rising in all 12 under fine-tuning (+10 to +63 pp). This dominance is a removable linear axis: erasing it improves label decoding where the label varies within subject (+6 to +12 pp in primary cells; +4 to +27 pp across external cohorts). (ii) Aperiodic 1/f is one subject carrier: removing it drops the subject probe by 9-19 pp on LaBraM and CBraMod. REVE saturates subject identity without measurable aperiodic dependence. (iii) Fine-tuning amplifies label-variance only in cells with a literature-established cross-subject marker. Significance. The Identity Trap is a physically-grounded instance of shortcut learning: the preferred cue has a measurable physiological component, and subject-disjoint splitting alone cannot rule it out. FMScope separates gains reflecting a biological marker from those reflecting subject identity.
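
Two of the diagnostics, variance decomposition and subject-axis erasure, can be illustrated on a single scalar feature. The sketch below measures the share of variance explained by subject identity as the between-subject sum of squares over the total, then removes it by subtracting each subject's mean. Both choices are simplifying assumptions: FMScope operates on frozen model representations, and its exact estimators are not given in the abstract.

```python
from collections import defaultdict

def _group(values, subjects):
    """Collect feature values per subject."""
    groups = defaultdict(list)
    for v, s in zip(values, subjects):
        groups[s].append(v)
    return groups

def subject_variance_fraction(values, subjects):
    """Between-subject sum of squares divided by the total sum of squares."""
    grand_mean = sum(values) / len(values)
    groups = _group(values, subjects)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return between / total if total > 0 else 0.0

def erase_subject_means(values, subjects):
    """Crude stand-in for subject-axis erasure: subtract each subject's mean."""
    means = {s: sum(g) / len(g) for s, g in _group(values, subjects).items()}
    return [v - means[s] for v, s in zip(values, subjects)]

feature = [1.0, 3.0, 5.0, 7.0]
subject = ["a", "a", "b", "b"]
before = subject_variance_fraction(feature, subject)
after = subject_variance_fraction(erase_subject_means(feature, subject), subject)
```

In this toy case 80% of the variance is between subjects before erasure and none after, while the within-subject differences that a label could depend on are untouched.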

2606.06646 2026-06-08 cs.CL cs.AI New submission

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

Jakub Bąba, Jarosław Chudziak

Affiliations: Faculty of Electronics and Information Technology, Warsaw University of Technology

AI summary: Proposes CAF-Gen, a multi-agent framework that uses an iterative creator-reviewer process to convert shallow argument structures into rich models compliant with the Carneades Argumentation Framework, overcoming the structural instability of single-pass generation.

Comments: Accepted for publication in the proceedings of ICCCI 2026

Abstract

Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. CAF-Gen employs an iterative Creator-Reviewer pipeline in which a creator agent's output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.
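
The control flow of an iterative creator-reviewer loop can be sketched without any language model. In the sketch below the reviewer is a plain structural check for three fields taken from the abstract's description of CAF (premise types, proof standard, argument scheme), and the creator is a stub that fills one missing field per round. The field names, the stub, and the round limit are all hypothetical stand-ins for the paper's agents.

```python
REQUIRED_FIELDS = ("premise_types", "proof_standard", "scheme")

def review(model):
    """Reviewer: list the required fields missing from a draft."""
    return [f for f in REQUIRED_FIELDS if f not in model]

def create(shallow, feedback, draft=None):
    """Stub creator: fixes the first reported problem, standing in for an LLM call."""
    draft = dict(draft or shallow)
    if feedback:
        draft[feedback[0]] = "placeholder"
    return draft

def enrich(shallow, max_rounds=5):
    """Alternate review and revision until the reviewer has no complaints."""
    draft = create(shallow, [])
    for round_no in range(1, max_rounds + 1):
        feedback = review(draft)
        if not feedback:
            return draft, round_no
        draft = create(shallow, feedback, draft)
    return draft, max_rounds

shallow_argument = {"claim": "X", "premises": ["p1"]}
enriched, rounds = enrich(shallow_argument)
```

The loop stops as soon as the reviewer accepts the draft, so a single-pass result that is already valid costs only one review.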

2606.06641 2026-06-08 cs.AI cs.LO New submission

Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

Cody J Christopher, Charles Gretton

Affiliations: School of Computing, Australian National University

AI summary: Presents AFSAT, a GPU-accelerated pseudo-Boolean SAT solver based on continuous local search. Massive parallelism through the JAX compiler substantially improves numerical stability, runtime, and memory efficiency.

Abstract

We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.
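
Continuous local search replaces Boolean variables with reals in [-1, 1] and minimizes a smooth objective. The toy sketch below uses one common multilinear relaxation: each clause contributes the probability that it is violated when every variable is rounded independently. It is restricted to plain OR clauses, runs on the CPU, and uses finite differences instead of JAX automatic differentiation, so it is not AFSAT's implementation. The sign convention (-1 means true) and all function names are assumptions.

```python
def clause_violation(clause, x):
    """Probability that an OR clause is violated when x[i] in [-1, 1] is rounded
    to true with probability (1 - x[i]) / 2. Literals are signed, 1-based."""
    prob = 1.0
    for lit in clause:
        p_true = (1.0 - x[abs(lit) - 1]) / 2.0
        prob *= (1.0 - p_true) if lit > 0 else p_true
    return prob

def objective(clauses, x):
    """Expected number of violated clauses under independent rounding."""
    return sum(clause_violation(c, x) for c in clauses)

def gradient(clauses, x, eps=1e-6):
    """Central finite differences (AFSAT itself relies on autodiff)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((objective(clauses, hi) - objective(clauses, lo)) / (2 * eps))
    return grad

def local_search(clauses, x, lr=0.5, steps=200):
    """Projected gradient descent on the box [-1, 1]^n."""
    for _ in range(steps):
        g = gradient(clauses, x)
        x = [min(1.0, max(-1.0, xi - lr * gi)) for xi, gi in zip(x, g)]
    return x

def satisfied(clause, assignment):
    """Check a clause against a Boolean assignment."""
    return any(assignment[abs(l) - 1] == (l > 0) for l in clause)

clauses = [[1, 2], [-1, 2], [1, -2]]
x_final = local_search(clauses, [0.2, -0.1])
assignment = [xi < 0.0 for xi in x_final]
```

At Boolean corner points the objective equals the number of violated clauses, so a minimum of zero at a corner corresponds to a satisfying assignment.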

2606.06635 2026-06-08 cs.CL cs.AI New submission

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

Affiliations: Department of Computer Science, Stanford University; Department of Aeronautics and Astronautics, Stanford University

AI summary: Uses token-level uncertainty signals to split language model reasoning failures into committed failures (early lock-in to a wrong path) and persistent uncertainty (uncertainty accumulating throughout), validates the framework's predictions across 23 model-dataset configurations, and draws guidance for self-consistency strategies.

Abstract

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate that our failure-mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.
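
One simple token-level uncertainty signal is the Shannon entropy of each next-token distribution, and a prefix-averaged version of it shows what "considering additional tokens" means for a detector. The abstract does not say which signals the paper uses, so entropy, the prefix mean, and the toy distributions below are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def prefix_scores(entropies):
    """Mean entropy over the first k tokens, for k = 1..len(entropies)."""
    scores, running = [], 0.0
    for k, h in enumerate(entropies, start=1):
        running += h
        scores.append(running / k)
    return scores

# Toy trace: uncertain first token, a confident token, then a coin flip.
trace = [[0.25, 0.25, 0.25, 0.25], [1.0], [0.5, 0.5]]
entropies = [token_entropy(p) for p in trace]
scores = prefix_scores(entropies)
```

A detector for committed failures would read only an early prefix score, while one for persistent uncertainty would use the score over the full trace.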

2606.06631 2026-06-08 cs.CV New submission

From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

Jessy Lauer

Affiliations: Rowland Institute at Harvard

AI summary: Proposes a physics-free pipeline that predicts 3D hip and knee contact forces from uncalibrated monocular video without markers, force plates, electromyography, subject-specific imaging, or a musculoskeletal model. A transformer fuses kinematics, body shape, activity text, and self-supervised video tokens, matching the accuracy of subject-specific musculoskeletal simulation across 26 patients and 25 activities.

Abstract

Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations ($0.32 \pm 0.08$ BW RMSE for hip; $0.23 \pm 0.03$ BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.
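
The accuracy figures are reported as RMSE in units of body weight (BW). As a small worked example of that metric, the sketch below converts forces from newtons to BW and computes the RMSE between a predicted and a measured trace. The value of g, the 100 kg mass, and the force values are illustrative assumptions, not data from the paper.

```python
import math

G = 9.81  # assumed gravitational acceleration in m/s^2

def to_body_weight(forces_newton, mass_kg):
    """Express forces as multiples of the subject's body weight."""
    return [f / (mass_kg * G) for f in forces_newton]

def rmse(pred, true):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

mass = 100.0                            # kg, so 1 BW = 981 N
measured = [981.0, 1962.0, 2943.0]      # N
predicted = [1079.1, 1962.0, 2844.9]    # N
error_bw = rmse(to_body_weight(predicted, mass), to_body_weight(measured, mass))
```

Normalizing by body weight makes errors comparable across subjects of different mass.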

2606.06627 2026-06-08 cs.RO cs.AI cs.CV cs.LG New submission

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

Affiliations: Massachusetts Institute of Technology; University of Illinois Urbana-Champaign; Harvard University

AI summary: Studies how hand pose quality and the motion gap affect transfer when cotraining robot manipulation policies on everyday Internet videos, and proposes a cotraining recipe that raises absolute success rate by 29.7% across six manipulation tasks in the low-robot-data regime.

Comments: Project website: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

Abstract

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of $29.7\%$ in the low-robot-data regime across six manipulation tasks.

2606.06618 2026-06-08 cs.RO cs.AI cs.LG New submission

ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition

Jungmin Seo, Jaesik Park

Affiliations: Seoul National University

AI summary: For long-horizon route planning when only short-horizon offline trajectories are available, proposes ChronoForest, which combines an anchor-chaining tree diffusion planner with an online multi-tree orchestrator to couple local bridge search with global route re-solving, substantially improving success rate and efficiency on OGBench and Hamiltonian route-composition benchmarks.

Comments: 40 pages, 4 figures, 7 tables, 3 algorithms

Abstract

How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline trajectories are available? This problem matters in offline navigation because collecting sufficiently rich long-horizon data is difficult, yet real agents must still solve long-range tasks with route-level efficiency rather than mere feasibility. The difficulty is twofold: at the microscopic level, composing many short-horizon segments creates a trade-off between search cost and path quality, while at the macroscopic level, waypoint ordering requires comparing pairwise travel costs among start, goal, and waypoint anchors that are unknown before planning and increasingly unreliable when estimated only from long-range temporal distance. In this paper, we propose ChronoForest, a closed-loop planning system that couples local bridge search and online route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator. ChronoForest uses temporal distance for short-range guidance and node evaluation, while using search-time bridge evidence to validate long-range anchor connectivity and repeatedly re-solve the route. On OGBench AntMaze-Stitch, ChronoForest achieves 99.8%, 99.3%, and 99.5% success on the medium, large, and giant splits and improves giant-stitch success by up to 34.5 points over prior reported diffusion-based results. On Hamiltonian route-composition benchmarks, online re-solving corrects poor temporal orderings and improves route quality while remaining substantially cheaper than exhaustive planning.
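
The macroscopic half of the problem, ordering waypoints from pairwise travel costs and re-solving when an estimate turns out to be wrong, can be shown with a brute-force solver on a four-node example. Exhaustive enumeration is used only because the example is tiny; the abstract positions ChronoForest as much cheaper than exhaustive planning. The cost matrix and the revised edge cost are made-up numbers.

```python
from itertools import permutations

def best_route(cost, start, goal, waypoints):
    """Cheapest start -> all waypoints -> goal path, found by enumeration."""
    best_path, best_cost = None, float("inf")
    for order in permutations(waypoints):
        path = (start, *order, goal)
        total = sum(cost[a][b] for a, b in zip(path, path[1:]))
        if total < best_cost:
            best_path, best_cost = path, total
    return best_path, best_cost

# Initial pairwise estimates (for example, from temporal distance).
cost = [
    [0, 1, 5, 9],
    [1, 0, 1, 6],
    [5, 1, 0, 1],
    [9, 6, 1, 0],
]
first_path, first_cost = best_route(cost, start=0, goal=3, waypoints=[1, 2])

# Suppose search-time evidence shows the 0 <-> 1 bridge is far more expensive.
cost[0][1] = cost[1][0] = 20
second_path, second_cost = best_route(cost, start=0, goal=3, waypoints=[1, 2])
```

Updating a single edge estimate flips the preferred waypoint order, which is the effect online re-solving is meant to capture.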

2606.06615 2026-06-08 cs.SD cs.AI cs.LG eess.AS New submission

FIGMA: Towards FIne-Grained Music retrievAl

Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha, Ramani Duraiswami

Affiliations: University of Maryland, College Park

AI summary: Existing music retrieval models cannot handle queries over fine-grained attributes. The paper proposes FIGMA, a multi-view contrastive architecture that jointly optimizes global audio-text alignment and frame-level, token-wise alignment to capture high-level semantics and fine-grained musical attributes in a unified representation space, with marked gains on a newly built fine-grained music caption dataset.

Comments: Accepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/

Abstract

Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. We then propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct the Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, and beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.
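
The two alignment views can be sketched as two scoring functions feeding a contrastive loss. Below, the fine-grained score takes, for each text token, its best-matching audio frame and averages over tokens, and the loss is a symmetric InfoNCE over a batch similarity matrix. The max-then-mean aggregation, the temperature, and the toy embeddings are assumptions borrowed from common late-interaction designs. The abstract does not give FIGMA's exact formulation.

```python
import math

def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def token_frame_score(tokens, frames):
    """Mean over text tokens of the best-matching audio-frame similarity."""
    return sum(max(dot(t, f) for f in frames) for t in tokens) / len(tokens)

def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE for a square similarity matrix with matches on the diagonal."""
    n = len(sim)

    def cross_entropy(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]
        return loss / n

    columns = [list(c) for c in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))

tokens = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fine_score = token_frame_score(tokens, frames)
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[0.0, 1.0], [1.0, 0.0]])
```

In a joint objective, one such loss would be computed from global embeddings and another from the token-frame scores, then summed with some weighting.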

2606.06614 2026-06-08 cs.CL cs.AI cs.HC New submission

Re-Centering Humans in LLM Personalization

Lechen Zhang, Jiarui Liu, Tal August

Affiliations: University of Illinois Urbana-Champaign; Carnegie Mellon University

AI summary: Studies the performance gap of LLM personalization between synthetic and human data. Collected human conversations and judgments expose system limitations in attribute extraction, relevant-attribute pairing, and personalized response generation, and lightweight training interventions are introduced to narrow the gap.

Abstract

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

2606.06601 2026-06-08 cs.CV cs.AI cs.LG New submission

Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy

Affiliations: Google; Black Forest Labs

AI summary: Proposes the DIRECT framework, which decomposes guidance into appearance, geometry, and context to enable object insertion with controllable 3D pose, outperforming existing methods in geometric controllability and visual quality.

Comments: ICML 2026; Project Page: https://gong1130.github.io/DIRECT/

Abstract

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

2606.06586 2026-06-08 cs.CL New submission

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O'Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp

Affiliations: University College London; Centre for Artificial Intelligence

AI summary: Introduces the PolyFact dataset and uses GRPO reinforcement learning to improve the cross-lingual factual recall consistency of large language models, outperforming supervised fine-tuning. Analysis shows that GRPO promotes shared cross-lingual representations by reducing language-specific specialization.

Comments: Under Review at EMNLP 2026

Abstract

Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
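
The abstract names Group Relative Policy Optimization (GRPO) but does not spell out the update. As a minimal sketch of the one step that gives GRPO its name, the snippet below normalizes the rewards of several sampled answers to the same prompt against their own group mean and standard deviation. The reward values, the function name, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    rewards: scores for several sampled answers to the same prompt.
    eps:     small constant that avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 if a sampled answer states the fact correctly, else 0.0.
sampled_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(sampled_rewards)
```

Answers that score above the group average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.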

2606.06576 2026-06-08 cs.LG astro-ph.EP astro-ph.IM stat.ML 新提交

Gaussian Process Latent Factor Regression for Low-Data, High-Dimensional Output Problems

高斯过程潜在因子回归用于低数据高维输出问题

Edward T. Stevenson, Eric T. Wolf, Mei Ting Mak, N. J. Mayne, Miles Cranmer

发表机构 * University of Cambridge(剑桥大学) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Oxford(牛津大学) University of Exeter(埃克塞特大学)

AI总结 提出高斯过程潜在因子回归(GPLFR)模型,通过将输出表示为低维潜在状态的线性高斯解码,联合优化压缩与预测,解决低数据高维输出回归问题,并首次构建岩石系外行星全球气候模型的空间分辨仿真器。

详情
Comments
9 pages content + 22 pages appendix/references. Supporting code at https://github.com/edstevenson/GPLFR
AI中文摘要

在科学领域,回归任务通常需要从少量训练样本预测高维输出。多输出高斯过程在低数据场景中表现出色,但通常难以处理高维输出。PCA-GP(主成分分析加高斯过程回归)等压缩-预测流程处理了高维性,但依赖于为重构而非预测优化的基。为弥补这一差距,我们提出一个模型,将每个输出表示为从高斯过程先验中抽取的低维潜在状态的线性高斯解码。通过解析地边缘化解码器权重,我们将压缩和预测耦合在一个可扩展到高维输出的单一目标中。我们将此模型称为高斯过程潜在因子回归(GPLFR)。我们通过构建首个岩石系外行星全球气候模型的空间分辨仿真器来演示GPLFR。

英文摘要

In the sciences, regression tasks often require predicting high-dimensional outputs from few training examples. Multi-output Gaussian processes excel in low-data regimes but typically struggle with high-dimensional outputs. Compress-then-predict pipelines such as PCA-GP (principal component analysis plus Gaussian process regression) handle high dimensionality, but rely on bases optimized for reconstruction rather than prediction. To address this gap, we propose a model that represents each output as a linear-Gaussian decoding of a low-dimensional latent state drawn from a Gaussian process prior. By analytically marginalizing the decoder weights, we couple compression and prediction in a single objective that scales to high-dimensional outputs. We refer to this model as Gaussian process latent factor regression (GPLFR). We demonstrate GPLFR by building the first spatially resolved emulator of global climate models for rocky exoplanets.

2606.06569 2026-06-08 cs.RO 新提交

PhyRoGen: Synthetic Generation of Physical Robot Manipulation Puzzles Using Procedural Content Generation

PhyRoGen:利用程序化内容生成技术合成物理机器人操作谜题

Lennart Julian Droß, Andreas Orthey, Marc Toussaint

发表机构 * Technical University of Berlin(柏林技术大学) Robotics Institute Germany(德国机器人研究所)

AI总结 提出PhyRoGen框架,利用程序化内容生成自动创建机器人操作谜题的合成数据集,生成的24个谜题可在1-300秒内求解,并在物理仿真中验证可操作性。

详情
Comments
8 pages, accepted at CASE 2026
AI中文摘要

机器人操作物理谜题对于自动装配和拆卸任务很重要。然而,为了让机器人解决物理谜题,需要学习操作技能,这需要大量的训练数据集,而数据集的生成通常耗时且繁琐。为了解决这个问题,我们提出了物理机器人操作谜题生成框架(PhyRoGen),它利用程序化内容生成(PCG)来自动生成操作谜题的合成数据集。PhyRoGen是一个通用谜题生成器,可以生成具有互锁对象依赖关系的物理谜题,其中必须先操作一个关节对象,然后才能移动另一个对象。基于PhyRoGen,我们定义了六个具体的生成器,用于生成24个物理谜题。通过使用基准测试框架,我们能够使用基于采样的规划算法在1到300秒内解决所有谜题。最后,我们通过使用KUKA LBR iiwa机器人在物理仿真中演示了每个生成的谜题都是可操作的。这表明我们的框架能够程序化地生成独特的、可解的机器人操作谜题,这是对操作算法进行基准测试和开发稳健基础模型的关键要素。

英文摘要

Robot manipulation of physical puzzles is important for automatic assembly and disassembly tasks. However, to enable robots to solve physical puzzles, manipulation skills need to be learned, which requires large training datasets, the generation of which is often time consuming and tedious. To overcome this problem, we propose the Physical Robot Manipulation Puzzle Generation framework (PhyRoGen), which leverages procedural content generation (PCG) for automated generation of synthetic datasets of manipulation puzzles. PhyRoGen is a general-purpose puzzle generator, which can generate physical puzzles with interlocking object dependencies, where one articulated object must be manipulated before another can be moved. Based upon PhyRoGen, we define six concrete generators which we use to generate 24 physical puzzles. By using a benchmarking framework, we are able to solve all puzzles in 1 to 300 seconds using sampling-based planning algorithms. Finally, we demonstrate that every generated puzzle is manipulatable by using a KUKA LBR iiwa robot in a physical simulation. This shows that our framework is able to procedurally generate unique, solvable robot manipulation puzzles, which is a crucial ingredient to benchmark manipulation algorithms and to develop robust foundation models.

2606.06564 2026-06-08 cs.LG cs.AI 新提交

WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

WAV:面向深度仅解码器Transformer的多分辨率块残差路由

Kehan Wang

发表机构 * Chongqing University(重庆大学)

AI总结 提出WAV v1方法,通过为每个块增加方向性细节基(相位基和分裂基)来增强残差路由,在深层Transformer中优于现有方法,48层时在TinyStories和Text8上取得更低验证损失。

详情
Comments
6 pages, 4 figures, 3 tables
AI中文摘要

残差连接对于训练深度Transformer至关重要,但标准的PreNorm残差流以固定的单位权重聚合子层更新。最近的注意力残差用内容相关的深度路由替代了这种固定累积,而块注意力残差通过对块级残差摘要进行路由使机制高效。然而,单个块摘要仅存储块内的低频总残差位移,丢弃了方向性结构,例如注意力与MLP的不平衡以及早期与晚期块的动态。我们提出WAV v1,一种用于仅解码器Transformer的轻量级多分辨率残差路由方法。WAV v1不是仅通过累积残差和来表示每个块,而是为每个块增加两个方向性细节基:一个对比注意力和MLP更新的相位基,以及一个对比早期和晚期子层更新的分裂基。这些基与标准块摘要一起通过相同的深度softmax混合器进行路由,而负细节源初始化和分离的RMS匹配稳定了训练。在字符级TinyStories和Text8语言建模中,WAV v1显示出明显的深度相关优势。尽管在12层时并非始终有益,但在24层时变得有竞争力,并在48层时优于所有基线。在48层时,相对于Block AttnRes,WAV v1将TinyStories上的验证损失从0.4960降至0.4738,Text8上从0.9363降至0.9305,且额外参数可忽略。这些结果表明,方向性残差细节(而不仅仅是块级和)对于在更深Transformer中扩展残差路由很重要。

英文摘要

Residual connections are central to training deep Transformers, but standard PreNorm residual streams aggregate sublayer updates with fixed unit weights. Recent Attention Residuals replace this fixed accumulation with content-dependent depth-wise routing, and Block Attention Residuals make the mechanism efficient by routing over block-level residual summaries. However, a single block summary stores only the low-frequency total residual displacement inside a block, discarding directional structure such as attention-vs-MLP imbalance and early-vs-late block dynamics. We propose WAV v1, a lightweight multi-resolution residual routing method for decoder-only Transformers. Instead of representing each block only by its accumulated residual sum, WAV v1 augments every block with two directional detail bases: a phase basis that contrasts attention and MLP updates, and a split basis that contrasts early and late sublayer updates. These bases are routed together with standard block summaries through the same depth-wise softmax mixer, while negative detail-source initialization and detached RMS matching stabilize training. On character-level TinyStories and Text8 language modeling, WAV v1 shows a clear depth-dependent benefit. Although it is not consistently beneficial at 12 layers, it becomes competitive at 24 layers and outperforms all baselines at 48 layers. At 48 layers, WAV v1 reduces validation loss relative to Block AttnRes from 0.4960 to 0.4738 on TinyStories and from 0.9363 to 0.9305 on Text8, with negligible additional parameters. These results suggest that directional residual details, not only block-level sums, are important for scaling residual routing in deeper Transformers.
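
The abstract describes three per-block quantities: the accumulated residual sum, a phase basis that contrasts attention and MLP updates, and a split basis that contrasts early and late sublayer updates. It gives no formulas, so the sketch below is one plausible reading that uses signed sums. The function name, the tags, and the even early/late split are assumptions; the routing mixer, the initialization, and the RMS matching are not modeled.

```python
def block_bases(updates, kinds):
    """Return (summary, phase, split) vectors for one block.

    updates: list of equal-length vectors, one per sublayer, in execution order.
    kinds:   parallel list of "attn" or "mlp" tags.
    """
    dim = len(updates[0])
    half = len(updates) // 2  # first half counts as "early" (assumed split point)
    summary = [0.0] * dim
    phase = [0.0] * dim
    split = [0.0] * dim
    for i, (update, kind) in enumerate(zip(updates, kinds)):
        phase_sign = 1.0 if kind == "attn" else -1.0
        split_sign = 1.0 if i < half else -1.0
        for d in range(dim):
            summary[d] += update[d]              # total residual displacement
            phase[d] += phase_sign * update[d]   # attention minus MLP
            split[d] += split_sign * update[d]   # early minus late
    return summary, phase, split


# Toy block with two attention and two MLP sublayers (values are made up).
updates = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
kinds = ["attn", "mlp", "attn", "mlp"]
summary, phase, split = block_bases(updates, kinds)
```

Under this reading the summary alone cannot tell a block dominated by attention from one dominated by MLP updates, while the phase vector can.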

2606.06560 2026-06-08 cs.LG cs.AI cs.HC 新提交

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

MacArena: 在在线macOS环境中基准测试计算机使用代理

Victor Muryn, Maksym Shamrai, Sofiia Mazepa, Yehor Khodysko

发表机构 * MacPaw

AI总结 提出MacArena基准,包含421个任务和50个应用,在Apple Silicon上运行,揭示macOS对GUI代理的独特挑战,模型排名在移植任务和原生任务间反转。

详情
Comments
Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026
AI中文摘要

计算机使用代理(CUA)通过视觉和控制原语操作图形用户界面(GUI),其能力迅速提升,部分得益于标准化在线评估基准(如OSWorld),这些基准既作为评估工具,也作为强化学习的训练环境。然而,macOS在此领域中仍未被充分覆盖:现有唯一基准macOSWorld仅覆盖少量第一方应用且任务较简单,并在与Apple Silicon不兼容的x86虚拟机上运行。我们引入MacArena,一个包含50个应用中421个手动验证任务的基准,结合了OSWorld任务的精选移植、来自macOSWorld的内容以及49个新的macOS原生任务,全部在Apple Silicon上的Apple原生虚拟化框架上运行。我们认为macOS呈现了Linux基准无法捕捉的独特GUI挑战,我们的评估支持这一观点:现有基准上的强模型性能可能反映对任务分布的熟悉程度,而非真正的跨平台GUI能力。值得注意的是,模型排名在移植任务和macOS原生任务之间发生反转,领先模型在MacArena子集上落后超过26%,表明macOS对当前GUI代理构成了一个真正更困难的环境。

英文摘要

Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple's native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.

2606.06559 2026-06-08 cs.SD cs.AI eess.AS 新提交

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

IRAF:面向噪声鲁棒的端到端全双工口语对话系统的抗干扰自适应融合

Tao Zhong, Jiajun Deng, Nikita Kuzmin, Yinke Zhu, Tianxiang Cao, Tristan Tsoi, Zhili Tan, Simon Lui, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) AudioLab Hong Kong, Huawei Leibniz Research Center(香港AudioLab,华为Leibniz研究中心) Nanyang Technological University(南洋理工大学)

AI总结 提出IRAF模块,通过逐帧预测可靠性门控来调节用户音频对LLM的贡献,提升全双工对话系统在干扰说话人环境下的响应质量和交互稳定性。

详情
AI中文摘要

全双工口语对话模型允许语音代理同时听和说,实现具有实时重叠的自然交互。然而,联合编码用户和代理流的端到端双通道模型在现实声学环境中可能会退化:干扰说话人泄漏到用户麦克风中,会被编码为用户查询的一部分,破坏LLM的条件,导致不稳定的轮流说话和响应质量下降。我们提出抗干扰自适应融合(IRAF),一个轻量级、流兼容的模块,逐帧调节用户音频对LLM的贡献。IRAF从目标说话人和用户音频嵌入中预测一个标量可靠性门控,并在与代理嵌入融合之前重新缩放用户表示。在MS-MARCO和InstructS2S-200K上的实验表明,在干扰说话人条件下,响应质量和全双工交互获得一致提升。

英文摘要

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
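
The gating step described above (predict a scalar reliability gate from target-speaker and user audio embeddings, then rescale the user representation frame by frame) can be sketched as follows. The linear layer followed by a sigmoid is an assumed form for the gate predictor, and all names and shapes are hypothetical; the abstract does not specify the architecture.

```python
import math


def reliability_gate(target_spk, user_frame, weights, bias):
    """Scalar gate in (0, 1) from the concatenated speaker and frame embeddings."""
    features = list(target_spk) + list(user_frame)
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def gate_user_stream(target_spk, user_frames, weights, bias):
    """Rescale every user frame by its own gate before fusion with agent embeddings."""
    gated = []
    for frame in user_frames:
        g = reliability_gate(target_spk, frame, weights, bias)
        gated.append([g * x for x in frame])
    return gated


# Hypothetical 1-dim speaker embedding and 2-dim user frames.
frames = [[2.0, 4.0], [1.0, -1.0]]
gated = gate_user_stream([0.5], frames, weights=[0.0, 0.0, 0.0], bias=0.0)
```

A gate near zero suppresses a frame dominated by an interfering speaker, while a gate near one passes the frame through unchanged.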

2606.06556 2026-06-08 cs.RO 新提交

Robots Need More than VLA and World Models

机器人需要的不仅仅是VLA和世界模型

Elis Karcini, Faisal Mehrban, Quang Nguyen, Mac Schwager, Arash Ajoudani, Cesar Cadena, Jan Peters, Marco Hutter, Haitham Bou-Ammar

发表机构 * Motoniq.ai Stanford University(斯坦福大学) Istituto Italiano di Tecnologia(意大利技术研究院) ETH Zurich(苏黎世联邦理工学院) Technical University of Darmstadt(达姆施塔特工业大学) UCL Centre for AI(伦敦大学学院人工智能中心)

AI总结 本文认为机器人通用智能的关键瓶颈不仅是策略学习,还缺乏将非结构化行为数据转化为机器人可用监督的机制,并提出了四种缺失的接口组件。

详情
AI中文摘要

通用机器人智能通常被框定为策略扩展问题:收集更多机器人演示,训练更大的视觉-语言-动作(VLA)模型,并期望更广泛的泛化。在这篇立场论文中,我们认为这种框架是不完整的。核心瓶颈不仅是策略学习,而是缺乏将世界上丰富的非结构化行为数据转化为有依据的机器人监督信号的机制。人类运动、互联网视频、仿真推演(rollout)和交互式演示包含关于任务、目标、接触、失败和物理约束的丰富信息,然而这些信息中的大部分无法直接被机器人策略使用,因为它们缺乏特定于具身的动作标签、任务语义和奖励结构。我们为下一代机器人识别了四个缺失的组件:用于自动标注非结构化行为的数据接口、用于将人类运动重定向到机器人动作的具身接口、用于物理接地3D推理的世界模型接口,以及用于从视频和语言推断任务进展和成功的奖励接口。我们调查了机器人基础模型、跨具身数据集、从视频学习、世界模型和奖励建模方面的最新进展,并提出了一个研究议程,以构建不仅能够从机器人演示中学习,而且能够从更广泛的物理世界中学习的机器人系统。

英文摘要

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.

2606.06550 2026-06-08 cs.SD cs.AI eess.AS 新提交

Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition

几何二阶特征相关性学习用于自监督语音情感识别

Shuanglin Li, Ruxiao Qian, Siyang Song

发表机构 * Xiangjiang Laboratory(湘江实验室) University of Exeter(埃克塞特大学)

AI总结 针对自监督语音情感识别中一阶聚合忽略特征相关性和黎曼几何的问题,提出二阶相关层,通过协方差描述子捕获协同共现模式,并利用对数欧几里得映射保持几何完整性,在ESD和RAVDESS数据集上有效恢复判别信息。

详情
AI中文摘要

自监督学习(SSL)为语音情感识别(SER)提供了强大且富含上下文的表示,但将这些表示聚合为整体描述符仍是一个瓶颈。传统的一阶聚合隐式假设特征独立,忽略了潜在的黎曼几何,并丢弃了对骨干网络表示能力至关重要的高阶关系。为解决这一问题,本文提出了一种新颖的二阶相关(SOC)层。SOC不孤立地处理特征,而是将特征相关性建模为协方差描述子,以捕获协同共现模式,这些模式可作为鲁棒情感识别的判别性签名。通过对数欧几里得映射(LEM)将这些描述子从黎曼流形映射到欧几里得切空间,所提方法在保持几何完整性的同时,实现了直接的线性判别学习。在ESD和RAVDESS数据集上的大量实验表明,SOC恢复了一阶池化中丢失的判别信息,并有效聚合了高维SSL特征。

英文摘要

Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these representations into holistic descriptors remains a bottleneck. Conventional first-order aggregation implicitly assumes feature independence, which overlooks the latent Riemannian geometry and discards higher-order relationships essential to the representational power of the backbone. To address this problem, this paper proposes a novel Second-Order Correlation (SOC) layer. Instead of treating features in isolation, SOC models feature correlations as covariance descriptors to capture synergistic co-occurrence patterns, which serve as discriminative signatures for robust emotion recognition. By mapping these descriptors from the Riemannian manifold to a Euclidean tangent space through Log-Euclidean mapping (LEM), the proposed method preserves geometric integrity while enabling direct linear discriminative learning. Extensive experiments on the ESD and RAVDESS datasets demonstrate that SOC recovers discriminative information lost in first-order pooling and effectively aggregates high-dimensional SSL features.

2606.06546 2026-06-08 cs.LG 新提交

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

Elmes*:面向长尾教育场景的大语言模型细粒度评估量规自动构建

Tao Liu, Ye Lu, Ruohua Zhang, Siyu Song, Wentao Liu, Aimin Zhou, Hao Hao

发表机构 * Shanghai Institute of AI for Education, East China Normal University(上海人工智能教育研究院,华东师范大学) School of Computer Science and Technology, East China Normal University(计算机科学与技术学院,华东师范大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出Elmes*框架,自动构建细粒度场景特定量规,用于评估大语言模型在教育场景中的多维教学能力,构建Edu-330基准并揭示模型差异。

详情
AI中文摘要

评估用于教育的大语言模型(LLMs)需要衡量模型如何教学,而不仅仅是它们知道什么。现有基准强调领域通用正确性或依赖手动设计的量规,这些量规难以扩展到长尾教学场景。我们引入Elmes*,一个用于构建、优化和应用细粒度场景特定量规的端到端框架。Elmes*结合了用于教师-学生-评判者交互的声明式多智能体引擎与SceneGen(一个自演化模块,从专家定义的教学维度共同优化评估标准和测试数据)。使用Elmes*,我们构建了Edu-330,涵盖11个学科、3个年级段和10种任务类型的330个场景,包含超过1000个二级指标。在Edu-330和四个专家撰写的黄金标准场景上的实验表明,教育能力是多维的:顶级LLM主要在创造力和价值观整合方面存在差异,知识强的模型可能在苏格拉底式支架教学中失败,而教育专用模型InnoSpark获得了最佳的人工评估平均分。LLM评判者保持了与人类可比的排名,且评分方差低得多,但表现出评判者特定的偏见,如自我偏好。消融实验表明,专家评分的少样本锚定改善了人机对齐,而推理强制和贪婪解码的效果因模型而异。因此,Elmes*为基于教学法的LLM评估提供了可扩展的诊断基础设施。

英文摘要

Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. We introduce Elmes*, an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes* combines a declarative multi-agent engine for teacher–student–judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data from expert-defined pedagogical dimensions. Using Elmes*, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, knowledge-strong models may fail at Socratic scaffolding, and the education-specialized InnoSpark achieves the best human-evaluated average score. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Ablations show that expert-scored few-shot anchoring improves human–LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes* thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.

2606.06539 2026-06-08 cs.CV cs.AI cs.LG cs.NE 新提交

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

合成基准高估了前向-前向扩展:真实数据对逐层训练的限制

Yucheng Chen

发表机构 * Amplimit

AI总结 通过DTG-FF方法在真实数据上评估前向-前向学习的扩展性,发现其与反向传播的差距随类别数增加而扩大,合成任务高估了其迁移能力,且内存优势不成立。

详情
Comments
23 pages, 6 figures
AI中文摘要

前向-前向(FF)学习[Hinton, 2022]用严格的逐层良好性更新取代了反向传播。最近的FF-CNN工作在32x32基准上缩小了与BP的差距,引发了逐层训练是否在现实规模下成为可行替代方案的问题。为了严格探究这一点,我们开发了DTG-FF——动态温度良好性、解耦归一化和多层融合——作为在九个真实数据基准上设定FF系列最先进水平的工具(CIFAR-10上91.8%,以及ImageNet-100 224x224上的首个FF基线),并用它来审计逐层训练实际能扩展到何种程度。(1)真实数据扩展。在相同配方和主干下,架构匹配的BP-DeepSup基线在CIFAR-10/CIFAR-100上分别超过DTG-FF 2.40/5.93个百分点,且差距随类别数增加而扩大。在224x224分辨率下,同一工具仅达到49.4%——这是该尺度下的首个FF基线,而典型BP超过75%[Tian et al., 2020]——暴露了在32x32下不可见的真实数据上限。(2)合成与真实K冲突。在合成教师-学生任务中,随着类别数K增长,DTG-FF越来越优于BP;而在真实图像上,FF-BP差距符号反转并随K扩大。数据集内CIFAR-100粗粒度与细粒度探针将标签层次与图像分布分离:合成K扫描将输出维度与细粒度判别难度混淆,从而高估了FF的可迁移性。(3)系统审计。FF可以在不存储深度激活的情况下实现,但在普通8 GB硬件上,标准BP+梯度累积达到4.18 GB / 157 imgs/s,而DTG-FF为7.90 GB / 138 imgs/s,因此在公平基线支持下,基于内存的理由在此规模下不成立。

英文摘要

Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.

2606.06538 2026-06-08 cs.CV 新提交

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

WorldBench: 一个具有挑战性且视觉多样的多模态推理基准

Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学) NYU(纽约大学) University of Waterloo(滑铁卢大学) Meta, FAIR(Meta和FAIR)

AI总结 提出WorldBench,通过构建多领域视觉概念分类法并收集多样化图像,设计前沿MLLM难以回答的问题,以评估多模态大语言模型的视觉理解能力,揭示其弱点。

详情
Comments
Project page: https://worldbench-vl.github.io/
AI中文摘要

在现实世界应用中,模型被期望在不同设置下可靠地执行。然而,许多现有的多模态基准扩展了任务类型,但没有捕捉到处理开放视觉输入所需的视觉多样性。我们提出了WorldBench,一个具有挑战性且视觉多样的推理基准,用于评估多模态大语言模型(MLLM)。我们构建了一个跨多个领域(例如,生物)的数千个视觉概念的分类法。在该分类法的指导下,我们从搜索引擎和现有数据集中策划了一个广泛的图像集合,以全面代表视觉世界。通过结构化的试错,我们手动设计了前沿MLLM无法回答的具有挑战性的问题。在定量和人工评估中,WorldBench比任何现有的多样化基准实现了更高的视觉多样性。在WorldBench上评估15个MLLM揭示了视觉理解中的弱点:即使是最强的模型也只达到64.0%的准确率,而一些模型的表现略高于随机水平。我们希望我们的工作强调视觉多样性在构建多模态基准中的重要性。

英文摘要

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

2606.06532 2026-06-08 cs.CV 新提交

GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

GOPAgen: 基于结构记忆与层次推理的运动感知高效智能长视频理解

Haozhe Chi, Yang Jin, Yadong Mu

发表机构 * Peking University(北京大学)

AI总结 提出GOPAgen方法,通过视频编解码的GOP运动代理、GOP树推理算法和结构记忆机制,实现高效长视频理解,在多个VQA基准上取得领先性能。

详情
AI中文摘要

尽管在智能长视频理解方面取得了显著进展,现有方法仍然缺乏详细的运动理解以及高效的内存架构。在本文中,我们提出GOPAgen,一种新颖的方法,该方法首先通过精心设计的运动代理将视频编解码器集成到视频理解框架中,该代理基于视频编解码器中的图像组(GOP)进行训练。我们进一步开发了GOP树推理算法,该算法与视频编解码器自然对齐,增强了模型理解视频中局部细节运动的能力。此外,我们精心设计了一种结构记忆机制,将局部运动信息与结构页面中的详细描述相结合,并提出了一种高效的从粗到精的缩放算法,以充分利用结构记忆。此外,我们将运动矢量数据库纳入框架,以实现不同粒度运动矢量的高效检索。总体而言,我们的方法在各种视频理解基准(包括MotionBench和Egoschema)上取得了优越的视频问答(VQA)性能,从而证明了我们提出框架的优越性。

英文摘要

Despite significant progress in agentic long-video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates the video codec into the video understanding framework via a motion agent trained on Groups of Pictures (GOPs) from the codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with the video codec and enhances the model's ability to understand local detailed motions in videos. Additionally, we design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema.

2606.06526 2026-06-08 cs.AI cs.LG 新提交

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

CrowdMath: 众包数学研究讨论数据集

Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky

发表机构 * University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校) San Jose State University(圣何塞州立大学) Massachusetts Institute of Technology(麻省理工学院) Dartmouth College(达特茅斯学院) Amazon AGI(亚马逊人工智能研究院)

AI总结 提出CrowdMath数据集,包含164条专家标注的进展链,用于评估大语言模型在协作开放问题求解中的能力,发现模型在局部预测上表现良好但在角色分类上存在不足。

详情
Comments
16 pages, 4 figures
AI中文摘要

大型语言模型在数学推理方面取得了实质性进展,但现有基准通常评估具有最终答案、逐步解决方案或完整证明的明确问题。它们没有捕捉到协作开放问题求解:参与者提出部分论证、识别先前步骤中的空白或错误、修复有缺陷的推理,并逐步将增量贡献综合成证明。我们引入了CrowdMath,一个包含164条专家标注的进展链的数据集,来自MIT PRIMES–Art of Problem Solving (AoPS) CrowdMath项目(2016-2025),这是一个协作研究计划,其讨论已导致同行评审的出版物。每条链追踪一个从开放问题陈述到完成证明的多参与者论坛讨论。帖子根据其在不断演变的解决方案过程中的功能角色进行标注,包括部分进展、证明完成、错误推理和错误识别。我们定义了评估任务并对六个前沿模型进行了基准测试。模型在下一帖子预测上达到83-88%的准确率,表明它们能够跟随数学讨论的局部流程。然而,它们难以识别单个贡献的功能重要性,最佳模型在帖子角色分类上仅达到0.42的宏F1分数。CrowdMath揭示了解决明确数学问题与理解协作数学进展之间的差距。

英文摘要

Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
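
The post-role result above is reported as macro-F1, the unweighted mean of per-class F1 scores. The sketch below computes it on invented labels to show why a classifier that misses a rare role scores poorly even when most of its predictions are right. The role names echo the abstract, but the data and the function are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented example: the rare role is never predicted.
gold = ["partial_progress"] * 8 + ["error_identification"] * 2
pred = ["partial_progress"] * 10
score = macro_f1(gold, pred)  # 8 of 10 correct, yet macro-F1 is about 0.44
```

Because every role counts equally, the zero F1 on the rare class pulls the average down to roughly half of the majority-class score.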

2606.06523 2026-06-08 cs.AI cs.LG cs.LO cs.SE 新提交

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Lean4Agent:面向智能体工作流与轨迹的形式化建模与验证

Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Independent researcher(独立研究者)

AI总结 提出Lean4Agent框架,利用依赖类型形式语言Lean4对智能体工作流进行形式化建模与验证,通过FormalAgentLib库和LeanEvolve方法提升工作流可靠性,实验验证通过的工作流性能平均提升11.94%。

详情
AI中文摘要

使大型语言模型(LLMs)能够执行可靠的多步工作流已成为人工智能领域的核心挑战。尽管LLMs的智能体能力近期取得了进展,但大多数智能体系统仍缺乏用于指定、验证和调试其工作流及执行轨迹的形式化方法。这一挑战类似于数学中长期存在的问题,其中自然语言(NL)的模糊性促使了形式语言(FL)的发展。受此范式启发,我们提出了Lean4Agent,据我们所知,这是首个使用依赖类型形式语言Lean4来建模和验证智能体行为的框架。Lean4Agent推出了FormalAgentLib,一个可扩展的Lean4库,用于在显式假设下形式化建模和验证智能体工作流的语义一致性,并能够定位轨迹揭示的运行时故障。基于FormalAgentLib,我们进一步开发了LeanEvolve,它应用FormalAgentLib中的结果来修订工作流以增强其能力。在SWE-Bench-Verified的困难子集和ELAIP-Bench子集上,针对5个领先LLMs的大量实验表明,通过验证的工作流比未通过的工作流平均性能提升11.94%,而LeanEvolve进一步将SWE性能平均提升7.47%。此外,Lean4Agent为使用表达能力强的依赖类型形式语言形式化建模和验证智能体行为这一新领域奠定了基础。

英文摘要

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose Lean4Agent, to the best of our knowledge the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. Lean4Agent introduces FormalAgentLib, an extensible Lean4 library for formally modeling and verifying agent workflows' semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on FormalAgentLib, we further develop LeanEvolve, which applies results in FormalAgentLib to revise workflows to enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of 11.94%, and LeanEvolve further improves SWE performance by 7.47% on average. Furthermore, Lean4Agent establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior.

2606.06519 2026-06-08 cs.AI cs.LG 新提交

SafeGene: Reusable Adapters for Transferable Safety Alignment

SafeGene: 可重用的适配器实现可迁移的安全对齐

Yanghan Wang, Zhiqiang Kou, Fu Feng, Jing Wang, Xin Geng

发表机构 * Southeast University(东南大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出SafeGene,一种可重用的安全适配器模块,通过从对齐-退化模型差异中提取安全表示,并利用数据感知层选择和少样本系数重校准,实现跨任务的安全恢复,在保持下游性能的同时降低有害响应率。

详情
AI中文摘要

开放权重的LLM越来越多地被微调成定制助手,但下游微调可能会削弱安全对齐,使模型更容易受到恶意提示的攻击,即使训练数据并非有意有害。这造成了反复的安全恢复问题,因为目标模型会随着新任务数据或用户交互而不断更新。我们提出SafeGene,一种可重用的安全适配器模块,设计用于每个架构兼容模型家族内的跨任务重用。SafeGene不将安全恢复视为特定于模型的修复步骤,而是将安全能力视为一种独立的、可重用的适配器表示,与任务特定更新解耦。这种表示从对齐-退化模型差异中获得,通过数据感知层选择精炼成任务可迁移的安全向量,并通过少样本逐层系数重校准在每个下游任务适应模型中表达。跨多个模型家族、下游任务和安全评估者的实验表明,SafeGene增强的模型在降低有害响应率的同时保持下游性能,在安全-效用权衡中优于代表性的安全适应方法。

英文摘要

Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned–degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety–utility trade-off.
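
The abstract outlines three stages: take the discrepancy between an aligned and a degraded model, keep it only for selected layers, and re-express it in a task-adapted model with per-layer coefficients. The sketch below reads the discrepancy as a plain weight difference that is added back with a scalar per layer. That additive form, the function names, and the toy weights are assumptions; the data-aware layer selection and the few-shot recalibration are taken as given inputs rather than implemented.

```python
def safety_vectors(aligned, degraded, selected_layers):
    """Per-layer difference between aligned and degraded weights for chosen layers."""
    return {
        name: [a - d for a, d in zip(aligned[name], degraded[name])]
        for name in selected_layers
    }


def apply_safety(task_model, vectors, coeffs):
    """Add each safety vector to the task-adapted weights, scaled per layer."""
    patched = {name: list(weights) for name, weights in task_model.items()}
    for name, vec in vectors.items():
        alpha = coeffs.get(name, 0.0)  # a missing coefficient leaves the layer unchanged
        patched[name] = [w + alpha * v for w, v in zip(patched[name], vec)]
    return patched


# Toy two-layer "models" stored as flat weight lists (all values are made up).
aligned = {"layer0": [1.0, 2.0], "layer1": [3.0, 3.0]}
degraded = {"layer0": [0.5, 1.0], "layer1": [3.0, 3.0]}
task_model = {"layer0": [0.0, 0.0], "layer1": [9.0, 9.0]}

vectors = safety_vectors(aligned, degraded, selected_layers=["layer0"])
patched = apply_safety(task_model, vectors, coeffs={"layer0": 2.0})
```

Keeping the vectors separate from any one task model is what would let the same extracted representation be reused across differently fine-tuned models of a compatible architecture.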

2606.06464 2026-06-08 cs.CL cs.AI Cross-list

Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

Affiliations: Mila - Quebec AI Institute; McGill University; University of California Berkeley; New York University; Meta FAIR; MIT Media Lab; Montreal Neurological Institute

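AI summary: Using an active-exploration experiment, this study finds that active exploration substantially improves adults' reasoning about conjunctive causal rules, although conjunctive rules still require more tests to infer. It also compares large language models, finding that some approach human accuracy on hypothesis inference but explore less efficiently and show a similar conjunctive-disjunctive performance gap.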

Details
Comments
Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

English abstract

A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.
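The difference between the two rule structures can be made concrete with a toy detector. This is a minimal sketch, not the study's task: the three objects, the choice of blickets, and the modeling of the conjunctive rule as "every blicket must be on the detector" are assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List


def detector_fires(placed: FrozenSet[str], blickets: FrozenSet[str],
                   rule: str) -> bool:
    """Return True if the toy detector activates for the placed objects."""
    hits = placed & blickets
    if rule == "disjunctive":
        # Any single blicket is enough.
        return len(hits) >= 1
    if rule == "conjunctive":
        # Assumption: all blickets must be present at the same time.
        return len(blickets) > 0 and hits == blickets
    raise ValueError(f"unknown rule: {rule}")


def all_interventions(objects: FrozenSet[str]) -> List[FrozenSet[str]]:
    """Every non-empty subset of objects a learner could place on the detector."""
    items = sorted(objects)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


# Hypothetical setup: three objects, two of which are blickets.
objects = frozenset({"A", "B", "C"})
blickets = frozenset({"A", "B"})

# Testing one object at a time (order: A, B, C).
single_tests = [frozenset({o}) for o in sorted(objects)]
disj_single = [detector_fires(t, blickets, "disjunctive") for t in single_tests]
conj_single = [detector_fires(t, blickets, "conjunctive") for t in single_tests]

# How many of the 7 possible interventions activate the detector under each rule.
disj_count = sum(detector_fires(t, blickets, "disjunctive")
                 for t in all_interventions(objects))
conj_count = sum(detector_fires(t, blickets, "conjunctive")
                 for t in all_interventions(objects))

print(disj_single, conj_single)  # [True, True, False] [False, False, False]
print(disj_count, conj_count)    # 6 2
```

In this toy setup, single-object tests identify every blicket under the disjunctive rule but never activate the detector under the conjunctive rule, which is one intuition for why conjunctive rules would take more tests to infer.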