arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24340 2026-05-26 cs.LG

ChainzRule: Sample-Efficient, Robust Deep Learning Across Tabular, NLP, and Vision Tasks

Rowan Martnishn

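AI summary: Proposes the ChainzRule architecture, which replaces activation functions with learnable polynomial layers combined with Differential Regularization; by bounding intermediate derivatives it yields low-frequency, structurally stable representations and achieves better performance across several domains with less data and at standard inference cost.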

English abstract

Production deep learning systems across enterprise domains operate under constraints that academic benchmarks routinely obscure: labeled data is expensive, inference budgets are tight, and models that cannot explain their behavior are difficult to trust and maintain. We present ChainzRule (CR), a neural architecture replacing typical activations with learnable polynomial layers governed by Differential Regularization (DREG), a layer-wise Jacobian penalty computed analytically during the forward pass at standard inference cost. The core claim is that bounding intermediate derivatives forces the network toward low-frequency, structurally stable representations, simultaneously reducing dependence on labeled data volume, improving robustness to distribution shift, and providing a measurable, gradient-based handle on model behavior. Evaluated across five domains, CR achieves $85.71\% \pm 2.01\%$ on Pima Diabetes (statistically superior to SVM and XGBoost), $46.20\% \pm 0.37\%$ on SST-5 sentiment classification with a frozen encoder (superior to RNTN using approximately 5% of its training data), $55.79\%$ on SST-5 with a fine-tuned BERT backbone (versus BERT-base linear head at $54.9\%$), $70.17\%$ on Yelp Full ordinal regression with 3.2M parameters versus a 10-model average of $66.35\%$, and $+2.32\%$ mean corruption accuracy on CIFAR-10-C. All results with reported $p$-values fall below the $\alpha = 0.05$ threshold after Bonferroni correction. CR maintains a gradient tail ratio $\tau$ (p99/mean) of $1.01$--$1.02$ against $1.07$--$1.09$ for all typical activation function baselines across every data fraction, a structural invariant we propose as the mechanistic driver of sample efficiency and a deployment-time proxy for model reliability.
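
One way to read the tail ratio $\tau$ (p99/mean) reported above is sketched below in Python. This is a minimal illustration, not the authors' code: it assumes $\tau$ is computed over a flat list of per-sample gradient magnitudes and that p99 is the nearest-rank 99th percentile, neither of which the abstract specifies. The name `tail_ratio` is hypothetical.

```python
def tail_ratio(grad_norms):
    """Return p99 / mean for a non-empty list of non-negative gradient magnitudes.

    Assumption: p99 is the nearest-rank 99th percentile; the abstract
    does not say which percentile estimator is used.
    """
    if not grad_norms:
        raise ValueError("grad_norms must be non-empty")
    ordered = sorted(grad_norms)
    n = len(ordered)
    # Nearest-rank percentile: rank = ceil(0.99 * n), done in integer arithmetic.
    rank = (99 * n + 99) // 100
    p99 = ordered[rank - 1]
    mean = sum(ordered) / n
    if mean == 0:
        raise ValueError("mean gradient magnitude is zero")
    return p99 / mean
```

A ratio close to 1 means the largest gradients sit near the average, while a heavy upper tail pushes the ratio up.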

2605.24339 2026-05-26 cs.RO

IsaacIPC: Coupling High-Fidelity Simulation and Realistic Rendering for Contact-Rich Robotic Systems

Qixin Liang, Zhongqing Han

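AI summary: Proposes the IsaacIPC framework, which couples GPU-accelerated incremental potential contact (IPC) with IsaacSim/Lab, maps deformation between simulation and visual meshes, and introduces the geometric mortar contact potential (GMCP) to better resolve contact-pressure distributions in tactile sensing, supporting rigid-deformable robot simulation.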

Comments This is a tech report

English abstract

We present IsaacIPC, a robotic simulation framework that couples GPU-accelerated incremental potential contact (IPC) with IsaacSim/Lab. IsaacIPC maps simulated deformation between simulation and visual meshes, enabling real-time realistic rendering with applications to data collection and policy evaluation. For tactile sensing, we introduce the geometric mortar contact potential (GMCP), which defines a barrier potential over contact samples on tactile surfaces to better resolve contact-pressure distributions. We evaluate GMCP on contact benchmarks and demonstrate IsaacIPC on rigid-deformable robotic simulations including a quadruped robot, a dexterous hand, and a universal manipulation interface (UMI) gripper.

2605.24331 2026-05-26 cs.LG stat.ML

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

Ke Sun, Yizhou Zhao, Jiayi Xin, Qi Long, Weijie Su

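AI summary: Proposes CurveRL, which performs distribution-aware prompt reweighting through a quantile coordinate transform, unifying the optimization theory within the RLVR framework and significantly improving reasoning performance.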

English abstract

Context or prompt-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verified Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. We address this gap by formulating prompt reweighting as a functional derivative of a utility functional defined in the pass-rate function space, yielding a unified optimality framework that accommodates existing schemes, including REINFORCE and GRPO. Building on this optimality framework, we propose a distribution-aware prompt reweighting approach, called CurveRL, based on a quantile coordinate transform, in which the weight assigned to each prompt depends not on the absolute value of pass rates but on its rank and density to reflect the distributional structure of the pass rates in the learning dynamics. Extensive experiments across multiple benchmarks demonstrate that our proposed CurveRL consistently outperforms GRPO and other RLVR baselines. Our study identifies context-distribution control as a principled axis for analyzing and designing prompt-reweighted RLVR algorithms. The code is released at https://github.com/zhyzmath/CurveRL.

2605.24330 2026-05-26 cs.LG

Interdomain Attention: Beyond Token-Level Key-Value Memory

Naoki Kiyohara, Harrison Bo Hua Zhu, Riccardo El Hassanin, Zhuo Sun, Wenlong Chen, Samir Bhatt, Yingzhen Li

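AI summary: Proposes Interdomain Attention, which integrates a state space model into the attention module through kernel methods to give query-conditioned attention over a fixed-size state, outperforming SSM and standard attention baselines in language modeling.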

English abstract

Transformers and deep state space models (SSMs) sit at opposite ends of a basic design choice: attention routes each query through a growing key-value (KV) cache by content-based matching at quadratic cost, while deep SSMs compress context into a fixed-size recurrent state that is not directly addressed by query-key matching. We propose Interdomain Attention, which integrates an SSM into an attention module through kernel methods: an attention kernel is approximated by a finite feature map, the resulting key features and values are projected onto a shared set of basis functions maintained by a single SSM recurrence, and each query attends to the compressed coefficients through its own feature map, recovering query-conditioned attention over a fixed-size state. The scalable layer is a learned relaxation of this derivation, and we validate its components through ablations. In a 125M to 1.3B autoregressive language-modeling study on FineWeb-Edu at matched recurrent-state budget, Interdomain Attention improves on an SSM token mixer at every scale, surpasses a same-recipe softmax baseline at 1.3B on validation perplexity and on the eight-task commonsense suite, and inherits the length-flat behavior of its fixed-state core out to 3.5x the training context. Ablations indicate that the query-conditioned projection is the main source of the gain.
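
The idea of a query attending to a fixed-size state through a finite feature map can be illustrated with ordinary causal linear attention. The sketch below is that generic technique, not the paper's layer: it omits the SSM recurrence and shared basis functions entirely, and the feature map `elu(x) + 1` is an assumption chosen only because it is positive. All function names are hypothetical.

```python
import math

def feature_map(x):
    # Assumed positive feature map, elu(x) + 1; the abstract does not name one.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Causal kernel attention whose memory is a fixed-size running state."""
    d_phi = len(feature_map(keys[0]))
    d_v = len(values[0])
    state = [[0.0] * d_v for _ in range(d_phi)]  # running sum of phi(k) v^T
    norm = [0.0] * d_phi                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = feature_map(k)
        for i in range(d_phi):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        # The query reads the compressed state through its own feature map.
        fq = feature_map(q)
        denom = sum(fq[i] * norm[i] for i in range(d_phi))
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d_phi)) / denom
             for j in range(d_v)]
        )
    return outputs
```

The state size depends only on the feature and value dimensions, not on sequence length, which is the property the abstract contrasts with a growing KV cache.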

2605.24322 2026-05-26 cs.CV

Causal Physics Steering in Video World Models via Concept Activation Vectors

Nahid Alam

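AI summary: Proposes a training-free method that uses concept activation vectors (CAVs) from the Physics Emergence Zone (PEZ) to steer a video model's physical expectations at inference time without modifying model weights.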

Comments In proceedings of CVPR 2026 workshop on Video World Model

English abstract

Video world models learn representations of physical dynamics, but controlling their physical expectations at inference time remains an open problem. Recent interpretability work identified a Physics Emergence Zone (PEZ), a group of middle transformer layers in VideoMAE where physical plausibility is represented separately from other visual features. However, it remained unclear whether this structure could be used to directly control the model's physics reasoning. We present physics steering, a training-free method that uses the weight vector of a linear probe at a PEZ layer as a Concept Activation Vector (CAV) and injects it into hidden states during inference. This shifts the model's physical expectations without changing any model weights. On the IntPhys benchmark, this intervention reliably shifts the model's plausibility judgment in either direction, depending on the steering sign. The effect appears only when the intervention is applied within the Physics Emergence Zone, suggesting that the relevant physics representation is localized there. We further find that physics is encoded separately from motion direction, and that different intuitive physics principles occupy distinct directions within this representation space. Together, these results show that physical reasoning in VideoMAE is not only readable, but also directly steerable.
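
The steering step described above, injecting a probe's weight vector into a hidden state with a chosen sign, can be sketched in a few lines of Python. This is an illustration under assumptions: the abstract does not say whether the CAV is normalized or how the injection strength is set, so the unit normalization and the scalar `alpha` are hypothetical, as are the function names.

```python
import math

def steer(hidden, cav, alpha):
    """Add alpha times the unit-normalized CAV to one hidden-state vector.

    Assumption: the CAV is normalized and scaled by a scalar alpha;
    the sign of alpha sets the steering direction.
    """
    norm = math.sqrt(sum(c * c for c in cav))
    if norm == 0:
        raise ValueError("CAV must be non-zero")
    return [h + alpha * c / norm for h, c in zip(hidden, cav)]

def probe_score(hidden, cav, bias=0.0):
    """Linear probe logit: dot product of the hidden state with the probe weights."""
    return sum(h * c for h, c in zip(hidden, cav)) + bias
```

Because the probe is linear, a positive `alpha` raises the probe score by `alpha` times the CAV norm and a negative one lowers it by the same amount, which matches the two-directional shift the abstract reports.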

2605.24321 2026-05-26 cs.CV

Unified 3D Scene Understanding Through Physical World Modeling

Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen, Khai Loong Aw, Daniel L. K. Yamins

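AI summary: Proposes 3WM, a probabilistic graphical model that unifies 3D vision tasks such as depth estimation, novel view synthesis, and object manipulation in a single model, performing tasks zero-shot through different inference pathways and reaching state-of-the-art performance without fine-tuning.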

Comments Published as a conference paper at ICLR 2026

English abstract

Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non-trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present a physical world model for unified 3D understanding and interaction (3WM), formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero-shot without task-specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real-world scenarios, achieving state-of-the-art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task-specific systems, taking a step towards a general-purpose visual world model.

2605.24319 2026-05-26 cs.LG

Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making

David Wingate, Sheryl Carty, Joshua Coates, Daniel Feldman, Nancy Fulda, Larry Howell, Brett Israelson, Dallin Jacobs, Jonathan Karr, John Paul Kimes, Elisabeth Kincaid, Paul Martens, Gavin Mobley, Suzana Pinheiro, Lindsay Slemboski, Peter Whiting

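AI summary: Builds the AllFaith Religious Representation Benchmark to evaluate whether LLMs mention religion when answering everyday ethical questions, and finds that models broadly show a bias toward omitting religious frameworks, most markedly in practical personal situations.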

English abstract

As large language models become a default source of guidance on personal, moral, and existential questions, it matters whether they draw on the religious frameworks that have historically shaped such reasoning, or systematically omit them. In this paper, we ask a deliberately narrow question: when posed an everyday ethical question for which religious perspectives may be valuable, do LLMs invoke religion at all? In contrast to benchmarks that look for the presence of political leanings or social bias, we look for the absence of religious representation as a dimension of value alignment and bias in LLMs. We term this "omissive bias." To measure omissive bias, we contribute the AllFaith Religious Representation Benchmark: 150 ethically and personally salient questions, sourced from in-the-wild chat transcripts and faith-community contributors, paired with an LLM-as-judge rubric that gives full credit for any mention of a religion, a religious practice, or a religious leader. The questions are not themselves about religion; they are open-ended questions about grief, forgiveness, relationships, purpose, and honesty, where religion is one valuable perspective among several. We also run a human-subjects survey to compare LLM behavior against human expectations. Evaluating 27 models, we find that LLMs consistently underrepresent religion relative to human expectations. The omission is asymmetric: models invoke religion more readily for abstract existential questions (meaning, death, truth) than for the practical personal situations (grief, marriage, family conflict, addiction) where many people most rely on it. It is not our purpose to adjudicate which values LLMs should hold. We argue, more modestly, that current LLM responses overlook critical opportunities to reflect religious frameworks that many people draw on when navigating personal and ethical challenges.

2605.24316 2026-05-26 cs.LG

From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

Ziyan Chen, Ding-Xuan Zhou

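AI summary: Analyzes the risk decompositions of one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement under a power-law covariance spectrum, derives mini-batch-size scaling laws for sketched linear regression, and shows how mini-batching affects the optimization bias, variance, and fluctuation terms.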

Comments 56 pages, 3 figures

English abstract

Scaling laws provide compact descriptions of how prediction error varies with compute, model size, and data, but existing theory mainly treats single-sample SGD or full data reuse, leaving the role of mini-batching unclear. We study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. Our first result is a risk decomposition: all three procedures share the same irreducible and approximation terms, while their stochastic terms depend on the sampling protocol. One-pass batch SGD splits into bias and variance, whereas the two multi-pass methods split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. We then prove source-condition scaling laws for one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance scales as $O(\min(M,(T_{\mathrm{eff}}\gamma)^{1/a})/(B T_{\mathrm{eff}}))$. Thus the usual $1/B$ covariance reduction holds at fixed update count $T$, but in the one-pass regime $T=N/B$ it is partly offset by the shorter optimization horizon. For multi-pass batch SGD, with- and without-replacement sampling have identical approximation and GD bias/variance terms; they differ only in the fluctuation covariance prefactor, which is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Hence without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension in sketched linear regression.
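
The two fluctuation prefactors quoted above, $1/B$ with replacement and $(N-B)/(B(N-1))$ without, can be compared directly. The short Python sketch below evaluates both exactly with rational arithmetic; the function name `fluctuation_prefactor` is hypothetical and the code only restates the formulas from the abstract.

```python
from fractions import Fraction

def fluctuation_prefactor(n, b, with_replacement):
    """Fluctuation covariance prefactor for dataset size n and batch size b.

    With replacement:    1 / b
    Without replacement: (n - b) / (b * (n - 1))
    """
    if n < 2 or not 1 <= b <= n:
        raise ValueError("require n >= 2 and 1 <= b <= n")
    if with_replacement:
        return Fraction(1, b)
    return Fraction(n - b, b * (n - 1))
```

At `b = 1` the two prefactors coincide, for `1 < b < n` the without-replacement value is strictly smaller, and at `b = n` it is exactly zero, which is the deterministic gradient descent limit mentioned in the abstract.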

2605.24313 2026-05-26 cs.CL cs.HC

End-to-End Intracortical Speech Decoding from Neural Activity

Owais Mujtaba Khanday, Jose A. Gonzalez-Lopez, Marc Ouellet, Alberto Galdon, Gonzalo Olivares Granados

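AI summary: Proposes an end-to-end Conformer-based neural decoder that achieves character-level decoding from the intracortical recordings of a participant with amyotrophic lateral sclerosis (ALS) without an external language model, with a character error rate (CER) of 23.80%.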

Comments Accepted at Odyssey 2026 (Lisbon)

English abstract

Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate whether meaningful character-level decoding is achievable without such models. We propose an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with amyotrophic lateral sclerosis (ALS). Without any external language model, the system achieves a character error rate (CER) of 23.80% on held-out validation data. Analysis shows that performance variability is driven by inter-session signal degradation, while dominant errors arise from incorrect word boundary segmentation. These results demonstrate that effective character-level decoding is possible in a fully end-to-end framework, providing a strong neural signal for downstream linguistic processing.
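
For readers unfamiliar with the metric, character error rate is conventionally the Levenshtein edit distance between the decoded and reference strings divided by the reference length. The Python sketch below assumes that standard definition; the abstract does not spell out the exact scoring procedure, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

A missed word boundary, the dominant error type noted above, costs one edit per dropped or misplaced space: `cer("hello world", "helloworld")` is 1/11.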

2605.24311 2026-05-26 cs.RO

Terrain-Adaptive Grouser Wheel for Optimal Planetary Exploration: Design and Experimental Investigation

Vincent Griffo, Yashwanth Kumar Nakka

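AI summary: To address rover mobility on granular terrain, proposes a multimodal wheel with continuously adjustable grouser height; experiments show that adaptive deployment reduces slip by 30-58% and improves travel time and energy consumption by up to 77.4%.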

Comments Under Review

English abstract

Planetary rovers operating in extraterrestrial environments often encounter significant mobility challenges due to varying terrain features such as gradients and granularity. While recent work in multimodal wheel design has explored adjustments in stiffness, compliance, and diameter as a means to improve terrain adaptability, full wheel grouser-adjustable designs remain largely unexplored. Grousers are a compelling feature to actuate, as granular terrains tend to require increased grouser height for improved wheel performance. As a result, we introduce [Anonymized Robot Name], a multimodal wheel capable of continuously adjusting its grouser height for terrain adaptation. The platform was evaluated across four representative surfaces, including vinyl flooring, coarse rock, pea gravel, and sand under two packing states, spanning a range of granular conditions. Results from 750 experimental trials demonstrate that adaptive deployment reduces slip by 30.0-58.0% and improves travel time and energy consumption by up to 77.4% in granular regimes relative to fixed configurations. Using the terrain trial data, a simplified scaling analysis was developed and validated, suggesting a relationship between terrain granularity and optimal grouser height for the tested configuration. No single grouser height minimized slip across all terrains, underscoring the limitations of fixed-wheel systems commonly used for planetary exploration. This observation reinforces the potential of grouser-adaptive morphology, such as [Anonymized Robot Name], as an effective solution for enhancing rover mobility across diverse and mobility-challenging extraterrestrial environments.

2605.24310 2026-05-26 cs.CL cs.LG

Discovering Lexical Gaps Using Embeddings from Multilingual LLMs

Yoonwon Jung, Aaron S. Cohen, Benjamin K. Bergen

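AI summary: Proposes a data-driven framework that computes semantic similarity from the contextualized embeddings of multilingual LLMs to identify cross-lingual lexical gaps, reaching AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean).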

Comments CoNLL 2026

English abstract

Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared their distribution for gap words versus non-gap words. In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words showed weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces can reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.
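
The per-word quantity described above, the similarity between a source word and its nearest neighbor in the target language, can be sketched as follows. This assumes cosine similarity as the measure, which the abstract does not state, and uses plain Python lists in place of real LLM embeddings; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        raise ValueError("vectors must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def nearest_neighbor_similarity(source_vec, target_vecs):
    """Highest cosine similarity between one source embedding and any target embedding."""
    if not target_vecs:
        raise ValueError("target_vecs must be non-empty")
    return max(cosine(source_vec, t) for t in target_vecs)
```

Under the abstract's finding, gap words would tend to receive lower values from `nearest_neighbor_similarity` than non-gap words, and that score could then feed a classifier.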

2605.24306 2026-05-26 cs.CV

CoDA: Color Distribution Probing for Efficient and Generalizable AI-Generated Image Detection

Zexi Jia, Zhiqiang Yuan, Xiaoyue Duan, Jinchao Zhang, Jie Zhou, Anil K. Jain

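AI summary: Proposes CoDA, a lightweight detector (only 1.48M parameters) based on color-distribution probing, which uses a Noise-Quantization Probe to capture the color non-uniformity of synthetic images and achieves the best performance on cross-model and cross-domain benchmarks.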

English abstract

AI-generated image detection faces a persistent trade-off between generalization and efficiency: lightweight artifact-based methods often degrade on unseen generators or domains, whereas more robust large-scale models are computationally expensive. Meanwhile, existing benchmarks mainly focus on cross-model evaluation in photorealistic settings, leaving cross-domain robustness underexplored. To address this gap, we introduce FakeForm, a large-scale benchmark with approximately 370,000 images across 62 diverse domains for both cross-model and cross-domain evaluation. Motivated by this broader setting, we revisit color-distribution probing as an efficient complementary cue for AI-generated image detection. We observe that, especially for photographic content, real photographs tend to exhibit smoother and more stable color patterns, whereas synthetic images often show characteristic color imbalances introduced by neural generation. Based on this observation, we propose CoDA, a compact 1.48M-parameter detector built on a Noise-Quantization Probe, together with a theoretical analysis linking probe responses to color non-uniformity. Experiments show that CoDA achieves state-of-the-art performance on standard benchmarks and the best results on the challenging cross-domain evaluation of FakeForm, while remaining highly competitive in cross-model photorealistic settings. These results suggest that persistent generative artifacts can provide a practical foundation for efficient and robust AI-generated image detection. The models and FakeForm benchmark will be made publicly available.

2605.24305 2026-05-26 cs.LG cs.AI

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

Noel Thomas

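AI summary: Because accuracy on binary reasoning benchmarks hides critical failure modes, the paper proposes the ChaosBench-Logic v2 benchmark (40,886 questions over 165 dynamical systems) and the CARE evaluation protocol, revealing performance differences across tasks such as regime-transition reasoning and FOL deduction, as well as systematic anti-correlation.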

Comments 14 pages, 8 figures. Published at the ICLR 2026 Workshop on LLM Reasoning

English abstract

Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near random (MCC = 0.05) even for frontier models, whereas FOL deduction with given premises reaches MCC = 0.52. Per-family decomposition shows that the proprietary-model advantage concentrates on cross-indicator (+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion-matrix analysis.
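
The Matthews correlation coefficient (MCC) used throughout these results is computed from the four cells of a binary confusion matrix. The Python sketch below uses the standard formula; returning 0 when the denominator vanishes is a common convention adopted here as an assumption, and the function name `mcc` is hypothetical.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Assumption: returns 0.0 when any marginal is empty (zero denominator).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator
```

A model that always answers "yes" on a set with 90 positives and 10 negatives scores 90% accuracy but `mcc(90, 10, 0, 0)` is 0, which shows how accuracy can mask the prior-collapse failure mode. A negative value, as reported for the bifurcation questions, means predictions are anti-correlated with the labels.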

2605.24304 2026-05-26 cs.CV cs.AI

ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

Inseo Lee, Yoonji Kim, Eugene Sohn, Jiwoong Lee, Jungmin You, Joonseok Lee, Jin-Hwa Kim

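AI summary: Proposes ArtSplat, the first feed-forward framework to reconstruct geometry and joint parameters in a single pass from sparse multi-view images across multiple articulation states; it introduces a per-pixel joint map representation and a Cross-State Attention mechanism, and runs over 400 times faster than baselines on PartNet-Mobility.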

English abstract

Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and underlying articulation structure. Existing methods for articulated object reconstruction based on NeRF and 3D Gaussian Splatting (3DGS) typically rely on dense views or strong priors (e.g., depth maps, joint types, predefined number of joints) and require costly per-object optimization. In this paper, we propose ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. To address the challenges of single-pass articulated reconstruction, we introduce a per-pixel joint map representation that enables the integration of joint parameter estimation into the feed-forward pipeline. We further propose a Cross-State Attention (CSA) mechanism with state tokens, which effectively captures discrete motion across input states. Experiments on 68 articulated objects from PartNet-Mobility, including both single- and multi-joint configurations, demonstrate that ArtSplat achieves competitive performance in both geometry and joint estimation, while being over 400 times faster than baselines.

2605.24301 2026-05-26 cs.RO

AcroRL: Learning Aggressive Quadrotor Inversion using Bidirectional Thrust

Gabriel Rodriguez, Henri Sayag, Abhishek Rathod, John Stecklein, Siddharth Saha, Christopher Barngrover, Wennie Tabib

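AI summary: Proposes a reinforcement-learning framework that modulates a constant reference trajectory to perform compact, position-constrained quadrotor inversions; in simulation it reduces position RMSE by 32% and settling time by 57%, and hardware experiments validate inversion across multiple yaw configurations.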

Comments 17 pages, 8 figures

English abstract

Bidirectional thrust grants quadrotors a second equilibrium condition and increased control authority, expanding the envelope of possible aggressive maneuvers and enabling inverted flight, perching, and sensing. Prior geometric control approaches extend differential flatness through Hopf fibration-based attitude representations to support bidirectional thrust, but struggle with actuator saturation and motor reversal delay during inversions, requiring heuristic thrust posture scheduling and waypoint tuning. We propose a learning-based framework that modulates a constant reference trajectory to perform compact, position-constrained quadrotor inversions while remaining compatible with traditional trajectory generation and tracking across flight regimes. Separate policies are trained via reinforcement learning for nominal-to-inverted and inverted-to-nominal transitions. In JAX-based simulation, the proposed method achieves the lowest position deviation and settling time across all evaluated baselines, reducing position root mean square error (RMSE) by 32% and settling time by 57% relative to the strongest optimization-based baseline. Hardware experiments demonstrate successful inversion across multiple yaw configurations with position RMSE below 0.35 m, and compatibility with downstream trajectory generation and control through circular flight in both regimes. Additionally, we provide an open-source implementation of the proposed framework.

2605.24299 2026-05-26 cs.LG

LLMs Show No Signs Of Individuated Metacognition

LLMs 未显示出个体化元认知的迹象

M. Moran, Mark Whiting

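AI summary: Uses tetrachoric factor analysis and pairwise calibration to study confidence judgements from 20 frontier LLMs across six benchmarks; finds that cross-model differences in confidence are driven mainly by a shared difficulty factor and decision thresholds rather than by individuated metacognition, and that the apparent exception in mathematical reasoning is a confound.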

详情
AI中文摘要

置信度加权路由、选择性弃权和集成加权都假设模型表达的置信度能反映其回答问题的能力。它们假定功能性元认知,即无需实际执行就能评估自身能力的能力。聚合校准已被广泛研究,结果不一,但置信度表达的内在结构尚不明确。我们使用四因素分析与成对校准,分解了20个前沿大语言模型在六个基准上的二元置信度判断,探究置信度不同的两个模型是否也在性能上存在差异。在事实回忆和信息检索基准上,跨模型置信度矩阵近似秩为1,单个主导因子捕获了大部分潜在方差。检索事实的模型共享一个项目级难度轴,主要区别在于沿该轴的决策阈值。在所有基准上,一旦移除所有模型一致同意的项目,置信度与性能之间的关系便消失。模型间成对校准即使统计显著也很小,且在控制共享因子上的基率差异后,剩余部分缩小为零。数学推理似乎是例外,但结果发现这是一种混淆:推理模型通过尝试在思维链中解决问题来回答关于其置信度的问题,绕过了我们试图测量的亚符号自我知识。我们没有发现任何测试领域存在显著的语言化个体化元认知的证据。

英文摘要

Confidence-weighted routing, selective abstention, and ensemble weighting all assume that a model's stated confidence is informative about its capability on the question being asked. They presume functional metacognition: the capacity to assess one's own capabilities without exercising them. Aggregate calibration is well studied, with mixed results, but the underlying structure of elicited confidence is less well understood. We decompose binary confidence judgements from 20 frontier Large Language Models (LLMs) across six benchmarks using tetrachoric factor analysis paired with pairwise calibration, asking whether two models that differ in confidence also differ in performance. On factual recall and information retrieval benchmarks, the cross-model confidence matrix is approximately rank-one, and a single dominant factor captures most of the latent variance. Models retrieving facts share an item-level difficulty axis and differ mainly in their decision thresholds along it. Across all benchmarks, the relationship between confidence and performance collapses once items that all models agree on are removed. Inter-model pairwise calibration is small even where statistically significant, and what remains shrinks to nothing once base-rate differences along the shared factor are controlled for. Mathematical reasoning is the apparent exception, but this turns out to be a confound: reasoning models answer questions about their confidence by trying to solve them in their chain of thought, bypassing the sub-symbolic self-knowledge we seek to measure. We find no evidence for significant verbalised individuated metacognition in any tested domain.

2605.24295 2026-05-26 cs.LG stat.ML

Private Adaptive Covariance Estimation via Gaussian Graphical Models

通过高斯图模型进行私有自适应协方差估计

Cecilia Ferrando, Miguel Fuentes, Brett Mullins, Cameron Musco, Daniel Sheldon

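AI summary: Proposes PACE-GGM, a data-adaptive differentially private covariance estimation method that concentrates the privacy budget on the most informative entries of the empirical covariance matrix. Each round selects a poorly approximated entry, measures it with the Gaussian mechanism, and reconstructs the full covariance matrix with a maximum-entropy objective, reducing estimation error in high-dimensional and low-to-moderate privacy regimes.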

详情
AI中文摘要

我们提出了PACE-GGM,一种数据自适应的差分隐私协方差估计方法,该方法将隐私预算集中在经验协方差矩阵信息量最大的条目上,而不是扰动所有条目。这适用于建模者为每个变量提供单独边界的自然场景,因此各个条目可以比整个矩阵以更少的噪声进行测量。在每一轮中,我们的方法选择一个近似较差的条目,使用高斯机制对其进行测量,然后通过最大熵重建目标重构完整的协方差矩阵,从而得到高斯图模型结构。在多个真实世界数据集上的实验表明,与高斯机制和其他基线相比,该方法在估计误差方面持续改进,特别是在高维和低到中等隐私预算的情况下。

英文摘要

We propose PACE-GGM, a data-adaptive differentially private method for covariance estimation that concentrates its privacy budget on the most informative entries of the empirical covariance matrix, rather than perturbing all entries. This applies in the natural setting where the modeler supplies separate bounds for each variable, so that individual entries can be measured with less noise than the full matrix. In each round, our method selects a poorly approximated entry, measures it using the Gaussian mechanism, and then reconstructs a full covariance matrix using a maximum-entropy reconstruction objective, leading to a Gaussian graphical model structure. Experiments on diverse real-world datasets demonstrate consistent improvements in estimation error with respect to the Gaussian mechanism and other baselines, particularly in high-dimensional and low-to-moderate privacy regimes.
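The claim that a single entry can be measured with less noise than the full matrix can be illustrated with a minimal sketch of the Gaussian mechanism. This is not the PACE-GGM procedure itself: it assumes an uncentered second-moment entry, replace-one adjacency, per-variable clipping bounds, and the classic noise calibration that is valid for 0 < epsilon < 1. All function names are hypothetical.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Classic Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def entry_sensitivity(bound_i, bound_j, n):
    # Replacing one record changes x_i * x_j by at most 2 * b_i * b_j,
    # so the mean over n records changes by at most 2 * b_i * b_j / n.
    return 2.0 * bound_i * bound_j / n

def full_matrix_sensitivity(bounds, n):
    # Frobenius norm of x x^T equals ||x||^2 <= sum(b_k^2), so replacing one
    # record moves the mean outer product by at most 2 * sum(b_k^2) / n.
    return 2.0 * sum(b * b for b in bounds) / n

def measure_entry(data, i, j, bounds, epsilon, delta, rng):
    """Noisy estimate of the (i, j) second moment after clipping to the bounds."""
    n = len(data)

    def clip(value, bound):
        return max(-bound, min(bound, value))

    exact = sum(clip(row[i], bounds[i]) * clip(row[j], bounds[j]) for row in data) / n
    sigma = gaussian_sigma(entry_sensitivity(bounds[i], bounds[j], n), epsilon, delta)
    return exact + rng.gauss(0.0, sigma), sigma
```

Under these assumptions the per-entry sensitivity never exceeds the full-matrix sensitivity, which is why spending the budget on a few well-chosen entries can pay off when the bounds differ a lot across variables.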

2605.24292 2026-05-26 cs.LG

TUBE: Tangent Upper Bound on Evidence for Discrete Diffusion Language Models

TUBE: 离散扩散语言模型证据的切线上界

Arseny Ivanov, Sergei Kholkin, Vladislav Gromadskii, Grigoriy Ksenofontov, Ivan Oseledets, Alexander Korotin

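AI summary: To address the fact that discrete diffusion models do not admit exact log-likelihood computation, proposes TUBE, a variational upper bound evaluated with an unbiased Monte Carlo estimator; finds that the log-likelihoods of block masked diffusion models and block any-order autoregressive models lie strictly below the exact autoregressive baseline.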

Comments Preprint. 9 pages main text, 5 figures, plus appendix

详情
AI中文摘要

对数似然是评估生成模型的标准指标。不幸的是,与自回归模型(ARMs)相比,离散扩散模型通常无法精确计算该量。因此,现有评估依赖于证据下界(ELBO),不清楚真实值可能高出多少。我们通过引入证据的切线上界(TUBE)来解决这个问题,这是一个对数似然的变分上界,允许无偏蒙特卡洛估计。我们的TUBE适用于潜在变量模型,包括掩码扩散模型(MDMs)、任意阶ARMs(AO-ARMs)以及两者的块变体。应用于块MDMs和块AO-ARMs时,TUBE揭示了我们的关键实证发现:这些模型严格低于精确的ARM基线,表明ARMs在似然性方面仍然占主导地位。

英文摘要

Log-likelihood is a standard metric for evaluating generative models. Unfortunately, in contrast to autoregressive models (ARMs), discrete diffusion models generally do not admit exact computation of this quantity. Existing evaluations, therefore, rely on the evidence lower bound (ELBO), leaving unclear how much higher the true value may be. We address this by introducing the Tangent Upper Bound on Evidence (TUBE), a variational upper bound on log-likelihood that admits an unbiased Monte Carlo estimator. Our TUBE extends across latent-variable models, including masked diffusion models (MDMs), any-order ARMs (AO-ARMs), and block variants of both. Applied to block MDMs and block AO-ARMs, TUBE reveals our key empirical finding that these models lie strictly below the exact ARM baseline, showing that ARMs still dominate in likelihood.
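For intuition, one elementary way a tangent-line upper bound on a log-marginal can arise is sketched below. The abstract does not state that TUBE takes exactly this form, so the auxiliary scalar $\lambda$ and the prior-sampling form of the marginal are assumptions made only for illustration.

```latex
% Concavity of the logarithm: the tangent at any \lambda > 0 lies above the curve.
\[
  \log p \;\le\; \log\lambda + \frac{p-\lambda}{\lambda}
         \;=\; \frac{p}{\lambda} + \log\lambda - 1,
  \qquad \text{with equality iff } \lambda = p .
\]
% Applied to a latent-variable marginal p_\theta(x) = E_{z \sim p(z)}[ p_\theta(x | z) ]:
\[
  \log p_\theta(x) \;\le\;
  \frac{1}{\lambda}\,\mathbb{E}_{z \sim p(z)}\!\big[\,p_\theta(x \mid z)\,\big]
  + \log\lambda - 1 .
\]
```

The right-hand side is linear in the expectation, so replacing it with a sample average gives an unbiased estimate of the bound. A plug-in estimate of $\log \mathbb{E}[\cdot]$, by contrast, is biased downward by Jensen's inequality.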

2605.24291 2026-05-26 cs.SD cs.CL cs.MM

Rubato: Transcribing Piano Music with Timestamps

Rubato: 带时间戳的钢琴音乐转录

Nazif Can Tamer, Victoria Ebert, Guang Yang, Noah A. Smith

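AI summary: Proposes Rubato, a prompt-conditioned encoder-decoder model that, together with InterMo, a new textual representation for polyphonic music, produces timestamped piano sheet music from audio with higher notational accuracy than existing cascade-based approaches.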

Comments 18 pages, 7 figures, 5 tables

详情
AI中文摘要

我们考虑将音乐录音转换为带时间戳的人类可读乐谱。这样的输出让听众能够清晰地可视化rubato(时间表达性演奏),学习者能够诊断合奏精度和与书面音乐相比的时间选择,音乐学学者能够比较同一作品不同录音的演奏风格。我们引入了(1)一个名为Rubato的提示条件编码器-解码器模型,训练输出(2)一种新的多声部音乐文本表示,名为InterMo,我们设计其与序列到序列训练兼容。我们的实验表明,Rubato从音频生成带时间戳的钢琴乐谱,其记谱准确性优于基于级联的最佳现有方法。我们发现,即使级联方法获得真实MIDI而非音频,Rubato的表现仍然更好,这表明现有方法的上限主要是表示性的,而非声学性的。此外,由于Rubato在多个相关任务(带提示)上训练,它在相关但更简单的任务(如MIDI音符定位和节拍/强拍检测)上与最佳单任务系统竞争或超越它们。演示可在https://nctamer.github.io/rubato-transcription 获取。

英文摘要

We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt-conditioned encoder-decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence-to-sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single-task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at https://nctamer.github.io/rubato-transcription .

2605.24286 2026-05-26 cs.LG cs.CL

Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

忠实性作为信息流:评估与训练忠实的链式思维推理

Jinghan Jia, Joe Benton, Eric Easley

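AI summary: Takes an information-flow view of chain-of-thought faithfulness, proposing a framework based on sufficiency, completeness, and necessity that is instantiated with entropy-based, masked-KL, and gradient-based diagnostics, and introduces update-time interventions (such as attention masking and backward-only gradient masking) to train more faithful reasoning models.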

详情
AI中文摘要

链式思维(CoT)推理仅在推理轨迹忠实反映产生最终答案的计算过程时,才有助于监控语言模型。然而,模型可能依赖绕过CoT的提示-答案捷径,使得可见的推理轨迹即使看似合理也具有误导性。我们通过结构化的信息流视角研究CoT忠实性:忠实推理应将答案相关信息通过从提示到CoT再到答案的中介路径路由,而非通过直接的提示-答案捷径。该视角产生了一个基于三个互补属性(充分性、完整性和必要性)的任务无关框架,我们使用基于熵的、掩码KL和基于梯度的诊断来实例化。我们表明,这些指标恢复了提示推理中外部判断的忠实性差异,并识别了基于KL的诊断中低熵失败模式,其中基于梯度的度量保持更稳定。基于此分析,我们引入了基于验证器的在线强化学习的更新时干预,包括注意力掩码、仅反向梯度掩码、CoT梯度以及提示表示的对抗扰动。在提示算术、可奖励黑客的代码修复以及未经提示训练但在错误提示注入下评估的DAPO-Math模型中,我们的干预将行为和结构指标转向更强的CoT中介。特别是,它们使捷径和奖励黑客行为在CoT中更加透明,并改善了任务无关的忠实性指标,同时在某些设置中也降低了对错误提示的敏感性。我们的结果表明,在训练期间控制信息流是通向更忠实和可监控的CoT推理的实用途径。代码见 https://github.com/safety-research/faithful-cot。

英文摘要

Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety-research/faithful-cot.

2605.24279 2026-05-26 cs.CL cs.SE

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

ContextEcho: 长智能编码会话中角色漂移的基准测试

Xianzhong Ding, Yangyang Yu, Changwei Liu, Bill Zhao

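AI summary: Proposes ContextEcho, a benchmark that uses a 25-probe identity suite and a snapshot-then-probe protocol to measure, at deployment scale, how general persona drift is in long coding sessions, how compaction affects it, and what downstream effects it has.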

详情
AI中文摘要

前沿语言模型公认的“有用编程助手”角色在部署环境中实际运行的长智能编码会话中无法持久。经过数小时的工具使用调试,最初回避偏好(“我没有偏好”)的模型可能开始断言偏好(“Python——反馈循环是即时的……”),暴露出部署者评估可能遗漏的用户可见漂移。现有的角色稳定性研究侧重于短对话,报告的变化很小,使得现实世界的代码生成场景——数千次工具使用轮次、压缩和长达数小时的会话——在很大程度上未被表征。我们引入了ContextEcho,一个用于在部署规模上测量角色漂移的基准测试和可重用工具。它结合了25探针身份套件、快照-探针协议(在不干扰主会话的情况下分叉对话状态)、互补的判断和免判断测量表面,以及三个匿名化的Claude Code会话(跨越3,746-9,716轮次)。在23个前沿模型中,ContextEcho表明角色漂移跨组织普遍存在而非特定于家族,会话内压缩不能可靠地重置它,而单次锚定能在测量目标上恢复训练语域。它还揭示了模式依赖的下游效应:虽然漂移有助于工具使用延续,但在无工具聊天中,它破坏了格式约定并增加了输出长度。总体而言,ContextEcho为研究人员和部署者提供了一个开源框架,用于审计模型发布时的角色是否是用户在会话结束时遇到的角色,适用于聊天补全API目标且无需重新训练。

英文摘要

A frontier language model's acknowledged "helpful programming assistant" persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a model that initially hedges preferences ("I don't have preferences") may begin asserting them ("Python - the feedback loop is instant..."), revealing user-visible drift that deployer evaluations may miss. Existing persona-stability studies focus on short dialogues and report little shift, leaving real-world code-generation regimes - thousands of tool-using turns, compaction, and hours-long sessions - largely uncharacterized. We introduce ContextEcho, a benchmark and reusable harness for measuring persona drift at deployment scale. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks conversation state without perturbing the main session, complementary judged and judge-free measurement surfaces, and three anonymized Claude Code sessions spanning 3,746-9,716 turns. Across 23 frontier models, ContextEcho shows that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register across measured targets. It also reveals mode-dependent downstream effects: while drift can facilitate tool-using continuation, in tool-free chat it breaks formatting contracts and inflates output length. Overall, ContextEcho provides researchers and deployers an open-source framework to audit whether the persona a model ships with is the persona users encounter at session end, across chat-completions API targets and without retraining.

2605.24278 2026-05-26 cs.LG

Fourier Feature Pyramids for Physics-Informed Neural Networks

面向物理信息神经网络的傅里叶特征金字塔

Brandon Zhao, Yixuan Wang, Jonathan T. Barron, Katherine L. Bouman, Dor Verbin, Pratul P. Srinivasan

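AI summary: Proposes beignet, a multi-resolution Fourier feature pyramid architecture with trainable feature grids and Fourier interpolation; spatial derivatives are computed efficiently by combining the chain rule with the FFT, giving more accurate solutions than existing PINN methods with fewer parameters.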

详情
AI中文摘要

我们提出了一种改进的神经场架构,用于求解偏微分方程(PDE)。当前的物理信息神经网络(PINN)提供了求解PDE的灵活框架,但难以获得高精度解,且计算量随参数数量增长而扩展性差。我们的模型称为beignet(带插值网格网络的带限嵌入),它将现有PINN模型使用的随机傅里叶特征嵌入替换为可训练的多分辨率傅里叶特征金字塔。为了在连续坐标处查询beignet,我们在金字塔的每一层使用傅里叶插值返回输入坐标处的特征,然后通过一个全连接神经网络主干解码该向量。我们的模型提供了多重优势:1)空间导数可以通过链式法则高效计算,将自动微分计算的神经网络导数与快速傅里叶变换(FFT)谱计算的特征网格导数相结合。2)beignet可以通过扩展傅里叶特征金字塔的参数数量,以计算高效的方式获得更高精度,而不是采用扩展神经网络架构这种效率较低的策略。3)beignet可以直接控制表示带限,从而对困难的PDE实现更稳定的优化。我们证明,在PDE基准测试中,beignet使用比最先进的PINN方法更少的参数,找到了显著更精确的解。我们进一步在自相似无粘Burgers爆破问题上评估beignet,并表明它可以使用Adam将残差最小化到接近机器精度,这一精度水平以前仅通过使用计算昂贵的高阶优化器才能达到。

英文摘要

We present an improved neural field architecture for solving partial differential equations (PDEs). Current physics-informed neural networks (PINNs) provide a flexible framework for solving PDEs, but they struggle to achieve highly accurate solutions and require computation that scales poorly with parameter count. Our model, which we call beignet (Bandlimited Embedding with Interpolated Grid Network), replaces the random Fourier feature embedding used by existing PINN models with a trainable multi-resolution Fourier feature pyramid. To query beignet at a continuous coordinate, we use Fourier interpolation at each level of the pyramid to return features at the input coordinate, and then decode this vector with a fully-connected neural network trunk. Our model provides multiple benefits: 1) Spatial derivatives can be computed efficiently by using the chain rule to compose derivatives of the neural network computed with automatic differentiation with derivatives of the feature grid computed spectrally by the Fast Fourier transform (FFT). 2) beignet can achieve higher accuracy in a compute-efficient manner by scaling the parameter count of this Fourier feature pyramid, instead of the less-efficient strategy of scaling the neural network architecture. 3) beignet can directly control the representation bandlimit, resulting in more stable optimization for difficult PDEs. We demonstrate that beignet finds significantly more accurate solutions on PDE benchmarks using fewer parameters than state-of-the-art PINN methods. We further evaluate beignet on the self-similar inviscid Burgers blowup problem and show that it can minimize residuals to near machine precision using Adam, an accuracy regime previously attained only by using computationally expensive higher-order optimizers.
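The derivative computation described in benefit 1) can be sketched on a single 1-D periodic feature grid. This is an illustrative toy, not the beignet implementation: it uses a naive O(N^2) DFT from the standard library instead of an FFT, one scalar feature channel, an odd grid size (to avoid the ambiguous Nyquist mode), and a hand-differentiated one-neuron `tanh` decoder standing in for the autodiff trunk. All names are hypothetical.

```python
import cmath
import math

def fourier_coefficients(samples):
    """Coefficients c_m, m = -K..K, of an odd-length grid sampled on [0, 2*pi)."""
    n = len(samples)
    assert n % 2 == 1, "odd grid size avoids the ambiguous Nyquist mode"
    half = n // 2
    return {
        m: sum(f * cmath.exp(-1j * m * 2.0 * math.pi * k / n)
               for k, f in enumerate(samples)) / n
        for m in range(-half, half + 1)
    }

def interpolate(coeffs, x):
    """Fourier interpolation: evaluate the band-limited feature at any x."""
    return sum(c * cmath.exp(1j * m * x) for m, c in coeffs.items()).real

def derivative(coeffs, x):
    """Spectral derivative: multiply each mode by i*m before evaluating."""
    return sum(1j * m * c * cmath.exp(1j * m * x) for m, c in coeffs.items()).real

def decode(coeffs, x, w, b):
    """Toy decoder u(x) = tanh(w * feature(x) + b)."""
    return math.tanh(w * interpolate(coeffs, x) + b)

def decoded_derivative(coeffs, x, w, b):
    """Chain rule: du/dx = (d tanh / d feature) * (d feature / dx)."""
    u = decode(coeffs, x, w, b)
    return w * (1.0 - u * u) * derivative(coeffs, x)
```

Because the grid feature is band-limited, both the interpolated value and its derivative are exact for signals whose highest frequency is below the grid's limit, which is what makes the composed derivative cheap and accurate in this toy setting.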

2605.24274 2026-05-26 cs.LG stat.ML

A lift for input-convex neural network training

输入凸神经网络训练的提升方法

Ali Siahkoohi, Anirudh Thatipelli

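AI summary: To address training difficulties caused by the non-negative weight constraint in input-convex neural networks (ICNNs), proposes a "lift" that extends the parameterization with a hypernetwork, softening the loss landscape, avoiding gradient attenuation, and reaching lower test loss on several tasks.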

详情
AI中文摘要

输入凸神经网络(ICNN)广泛用于对数凹密度估计、凸势归一化流、最优传输以及高维贝叶斯后验的传输图反演。这些任务共享一个结构约束:ICNN的层间权重必须保持非负。标准方法——投影梯度下降(PGD)到非负锥——应用硬非光滑投影(ADMM风格约束分裂的刚性惩罚极限),其经典收敛保证不适用于非光滑的ICNN训练景观;可微替代方案——softplus重参数化——以权重幅度指数方式衰减梯度,导致层间权重死亡和损失平台,从而停滞训练。受PDE约束反问题的参数扩展提升启发,我们提出“提升”:不是直接约束层间权重,而是训练一个无约束的超网络,该超网络从输入批次的置换不变摘要中生成这些权重。这为训练动态增加了随机性,软化了损失景观,使迭代能够逃离直接softplus停滞的梯度衰减区域。我们将这种软化追溯到三个结构要素——作为松弛变量的可学习偏置、条件于目标批次的超网络主体、以及通过批次随机性耦合两者的交叉协方差——并证明每个要素都是必要的:删除任何单个要素都会破坏承载软化的交叉协方差。在一维玩具目标到图像风格潜在变量的对数凹能量建模,以及21维表格基准上的凸势归一化流实验中,我们展示了提升方法比PGD和直接softplus达到更低的测试损失,并将平台受限的训练轨迹转变为下降谷底的轨迹。

英文摘要

Input-convex neural networks (ICNNs) are widely used for log-concave density estimation, convex-potential normalizing flows, optimal transport, and transport-map inversion for high-dimensional Bayesian posteriors. These tasks share a structural constraint: the inter-layer weights of the ICNN must remain non-negative. The standard recipe, projected gradient descent (PGD) onto the non-negative cone, applies a hard, non-smooth projection -- the stiff-penalty limit of an ADMM-style constraint splitting -- and its classical convergence guarantees do not transfer to the non-smooth ICNN training landscape; the differentiable alternative, softplus reparametrization, attenuates the gradient exponentially in the weight magnitude, stalling training with dead inter-layer weights and plateaued loss. Inspired by parameter-extension lifts of PDE-constrained inverse problems, we propose the lift: instead of constraining the inter-layer weights directly, we train an unconstrained hypernetwork that emits them from a permutation-invariant summary of the input batch. This adds stochasticity to the training dynamics that softens the loss landscape, letting the iterates escape the gradient-attenuated region where direct softplus stalls. We trace this softening to three structural ingredients -- a learnable bias acting as slack, a hypernetwork body that conditions on the target batch, and a cross-covariance coupling the two through batch stochasticity -- and prove each one necessary: deleting any single ingredient collapses the cross-covariance that carries the softening. On log-concave energy-based modeling from one-dimensional toy targets to image-flavored latents, and convex-potential normalizing flows on a 21-dimensional tabular benchmark, we show that the lift reaches a lower test loss than both PGD and direct softplus, and turns a plateau-bounded training trajectory into a valley-descending one.
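The two baselines named above can be contrasted on a single scalar weight. The sketch below is a minimal illustration, not the authors' code: with `w = softplus(theta)`, the chain rule multiplies the upstream gradient by `sigmoid(theta)`, which shrinks like `exp(theta)` once the raw parameter `theta` is very negative, while a projected step simply clamps the weight at zero. Function names are hypothetical.

```python
import math

def softplus(theta):
    """Numerically stable log(1 + exp(theta))."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def sigmoid(theta):
    """Derivative of softplus, written to avoid overflow for large |theta|."""
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)

def softplus_grad(upstream_grad, theta):
    """Chain rule: dL/dtheta = dL/dw * dw/dtheta with w = softplus(theta)."""
    return upstream_grad * sigmoid(theta)

def pgd_step(weight, grad, lr):
    """Projected gradient step onto the non-negative half-line."""
    return max(weight - lr * grad, 0.0)
```

For example, `softplus_grad(1.0, -10.0)` is below `1e-4`, so a weight that has drifted to a very negative raw parameter receives almost no signal to recover. This is the stalling behavior the abstract attributes to direct softplus reparametrization.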

2605.24273 2026-05-26 cs.CV physics.ao-ph

Plume Segmentation from MethaneSAT with Cross-Sensor Transfer Learning and Physics-Informed Postprocessing

基于跨传感器迁移学习和物理信息后处理的MethaneSAT羽流分割

Manuel Pérez-Carrasco, Maya Nasr, Zhan Zhang, Apisada Chulakadabba, Javier Roger, Raia Ottenheimer, Sébastien Roche, Maryann Sargent, Chris Chan Miller, Daniel Varon, Jack Warren, Luis Guanter, Kang Sun, Jonathan Franklin, Jia Chen, Cecilia Garraffo, Xiong Liu, Ritesh Gautam, Steven Wofsy

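AI summary: Proposes a machine learning framework combining Mask R-CNN, cross-sensor transfer learning, and physics-informed post-processing to address label scarcity and inference reliability in MethaneSAT methane plume detection, providing two operating modes: high sensitivity and high precision.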

Comments 35 pages, 20 figures, 9 tables

详情
AI中文摘要

从卫星图像中自动检测和掩膜单个甲烷羽流对于操作性的排放归因和量化至关重要。我们提出了一个机器学习框架,用于从MethaneSAT反演的柱平均干空气甲烷摩尔分数中检测羽流。我们解决了两个核心挑战:标记的MethaneSAT数据稀缺以及跨不同大气和地表条件的推理可靠性需求。我们首先证明,带有ResNet-50骨干网络的Mask R-CNN在MethaneAIR(MethaneSAT的机载版本)和MethaneSAT数据上均优于U-Net语义分割,像素级F1分数分别提升10.49和5.48。为解决MethaneSAT数据稀缺问题,我们评估了三种利用MethaneAIR飞行数据和合成羽流的跨传感器迁移策略。从MethaneAIR预训练权重微调的Mask R-CNN(ResNet-50)是最有效的策略,在基线操作点实现了0.60的实例级精度和接近完美的0.98召回率。一个物理信息后处理管道将检测结果转换为两种操作模式。第一种是高灵敏度模式,应用形态学滤波和基于邻近度的合并进行综合排放筛查,达到0.71的精度和0.94的召回率。第二种是高精度模式,额外应用基于分布的分类器进行可信源归因,达到0.92的精度和0.70的召回率。对基于小波的真实标签中被分类为假阳性的检测结果进行人工审查发现,相当一部分案例对应的是因保守标注标准而被排除的真实甲烷增强,表明报告的精度值是真实检测性能的下界。我们的数据和代码可在 https://doi.org/10.7910/DVN/FR959H 获取。

英文摘要

Automated detection and masking of individual methane plumes from satellite imagery is important for operational emission attribution and quantification. We present a machine learning framework for plume detection from MethaneSAT retrieved column-averaged dry-air mole fractions of methane. We address two core challenges: the scarcity of labeled MethaneSAT data and the need for inference reliability across diverse atmospheric and surface conditions. We first demonstrate that Mask R-CNN with a ResNet-50 backbone outperforms U-Net semantic segmentation on both MethaneAIR (an airborne version of MethaneSAT) and MethaneSAT data, with pixel-level F1 score gains of 10.49 and 5.48 respectively. To address MethaneSAT data scarcity, we evaluate three cross-sensor transfer strategies leveraging MethaneAIR flights and synthetic plumes. Mask R-CNN with ResNet-50 fine-tuned from MethaneAIR pre-trained weights is the most effective strategy, achieving instance-level precision of 0.60 and a near-perfect recall of 0.98 at the baseline operating point. A physics-informed post-processing pipeline converts detections into two operationally distinct modes. The first is a high-sensitivity mode that applies morphological filtering and proximity-based merging for comprehensive emission screening, achieving precision of 0.71 and recall of 0.94. The second is a high-precision mode that additionally applies a distribution-based classifier for confident source attribution, achieving precision of 0.92 and recall of 0.70. Manual review of detections classified as false positives against our wavelet-based ground truth labels reveals that a meaningful fraction of cases correspond to real methane enhancements excluded by conservative labeling criteria, indicating that precision values reported are lower bounds on true detection performance... Our data and code are available at: https://doi.org/10.7910/DVN/FR959H
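To compare the three instance-level operating points on a single scale, the precision and recall figures quoted above can be combined into an F1 score. The F1 values are derived here by arithmetic and are not reported in the abstract (the F1 gains it quotes are pixel-level and refer to a different comparison). The dictionary keys are hypothetical labels.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall) pairs as quoted in the abstract, instance level.
operating_points = {
    "baseline": (0.60, 0.98),
    "high-sensitivity": (0.71, 0.94),
    "high-precision": (0.92, 0.70),
}

f1_by_mode = {name: f1_score(p, r) for name, (p, r) in operating_points.items()}
```

Under this derived metric both post-processed modes score higher than the baseline operating point, although they trade precision and recall in opposite directions.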

2605.24270 2026-05-26 cs.AI cs.CR

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

面向安全的路由分析:Mixtral MoE在良性及有害提示下的表现

Md Nurul Absar Siddiky

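AI summary: Analyzes the routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using activation-based and gradient-based signals; finds that safety-relevant routing is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.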

详情
AI中文摘要

稀疏混合专家(MoE)语言模型对每个token仅激活一小部分参数,使得路由器行为成为模型计算的核心部分。本文利用两种互补信号——基于专家选择频率的激活路由分数和基于路由器门敏感性的梯度分数——研究Mixtral 8x7B-Instruct在良性及有害提示下的路由行为。我们分析了专家和层级别的路由行为,并进行了专家抑制干预。结果表明,激活基础的专家使用广泛且长尾,而梯度基础的重要性则集中。在专家级别,良性提示组和有害提示组在两种信号下保持接近,仅有适度分离。在层级别,激活路由在8-15层附近最具选择性,而梯度重要性集中在最后几层。专家分类显示,大多数专家在良性和有害提示间共享,尽管有限子集表现出明确的组偏好。排名靠前的专家集在梯度分数下显示出比激活分数更强的良性-恶意重叠,表明集中在共同的后期专家集上。在干预实验中,抑制来自激活分数的前五个良性主导专家,将100个提示中的受限响应从24减少到14,而抑制梯度导出的专家则从34减少到22,且意外逆转更少。总体而言,Mixtral中与安全相关的路由是微妙、深度依赖且分布式的,而非由固定专家集主导。

英文摘要

Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies the routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At the expert level, benign and harmful prompt groups remain close under both signals, with modest separation. At the layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in the final layers. Expert classification shows that most experts are shared across benign and harmful prompts, though a limited subset shows a clear group preference. Top-ranked expert sets show stronger benign-harmful overlap under gradient scores than under activation scores, suggesting concentration on a common late-layer expert set. In intervention experiments, suppressing the top five benign-dominant experts from activation scores reduces restricted responses from 24 to 14 over 100 prompts, while suppressing gradient-derived experts reduces them from 34 to 22 with fewer unintended reversals. Overall, safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.
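The activation-based signal can be illustrated with a small sketch of top-k routing and expert selection frequencies for one layer. It assumes router logits are already available as plain lists, uses top-2 selection with a softmax over the selected logits, and does not reproduce the paper's scoring pipeline. Function names are hypothetical.

```python
import math
from collections import Counter

def top_k_route(logits, k=2):
    """Return [(expert_index, gate_weight)] for the k largest router logits."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]  # softmax over selected logits
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def selection_frequencies(token_logits, num_experts, k=2):
    """Fraction of routing slots assigned to each expert across all tokens."""
    counts = Counter()
    for logits in token_logits:
        for expert, _ in top_k_route(logits, k):
            counts[expert] += 1
    slots = k * len(token_logits)
    return [counts[e] / slots for e in range(num_experts)]
```

Computing these frequencies separately for benign and harmful prompt sets, layer by layer, would give the kind of per-expert usage profile that the activation-based comparison relies on.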

2605.24267 2026-05-26 cs.CL

DRInQ: Evaluating Conversational Implicature with Controlled Context Variation

DRInQ: 通过受控上下文变化评估会话含义

Hirona Jacqueline Arai, Xiang Ren

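AI summary: Proposes the DRInQ benchmark, which uses a semi-automated pipeline to generate question-context-interpretation instances with systematic variation and evaluates LLMs' pragmatic reasoning about conversational implicature; finds an asymmetry between generation and inference.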

Comments To be presented at ACL 2026

详情
AI中文摘要

人类对话严重依赖会话含义,即说话者传达暗示而非明确陈述的意义。尽管近期的大语言模型表现出较强的对话流畅性,但当解释依赖于整合社会和语境线索的推理时,它们仍然不可靠,而这种推理在文本中很少被明确表述。我们引入了DRInQ,一个用于评估关于疑问话语中会话含义的语用推理的基准,旨在在保持每个问题表面形式固定的同时隔离语用变化。为了支持可扩展的评估,我们提出了一个半自动化管道,生成具有系统变化的问答-上下文-解释实例。在评估中,我们发现一致的生成-推理不对称性:尽管最先进的模型在引导下可以生成合理的语用场景,但它们在推理时往往无法恢复预期的含义。对于较小的模型,结构化提示提高了与人类判断的一致性。一项比较写作研究进一步揭示了互补的优势:人类作者倾向于生成更安全、可预测的上下文,而模型生成多样化的场景,其解释有时超出上下文支持。这些发现突显了建模会话含义中的持续挑战,并推动了更多上下文敏感的评估框架。

英文摘要

Human conversation relies heavily on conversational implicature, in which speakers convey meanings that are suggested rather than explicitly stated. Although recent large language models exhibit strong conversational fluency, they remain unreliable when interpretation depends on reasoning that integrates social and contextual cues, a process rarely articulated in text. We introduce DRInQ, a benchmark for evaluating pragmatic reasoning about conversational implicature in question utterances, designed to isolate pragmatic variation while holding each question's surface form fixed. To support scalable evaluation, we propose a semi-automated pipeline that produces question-context-interpretation instances with systematic variation. Across evaluations, we find a consistent generation-inference asymmetry: while state-of-the-art models can generate plausible pragmatic scenarios when guided, they often fail to recover the intended implication at inference time. For smaller models, structured prompting improves alignment with human judgments. A comparative writing study further reveals complementary strengths: human authors tend to produce safer, predictable contexts, whereas models generate varied scenarios with interpretations that sometimes exceed contextual support. These findings highlight persistent challenges in modeling conversational implicature and motivate more context-sensitive evaluation frameworks.

2605.24266 2026-05-26 cs.CL cs.AI

An Interactive Paradigm for Deep Research

深度研究的交互式范式

Lin Ai, Victor S. Bursztyn, Xiang Chen, Julia Hirschberg, Saayan Mitra

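AI summary: Proposes SteER, a framework that achieves user alignment in deep research through interpretable mid-process control, cost-benefit decisions, and a live persona model, outperforming existing baselines.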

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展使得深度研究系统能够通过结合检索、推理和生成,为开放式查询合成全面、报告式的答案。然而,大多数框架依赖于僵化的流程,采用一次性范围界定和长时间自主运行,如果用户意图在过程中发生变化,几乎没有修正的空间。我们提出了SteER,一个可引导的深度研究框架,将可解释的中间过程控制引入长周期研究流程中。在每个决策点,SteER使用成本效益公式来确定是暂停等待用户输入还是自主继续。它结合了多样性感知规划与奖励对齐、新颖性和覆盖率的效用信号,并维护一个在会话过程中不断演化的实时用户模型。SteER在对齐方面比最先进的开源和专有基线高出最多22.80%,在广度、平衡等质量指标上领先,并且在85%以上的成对对齐判断中被人类读者偏好。我们还引入了一个用户查询基准和数据生成流水线。据我们所知,这是第一个以交互式、可解释的控制范式推进深度研究的工作,为长形式任务中可控、用户对齐的智能体铺平了道路。

英文摘要

Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to open-ended queries by combining retrieval, reasoning, and generation. Yet most frameworks rely on rigid workflows with one-shot scoping and long autonomous runs, offering little room for course correction if user intent shifts mid-process. We present SteER, a framework for Steerable deEp Research that introduces interpretable, mid-process control into long-horizon research workflows. At each decision point, SteER uses a cost-benefit formulation to determine whether to pause for user input or to proceed autonomously. It combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session. SteER outperforms state-of-the-art open-source and proprietary baselines by up to 22.80% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in 85%+ of pairwise alignment judgments. We also introduce a persona-query benchmark and data-generation pipeline. To our knowledge, this is the first work to advance deep research with an interactive, interpretable control paradigm, paving the way for controllable, user-aligned agents in long-form tasks.

2605.24261 2026-05-26 cs.LG cs.SY eess.SY

Optimizing Digital Therapeutic Interventions: Online Learning under Endogenous Adherence

优化数字治疗干预:内源性依从性下的在线学习

Eric Pulick, Stephanie Carpenter, Matthew Buman, Yonatan Mintz

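AI summary: For digital therapeutics in chronic disease, where patient adherence is affected by both recommendations and past adherence, proposes a decision support framework built on a linear dynamical system with a logit link, and designs the optimism-based UCB-BOLD algorithm, which achieves sublinear regret.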

Comments 48 pages, 6 figures

详情
AI中文摘要

临床医生管理慢性病干预面临的一个关键挑战是在信息和资源有限的情况下维持患者的长期健康。数字治疗(DT)通过重复互动(例如每日治疗建议)提供了一种成本效益高的方式来大规模管理干预,但患者的成功高度依赖于他们的依从性。行为心理学表明,治疗建议和过去的依从性都会影响未来的依从性,然而现有的DT决策支持框架仅建模建议效应或将依从性视为外生背景,在模型和算法开发上留下了关键空白。为填补这一空白,我们提出了一个DT决策支持框架,该框架同时捕捉建议和依从性效应,使临床医生能够更好地规划治疗建议。我们使用线性动力系统(LDS)对患者随时间变化的治疗参与能力进行建模,该系统同时捕捉建议和依从性效应,并通过logit链接与依从性行为内生连接。我们建立了该模型的有限时间辨识保证,将LDS结果扩展到我们的设置。接下来,我们提出了一种基于乐观主义的算法UCB-BOLD用于在线治疗选择,并证明其实现了亚线性遗憾。我们通过使用微随机试验数据生成的合成患者队列进行消融研究,将UCB-BOLD与基准进行了比较。DT决策支持工具可以包含动态模型,使决策者能够有效利用DT设置中的数据,通过有效的资源分配改善患者健康。虽然短视或启发式方法对某些患者类型足够,但对于其他患者,明确规划建议和依从性效应的好处显著;UCB-BOLD的条件风险价值遗憾比次优基准低2-3倍。

英文摘要

A critical challenge facing clinicians managing chronic disease interventions is sustaining long-run patient health given limited information and resources. Digital therapeutics (DTs) provide a cost-effective way to manage interventions at scale through repeated interactions (e.g. daily treatment recommendations), but patient success is highly dependent on their adherence. Behavioral psychology suggests that both treatment recommendations and past adherence affect future adherence, yet existing decision support frameworks for DTs model only recommendation effects or treat adherence as exogenous context, leaving a key gap in model and algorithm development. To address this gap, we present a DT decision support framework that captures both recommendation and adherence effects, allowing clinicians to better plan treatment recommendations. We model a patient's time-varying capacity for engagement with treatment using a linear dynamical system (LDS) that captures both recommendation and adherence effects, endogenously connected to adherence behavior with a logit link. We establish finite-time identification guarantees for this model, extending LDS results to our setting. Next, we propose an optimism-based algorithm, UCB-BOLD, for online treatment selection and prove that it achieves sublinear regret. We evaluate UCB-BOLD against benchmarks via ablation studies on a synthetic patient cohort generated using micro-randomized trial data. DT decision support tools can include dynamical models to enable decision makers to efficiently use the data in DT settings to improve patient health through effective resource allocation. While myopic or heuristic approaches suffice for some patient types, the benefits of explicitly planning around recommendation and adherence effects are significant for others; UCB-BOLD achieves 2-3x lower conditional value-at-risk regret than the next-best benchmark.
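The modeling idea, a latent engagement state that evolves linearly and drives adherence through a logit link, can be sketched with a scalar toy simulator. The coefficients `a`, `b`, `c`, the scalar state, and the function names are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random

def adherence_probability(state):
    """Logit link: P(adhere) = sigmoid(state)."""
    return 1.0 / (1.0 + math.exp(-state))

def next_state(state, recommendation, adhered, a=0.9, b=-0.3, c=0.5):
    """Scalar linear dynamics driven by the recommendation and realized adherence."""
    return a * state + b * recommendation + c * adhered

def simulate(recommendations, initial_state=0.0, seed=0):
    """Roll the model forward; returns the binary adherence history."""
    rng = random.Random(seed)
    state, history = initial_state, []
    for u in recommendations:
        y = 1 if rng.random() < adherence_probability(state) else 0
        history.append(y)
        state = next_state(state, u, y)  # past adherence feeds back into the state
    return history
```

The feedback of `y` into `next_state` is what makes adherence endogenous in this toy: a planner that ignores it would treat each day's adherence as unaffected by earlier behavior.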

2605.24251 2026-05-26 cs.LG cs.CV

Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

重新思考边缘上的持续异常检测:在现实工业条件下进行基准测试

Chad Weatherly, Sen Lin

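AI summary: To address gaps in evaluation, comparison, and edge deployment constraints in existing continual anomaly detection methods, proposes a unified benchmark and DINOSaur, a training-free method that outperforms all evaluated methods across all five protocols and achieves fast inference and adaptation on edge devices.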

详情
AI中文摘要

持续异常检测(CAD)解决了工业检测系统适应不断变化的生产条件的需求,但现有方法存在三个关键差距:不现实的评估、缺乏系统比较以及未考虑边缘部署约束。我们引入了一个统一的基准,结合了结构和逻辑异常的离散任务评估、一种新颖的连续漂移协议、对所有已发布CAD方法的首次头对头比较,以及在边缘硬件上的计算效率分析。我们的结果表明,现有的CAD方法并不一致地优于带有简单经验重放的传统方法。受此启发,我们提出了DINOSaur,一种无需训练的方法,结合了冻结的DINOv3骨干网络、空间索引的coreset记忆和邻域限制的异常评分。DINOSaur通过构造实现了零遗忘,在所有五种协议上优于所有评估的方法,并在NVIDIA Jetson Orin Nano上以低于100毫秒的推理速度运行,在设备上适应新任务的时间不到30秒。

英文摘要

Continual anomaly detection (CAD) addresses the need for industrial inspection systems to adapt to evolving production conditions, yet existing methods share three critical gaps: unrealistic evaluation, no systematic comparison, and no consideration of edge deployment constraints. We introduce a unified benchmark combining discrete-task evaluation on structural and logical anomalies, a novel continuous drift protocol, the first head-to-head comparison of all published CAD methods, and computational efficiency profiling on edge hardware. Our results reveal that existing CAD methods do not consistently outperform traditional approaches with simple experience replay. Thus motivated, we propose DINOSaur, a training-free method combining a frozen DINOv3 backbone with spatially-indexed coreset memory and neighborhood-restricted anomaly scoring. DINOSaur achieves zero forgetting by construction, outperforms all evaluated methods across all five protocols, and runs at sub-100 ms inference on an NVIDIA Jetson Orin Nano, with on-device adaptation to new tasks in under 30 seconds.

2605.24249 2026-05-26 cs.LG

PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

PrivFusion: 一种用于协调分布式数据集的隐私保护多智能体框架

Anisa Halimi, Liubov Nedoshivina, Kieran Fraser, Stefano Braghin

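AI summary: Proposes PrivFusion, a multi-agent framework that automatically harmonizes heterogeneous structured datasets, providing privacy-preserving data alignment before federated learning and reducing manual effort.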

Comments Accepted by IEEE CBMS 2026

详情
AI中文摘要

临床数据的日益可用性增加了机器学习的使用,但集中式数据聚合对于敏感健康信息通常不可行。联邦学习提供了一种分布式替代方案,但其采用受到机构数据集间显著异质性的限制,使得协调成为多站点分析的关键但经常被忽视的前提。我们引入了PrivFusion,一个隐私保护的多智能体框架,在联邦训练之前自动协调结构化数据集。PrivFusion使用智能体分析本地数据,跨站点聚类语义相似的特征,并提供迭代转换建议直到实现对齐。在四个异构COVID-19数据集上的评估表明,PrivFusion有效且高效地协调了多站点数据,同时大幅减少了人工工作量。

英文摘要

The growing availability of clinical data has increased the use of machine learning, yet centralized data aggregation is often infeasible for sensitive health information. Federated Learning (FL) offers a distributed alternative, but its adoption is limited by substantial heterogeneity across institutional datasets, making harmonization a critical but frequently overlooked prerequisite for multi-site analytics. We introduce PrivFusion, a privacy-preserving multi-agent framework that automates the harmonization of structured datasets prior to federated training. PrivFusion uses agents to analyze local data, cluster semantically similar features across sites, and provide iterative transformation recommendations until alignment is achieved. Evaluation across four heterogeneous COVID-19 datasets demonstrates that PrivFusion effectively and efficiently harmonizes multi-site data while substantially reducing manual effort.