arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02953 2026-06-03 cs.CL

Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

Claire Bonial, Claire Benet Post, Laura Michaelis, Harish Tayyar Madabushi

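AI summary: Tests whether large language models are shaped by two statistical signals, entrenchment (high-frequency usage) and preemption (a structure never observed where it would be expected), and finds that models recognize constructional productivity in cases of coercion but cannot use negative evidence to avoid overgeneralization.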

English abstract

Usage-based theories of grammar posit that the creative productivity of the structures of language is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high-frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear. Large Language Models are also usage-based, in the sense that the structures of language are learned through exposure to vast amounts of text. Here, we test whether the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs. We demonstrate across model architectures that larger models recognize, and can reproduce with nonce words, constructional productivity (entrenchment) in cases of coercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous but never observed in data.

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC

SCOPE: Real-Time Natural Language Camera Agent at the Edge

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

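AI summary: Presents SCOPE, a modular agent for natural-language control of PTZ cameras that performs real-time perception, planning, and control at the edge, and evaluates latency, accuracy, and error modes in simulation and on physical hardware.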

Journal ref
Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026
Comments
9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16-19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE
English abstract

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02948 2026-06-03 cs.LG cs.DS

From Non-Convex to Strongly Convex: Curvature-Adaptive FTPL for Online Optimization

Moses Charikar, Chirag Pabbaraju, Ambuj Tewari

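AI summary: Proposes a curvature-adaptive FTPL algorithm whose time-varying perturbation scale attains the optimal regret bound for non-convex Lipschitz losses and reaches logarithmic regret when cumulative curvature grows linearly.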

English abstract

Curvature adaptivity is a classical theme in online optimization: for convex Lipschitz losses, adaptive methods interpolate between the optimal $O(\sqrt{T})$ regret for general convex losses and $O(\log T)$ regret under strong convexity. Recent work has shown that Follow-the-Perturbed-Leader (FTPL) achieves optimal $O(\sqrt{T})$ regret even for online non-convex Lipschitz losses, assuming access to an approximate offline-optimization oracle, but these guarantees do not exploit curvature. We show that FTPL can be made curvature-adaptive in the non-convex setting, without knowing in advance how curvature will accumulate over time. Our algorithm replaces the fixed perturbation scale of standard FTPL with a time-varying scale chosen using only past information. We give a simple follow-the-leader tuning rule for this scale and show that it competes, up to constants, with the best choice in hindsight. The resulting method achieves $O(\sqrt{T})$ regret for arbitrary non-convex Lipschitz losses and improves as cumulative curvature grows; with sufficiently accurate oracle calls, it achieves $O(\log T)$ regret when cumulative curvature grows linearly, which includes the classical strongly convex regime. We complement these upper bounds with matching lower bounds for prescribed cumulative-curvature sequences, already for one-dimensional convex losses, showing that the tradeoff between worst-case non-convex regret and curvature-driven fast rates is intrinsic.

2606.02947 2026-06-03 cs.LG cs.CV

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

Ivan Sabolić, Marin Oršić, Josip Šarić, Sven Lončarić

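AI summary: Proposes BYORn, a framework that identifies and replaces semantically implausible backdoor target responses, breaking the link between triggers and target outputs and improving robustness to backdoor attacks while preserving clean-task performance.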

Comments
Accepted to ICML 2026
English abstract

Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image-text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.

2606.02946 2026-06-03 cs.LG cs.CR

Outsmarting the Chameleon: Counterfactual Decoupling for Tactical OOD Shifts in Live Streaming Risk Assessment

Yiran Qiao, Jing Chen, Jiaqi Xu, Yang Liu, Qiwei Zhong, Xiang Ao

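AI summary: Addresses malicious actors in live streaming who evade detection through tactical OOD shift by proposing LPCD, a latent-causal counterfactual decoupling framework that models intent and narrative variation at the latent level and enforces latent counterfactual consistency for robust risk assessment.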

Comments
Accepted by KDD'26
English abstract

Live streaming has emerged as a primary medium for social interaction and digital commerce, yet it is increasingly plagued by sophisticated risks. A fundamental challenge in this domain is tactical out-of-distribution (OOD) shift: while malicious actors maintain stable underlying objectives, they continuously redesign narrative packaging to evade detection. Such adversarial shifts expose critical limitations of existing OOD generalization paradigms, whose assumptions are difficult to satisfy in the presence of tightly coupled intent-tactic evolution and ill-defined raw-level counterfactuals. In this paper, we tackle this issue from a latent causal perspective and propose Latent-Predictive Counterfactual Decoupling (LPCD), a plug-in framework for robust live streaming risk assessment. LPCD enables counterfactual reasoning under adversarial tactical re-packaging by modeling intent and narrative variation at the latent level, and enforces latent counterfactual consistency to anchor risk prediction on causally stable malicious intent. At inference time, LPCD applies a lightweight, parameter-free calibration to further mitigate tactic-induced distribution shifts. Extensive experiments on large-scale industrial datasets and online production traffic demonstrate that LPCD consistently outperforms state-of-the-art baselines, validating its effectiveness in moderating evolving adversarial risks in real-world live streaming. The project page is available at https://qiaoyran.github.io/LiveStreamingRiskAssessment/.

2606.02939 2026-06-03 cs.LG eess.SP

ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification

Charlotte Genevier Wyman, Leanne Hirshfield

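AI summary: Proposes ERP-XTTN, a prototype-guided cross-attention architecture that delivers interpretable ERP classification in a calibration-free cross-subject setting and reveals the neurophysiological basis of classification errors.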

English abstract

Interpretable brain-computer interface classifiers that generalize across subjects without calibration remain an open challenge. We test whether prototype-based cross-attention can provide competitive, interpretable event-related potential (ERP) classification under deployment-compatible conditions. We propose ERP-XTTN, a cross-attention architecture that routes input EEG patches to fixed difference-wave prototypes via query-key-only cross-attention with no value projection, so classification depends entirely on attention routing and attention faithfulness is structural rather than post-hoc. Prototypes are derived automatically from extrema in the training-fold difference wave. We evaluate across three public sources (BNCI Horizon 2020, HRI Cursor, and ERP CORE) spanning eight ERP components (ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400), using leave-one-subject-out (LOSO) evaluation with causal filtering at two channel counts (3-channel and full montage), against EEGNet and xDAWN with Riemannian geometry (xDAWN+RG). The mean gap between the best baseline and ERP-XTTN was .018 AUROC at 3 channels and .034 at full montage, arising from two largely distinct sources: a temporal-flexibility cost relative to EEGNet and a spatial-exploitation cost relative to xDAWN+RG, the latter driven by signal-to-noise ratio at full montage. Beyond accuracy, the transparent routing reveals cross-subject signal structure that black-box models cannot: false positives resembled true positives more than true negatives did, indicating that classification errors are neurophysiologically explicable. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with a small interpretability cost at minimal montages. To our knowledge, this is the first epoch-level LOSO benchmark on ERP CORE.

2606.02936 2026-06-03 cs.LG

Hierarchical RBF-KAN and RBF-SKAN Architectures for Multidimensional Function Approximation and Random Field Learning

Mingtao Xia, Qijing Shen

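AI summary: Proposes and analyzes hierarchical Kolmogorov-Arnold network architectures with radial basis function activations for approximating deterministic functions and random field models, and proves their universal approximation property and their potential to alleviate the curse of dimensionality.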

English abstract

In this manuscript, we propose and analyze hierarchical Kolmogorov-Arnold neural network architectures employing radial basis functions as activation functions for approximating deterministic functions and random field models. Specifically, we develop a hierarchical radial-basis-function Kolmogorov-Arnold network (hierarchical RBF-KAN) for multidimensional deterministic function approximation and a hierarchical radial-basis-function stochastic Kolmogorov-Arnold network (hierarchical RBF-SKAN) for random field learning. From a theoretical perspective, we establish universal approximation results for both architectures. In particular, we derive quantitative approximation estimates for the hierarchical RBF-KAN, showing that the proposed framework has the potential to partially alleviate the curse of dimensionality in learning high-dimensional functions by reducing the effective dimensionality of the approximation problem. Furthermore, we show that the hierarchical RBF-SKAN can approximate random field models under the Wasserstein-2 metric. Empirically, we show that our proposed radial-basis-function-based neural network structure can effectively learn multivariate functions and random field models.

2606.02935 2026-06-03 cs.CV cs.CE

CAD-to-CT Registration of Cylindrical Objects via Ellipse-Based Axis Estimation

Aleksander Ogonowski, Mikołaj Mrozowski, Daniel Więcek, Arkadiusz Ćwiek, Konrad Klimaszewski, Rafał Możdżonek, Adam Padee, Lech Raczyński, Piotr Wasiuk, Wojciech Wiślicki, Michał Matusiak, Sławomir Wronka

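AI summary: Proposes a two-stage geometric registration method that estimates the rotation axis from elliptical cross-sections detected in CT slices, then voxelizes the CAD model and maximizes its volumetric overlap with the CT scan, registering cylindrical objects (ionization chambers) with tilt and orientation errors below 0.1° and without intensity calibration or feature matching.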

English abstract

Accurate registration of CAD models to CT scans is essential for establishing ground truth geometry in volumetric imaging. Obtaining reliable object masks is of growing importance in machine learning settings; as recent architectures grow more capable, huge datasets are required to fully utilise their capabilities. Traditional intensity-based methods fail when CT grayscale values lack calibration references, while point-based algorithms (e.g., ICP, RANSAC) require feature correspondence unavailable between idealized CAD geometry and noisy volumetric CT data. We propose a two-stage geometric registration method for cylindrical objects (ionization chambers) that takes advantage of the distinctive geometric features of the objects. First, we estimate the 3D rotation axis by detecting elliptical cross-sections across CT slices, fitting ellipses to edge-detected contours, and performing PCA on the fitted ellipse centers after RANSAC outlier removal. Second, we voxelize the CAD model, orient it along the detected axis, and maximize volumetric overlap with the CT scan through translational adjustment. This approach achieves robust registration with tilt and orientation errors below $0.1^\circ$ without intensity calibration or feature matching. Once registered, the aligned CAD model provides ground truth geometry for applications including machine learning-based object localization and automated analysis in industrial CT workflows.
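
The axis-estimation stage described in this abstract (PCA on fitted ellipse centers) can be illustrated with a small sketch. It is not the authors' implementation: it assumes the ellipse centers have already been fitted and cleaned of outliers, skips the edge detection, ellipse fitting, and RANSAC steps, and uses hypothetical names (`principal_axis`, `tilt_degrees`). The first principal component is found by power iteration on the 3x3 covariance matrix.

```python
import math


def principal_axis(points, iters=200):
    """Return (centroid, unit direction) of the best-fit line through 3D points.

    The direction is the dominant eigenvector of the covariance matrix,
    i.e. the first principal component, found by power iteration.
    """
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]

    # Assumption: the start vector is not orthogonal to the true axis.
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate input: axis is undefined")
        v = [x / norm for x in w]

    # Fix the sign so the axis points toward increasing z (slice index).
    if v[2] < 0:
        v = [-x for x in v]
    return mean, v


def tilt_degrees(axis):
    """Angle between a unit axis and the scanner z-axis, in degrees."""
    return math.degrees(math.acos(max(-1.0, min(1.0, axis[2]))))


# Hypothetical ellipse centers from ten slices of a slightly tilted cylinder.
centers = [(0.1 * z, 0.0, float(z)) for z in range(10)]
centroid, axis = principal_axis(centers)
tilt = tilt_degrees(axis)
```

With these synthetic centers the recovered tilt equals `atan(0.1)` expressed in degrees, since the centers drift 0.1 units in x per slice.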

2606.02928 2026-06-03 cs.RO

Improved Postural Stability Using a Lightweight Semi-Active Soft Back Support Device Under Standing Perturbations

Rohan Khatavkar, Jiefeng Sun, Hyunglae Lee

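AI summary: Presents a lightweight semi-active soft back support device combining a pneumatic artificial muscle with an elastic band; by rapidly supplying assistive force it significantly reduces whole-body angular momentum and increases the margin of stability, improving balance recovery after standing perturbations.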

Comments
6 pages, 8 figures, submitted to IROS 2026, the IEEE/RSJ International Conference on Intelligent Robots and Systems
English abstract

Older adults are particularly susceptible to falls following perturbations during standing, such as forward loss of balance. Back support devices that assist trunk extension may help mitigate fall risk by preventing excessive trunk flexion. Previous studies have investigated heavy back support devices; however, these systems often introduced adverse effects on stability due to their added mass, which shifted the body's natural center of mass unfavorably. In contrast, lightweight passive devices have shown limited benefits, as they can generate only modest assistive forces during the relatively small trunk flexion associated with forward balance loss. In this study, we evaluated the effects of a lightweight semi-active soft back support device on postural stability following standing perturbations. Our device combines an active element (a pneumatic artificial muscle) in parallel with a passive elastic band. The active element rapidly provides assistive force following a perturbation, overcoming the limitations of passive devices. Experiments conducted with five healthy individuals demonstrated that the semi-active device significantly reduced whole-body angular momentum and increased the margin of stability, indicating improved balance recovery performance. These results highlight the promise of semi-active soft wearable robots as an effective and lightweight strategy for fall prevention during standing perturbations.

2606.02927 2026-06-03 cs.CV

SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks

Mourad Zaied

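AI summary: Proposes the SALU activation as a replacement for normalization layers and builds SaluNet, which trains deep networks stably without normalization and performs strongly on several datasets.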

Comments
34 pages
English abstract

Normalization layers such as BatchNorm and LayerNorm have long been considered essential for stable training in deep networks. This work demonstrates that they can be fully replaced by a single learnable activation mechanism. We identify a plasticity suppression effect induced by standard normalization: learnable activation parameters rapidly lose adaptability when paired with normalization layers. Motivated by this observation, we introduce SALU (Saturated Adaptive Linear Unit), $\operatorname{SALU}(x;a,b) = \frac{a x}{\sqrt{1 + a b x^2}}$ with $a>0$ and $b>0$, a bounded, learnable activation that provides intrinsic signal stabilization without relying on batch statistics or external affine parameters. Building on SALU, we propose SaluNet, a paradigm grounded in total plasticity: SALU replaces normalization layers, while SWALU and GALU replace standard activations. With ResNet-18, SaluNet-C-18 achieves 97.35% on CIFAR-10 and 83.25% on CIFAR-100 without normalization, maintaining 93.44% and 76.23% at batch size 1, where normalized architectures fail. For transformers, SaluNet-T improves over LayerNorm-GELU from 90.92% to 91.01% on CIFAR-10 and from 66.54% to 68.10% on CIFAR-100. SaluNet-C-50 reaches 78.67% Top-1 on ImageNet-1K at $224\times224$, and 79.23% at $288\times288$. These results suggest that normalization layers suppress total plasticity, a property that biological neurons inherently possess and that enables deep networks to learn effectively.
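
The SALU formula in this abstract can be evaluated directly. The scalar sketch below only illustrates the stated expression and its boundedness; it treats `a` and `b` as fixed numbers rather than learnable parameters, and the helper `salu_bound` is a name introduced here. The bound follows from the formula: since $\sqrt{1 + a b x^2} > \sqrt{a b}\,|x|$, the output magnitude stays below $\sqrt{a/b}$.

```python
import math


def salu(x: float, a: float, b: float) -> float:
    """SALU(x; a, b) = a*x / sqrt(1 + a*b*x^2), defined for a > 0, b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("SALU requires a > 0 and b > 0")
    return a * x / math.sqrt(1.0 + a * b * x * x)


def salu_bound(a: float, b: float) -> float:
    """Supremum of |SALU(x; a, b)| over all x, which is sqrt(a / b)."""
    return math.sqrt(a / b)


# Near zero the function is approximately linear with slope a;
# for large |x| it saturates toward +/- sqrt(a / b).
small = salu(1e-4, 2.0, 0.5)
large = salu(100.0, 2.0, 0.5)
```

Here `a` sets the slope at the origin and the ratio `a / b` sets the saturation level, which is how a single activation can stand in for the rescaling that a normalization layer would otherwise provide.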

2606.02924 2026-06-03 cs.CV

ATLAS: A Large-Scale Evaluation Benchmark for Adversarial LiDAR Perception

Mellon M. Zhang, Siddhant Panse, Zimo Fan, Akshal Dhal, Rishit Sarkar, Glen Chou

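AI summary: To fill the gap in evaluating LiDAR perception robustness under black-box sensor attacks, introduces ATLAS, the first large-scale, physically grounded benchmark; using point injection and point removal attacks, it reveals an asymmetry between model performance and robustness and traces it to standard data augmentation.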

Comments
preprint
English abstract

Autonomous driving perception is typically evaluated on clean benchmark data, yet real-world deployment requires robustness to rare, structured, and potentially adversarial sensor anomalies. This gap is especially critical for LiDAR, where external actors can physically manipulate the sensing process to induce black-box perception failures without accessing the model. Existing LiDAR benchmarks provide little visibility into this failure mode. Prior adversarial LiDAR studies have largely centered on attack hardware, geometric and algorithmic defenses, and early-generation detectors, leaving the robustness of modern perception systems unexplored. To address this evaluation gap, we introduce ATLAS (Adversarial Temporal LiDAR Attack Suite), the first large-scale, physically grounded evaluation benchmark for LiDAR perception models under black-box sensor attacks, simulating the two primary attack modes (point injection and point removal) across real driving sequences. Evaluating a broad cross-section of current state-of-the-art LiDAR perception models, ATLAS reveals a surprising robustness asymmetry: models with stronger performance on standard benchmarks tend to better withstand removal attacks, yet are actually more vulnerable to injection attacks than weaker models. We trace this vulnerability to standard object database sampling augmentations, revealing how current training practices can induce architecture-agnostic robustness failures, and study initial directions for mitigating both attack modes. We release the ATLAS generation code to support extensible, reproducible evaluations as attack capabilities evolve, helping make black-box sensor robustness an explicit consideration in future LiDAR perception development.

2606.02920 2026-06-03 cs.LG

Fast Unlearning at Scale via Margin Self-Correction

Federico Di Gennaro, Alexander Shevchenko, Fanny Yang

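AI summary: Proposes MASC, which uses an online stopping rule to unlearn efficiently in language models without downstream evaluation, substantially reducing computational cost.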

English abstract

Language-model unlearning updates a trained model to behave as if it had not seen selected training examples, while preserving utility and avoiding costly retraining. Existing approaches typically fine-tune the pretrained model with a fixed training budget and select the final model afterwards by evaluating several saved checkpoints on downstream validation data. Two sources of unnecessary computation limit scalability: training beyond the desired forget-retain trade-off, and checkpoint selection that requires extra storage and repeated evaluations. To address these limitations, we introduce MArgin Self-Correction (MASC), an efficient unlearning method with an online stopping rule that does not require downstream evaluation. Given a text sequence to be forgotten, MASC actively reduces the logit gap between the original next token and the most likely alternatives. It outputs a final model once this gap is small on average over a sufficiently large proportion of token positions across all forget sequences. On TOFU, MUSE News, and MUSE Books, MASC achieves a competitive forget-retain trade-off at a fraction of the computational cost of existing baselines. We further observe that as model size (i.e., the number of parameters) increases, the trade-offs improve for both MASC and SimNPO: the forget metrics remain comparable while retain utility increases.
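
The logit gap and stopping rule described in this abstract can be sketched as below. This is one possible reading, not the MASC algorithm itself: the abstract does not give the exact criterion, so the threshold `gap_threshold`, the fraction `min_fraction`, and both function names are assumptions made for illustration, and the gap is taken against the single most likely alternative token.

```python
def logit_margin(logits, target):
    """Gap between the original token's logit and the best alternative.

    A positive margin means the model still prefers the original token.
    """
    best_other = max(v for i, v in enumerate(logits) if i != target)
    return logits[target] - best_other


def should_stop(margins, gap_threshold=0.0, min_fraction=0.9):
    """Hypothetical online stopping check over all forget-set positions.

    Stops once at least `min_fraction` of token positions have a margin
    at or below `gap_threshold`. Both values are illustrative assumptions.
    """
    small = sum(1 for m in margins if m <= gap_threshold)
    return small / len(margins) >= min_fraction


# Toy vocabulary of three tokens; the original next token has index 1.
margin_before = logit_margin([2.0, 5.0, 1.0], target=1)
margin_after = logit_margin([2.0, 1.5, 1.0], target=1)
```

Because the check needs only logits on the forget sequences, it can run during training, which is what lets a rule of this kind replace checkpoint selection on downstream validation data.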

2606.02915 2026-06-03 cs.CV

Any2Poster: Any-Source Poster Generation Across Modalities and Domains

Amogh Vinaykumar, Aiden Li, Suozhi Huang, Shilong Liu

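AI summary: Introduces the Any2Poster Bench benchmark and the Any2Poster Agent, which generate posters from many input modalities and domains, and validates information fidelity and visual communication with quiz-based and visual evaluation.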

Comments
Project Page: https://github.com/Any2Poster/Any2Poster
English abstract

Visual posters are a compact medium for communicating dense information, yet progress on automatic poster generation remains difficult to measure because existing evaluations are often restricted to paper-only inputs, narrow domains, or surface-level visual similarity. We introduce Any2Poster Bench, a benchmark for any-source poster generation that evaluates systems across eight input modalities (PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks, and videos) and five content domains. Any2Poster Bench pairs each source with quiz-based probes of verbatim factual retention and interpretive understanding, together with VLM-based judgments of visual quality, layout, readability, content completeness, and logical flow, enabling reproducible assessment of both information fidelity and visual communication. To instantiate and validate this benchmark, we further present Any2Poster Agent, an end-to-end reference agent that parses heterogeneous sources, organizes salient content, plans poster layouts, renders posters, and iteratively refines them using visual feedback. On Any2Poster Bench, Any2Poster Agent achieves 87.25% average accuracy across input modalities and 87.28% across content domains. On PaperQuiz-style evaluation, where prior paper-to-poster agents are directly comparable, Any2Poster Agent improves over PosterAgent-4o from 51.06-51.33% to 72.58% overall accuracy and from 116-121 to 145.16 in density-augmented score. Together, Any2Poster Bench and Any2Poster Agent provide a reusable evaluation resource and a competitive baseline for studying multimodal, domain-general poster generation.

2606.02911 2026-06-03 cs.CL

The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

Mirko Lai, Alessandra Urbinati, Simona Frenda, Fabiana Vernero, Marco Antonio Stranisci

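AI summary: Proposes a framework combining conformal prediction with collaborative-filtering-style annotator representations; the Ghost Prediction metric and Ghost Annotator representation quantify how far model predictions diverge from all human annotations. Model uncertainty rises with annotator disagreement, yet larger models are more confident on texts aligned with no human annotation, and a structural demographic bias is present.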

English abstract

Current research primarily focuses on model performance, while comparatively less attention has been devoted to uncertainty estimation, particularly in settings where LLMs are increasingly used to generate annotated data. We introduce a framework combining conformal prediction with collaborative-filtering-style annotator representations to model LLM behavior in relation to human annotators and to analyze patterns of agreement and disagreement. Using non-conformity scores, we introduce the Ghost Prediction metric and the Ghost Annotator representation to quantify cases in which model predictions diverge from all available human annotations. We compute cosine similarity measures to explore differences in model behavior across sociodemographic axes. We evaluate four LLMs of different sizes and families across four content moderation datasets. Our findings show that while every model's uncertainty increases with annotator disagreement, larger models tend to be more confident in classifying texts that are not aligned with any human annotation. Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment, suggesting a structural bias likely rooted in pretraining corpora.
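
A split-conformal prediction set and a "ghost" check can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: the non-conformity score is taken to be one minus the model's probability for a label, and a prediction is called a ghost when the conformal set shares no label with the human annotations. The function names and the toy data are hypothetical.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Standard split-conformal threshold at miscoverage level alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """Labels whose non-conformity score (1 - probability) is within qhat."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


def is_ghost(probs, qhat, human_labels):
    """Assumed reading: the model's set overlaps no human annotation."""
    pset = prediction_set(probs, qhat)
    return bool(pset) and pset.isdisjoint(human_labels)


# Hypothetical calibration scores and one model output on a moderation item.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
qhat = conformal_quantile(cal_scores, alpha=0.5)
probs = {"toxic": 0.7, "ok": 0.3}
pset = prediction_set(probs, qhat)
```

In this toy case the set contains only `toxic`, so the item counts as a ghost when every annotator chose `ok` and does not when at least one annotator chose `toxic`.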

2606.02908 2026-06-03 cs.CL cs.AI

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

WRIT: 面向多轮用户代理的写密集型轨迹合成

Hengrui Gu, Xiaotian Han, Kaixiong Zhou

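AI summary: To address the evidence burden that multi-turn user-facing agents face when gathering information and making decisions, proposes WRIT, which synthesizes write-intensive and read-intensive trajectories to train agents to make evidence-grounded decisions under information load. With only 2K trajectories it improves performance and reduces inference-time token usage.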

详情
AI中文摘要

多轮用户代理必须从不完整的请求中推断用户意图,通过对话和工具收集缺失信息,并执行有效操作。训练轨迹将此过程记录为用户消息、代理响应、工具调用等的交错序列。合成足够复杂的轨迹已成为训练代理的核心途径:现有流程通常通过将多个用户请求组合成更长的任务来增加难度,产生训练顺序执行的写密集型轨迹。我们认为,当代理必须在收集和比较大量读工具证据后才能确定其参数时,单个写决策本身可能很困难,这是仅靠写密集型数据无法解决的挑战。基于这一见解,我们提出WRIT(写-读密集型轨迹合成),这是一个沿两个复杂度轴合成多轮代理训练轨迹的流程:任务中写决策的数量和每个决策的证据负担。WRIT首先生成写密集型和读密集型任务。然后,它多样化用户行为指令以反映真实的对话变化,最后在可执行环境中模拟代理-用户交互以生成完整的训练轨迹。由此产生的数据不仅训练代理执行更长的任务,而且在高信息负载下做出稳健的、基于证据的决策。仅用2K合成轨迹,在WRIT上训练的4B模型在$\tau^2$-bench上优于GPT-5.1 no-think,并大幅减少推理时token使用,表明紧凑的SFT数据可以将部分昂贵的测试时推理转化为高效的代理行为。

英文摘要

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

2606.02902 2026-06-03 cs.CY cs.LG

Fairness Definitions and Metrics in Deep Reinforcement Learning for Drug Discovery in Healthcare: A Rapid Evidence Review

医疗保健中深度强化学习的公平性定义与指标:药物发现的快速证据综述

Esmaeil Shakeri, Ronnie de Souza Santos, Behrouz Far

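AI summary: Through a rapid evidence review, this paper systematically summarizes how fairness is defined and measured in deep reinforcement learning for drug molecule generation, and analyzes how dataset composition and reward design affect fairness.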

详情
Comments
10 pages, 6 figures, 3 tables. Accepted as a full paper at a symposium of IEEE COMPSAC 2026
AI中文摘要

深度强化学习(DRL)越来越多地应用于从头分子设计,但数据、奖励和评估的选择可能导致在不同疾病区域和化学类型上的性能不均。尽管如此,目前尚无关于DRL药物发现中公平性如何定义、测量和测试的简明综合。在这篇快速证据综述中,我们综合了医疗保健中DRL驱动分子生成的公平性定义和指标。我们关注三个问题:(i)数据集组成和划分策略(特别是支架划分与随机划分)如何影响评估和分布偏移;(ii)奖励设计(如QED、对接、毒性、合成可及性)如何产生或减轻偏差,重点关注癌症靶点;(iii)哪些可测量指标最能捕捉公平性。这包括癌症与非癌症适应症之间以及癌症亚型之间的均等性。还包括关键物理化学描述符的分布平衡、支架/化学类型多样性、组间有效性、毒性和合成可及性。从2017年起,我们检索了主要的生物医学、计算机科学和工程文献数据库,并使用arXiv进行地平线扫描。记录通过PRISMA式程序筛选,并通过内容编码分析,将报告的均等性结果与数据集和奖励选择联系起来。我们的综述为DRL分子生成提供了一套简洁的公平性定义和指标。它为报告分布均等性和结果均等性提供了实用指南。它还总结了数据集和奖励选择如何与观察到的均等性效应相关,并指出了与可信、癌症相关的DRL生成相关的未解决问题。

英文摘要

Deep reinforcement learning (DRL) is increasingly applied to de novo molecular design, but choices in data, rewards, and evaluation can yield uneven performance across disease areas and chemotypes. Despite this, there is no concise synthesis of how fairness is defined, measured, and tested in DRL-based drug discovery. In this rapid evidence review, we synthesize fairness definitions and metrics for DRL-driven molecule generation in healthcare. We focus on three questions: (i) how dataset composition and split strategies, especially scaffold versus random splits, affect evaluation and distribution shift; (ii) how reward design (e.g., QED, docking, toxicity, synthetic accessibility) can create or mitigate bias, with emphasis on cancer targets; and (iii) which measurable metrics best capture fairness. This includes parity across cancer versus non-cancer indications and across cancer subtypes. It also includes distributional balance in key physicochemical descriptors, scaffold/chemotype diversity, groupwise validity, toxicity, and synthetic accessibility. From 2017 onward, we searched major biomedical, computer science, and engineering literature databases and used arXiv for horizon scanning. Records were screened using PRISMA-style procedures and analyzed via content coding to link reported parity outcomes to dataset and reward choices. Our review provides a concise set of fairness definitions and metrics for DRL molecule generation. It offers practical guidance for reporting distribution parity and outcome parity. It also summarizes how dataset and reward choices relate to observed parity effects and identifies open gaps relevant to trustworthy, cancer-relevant DRL generation.

2606.02892 2026-06-03 cs.LG

Multi-Modal Machine Learning for Breast Cancer Recurrence Prediction

多模态机器学习用于乳腺癌复发预测

Jiahao Shao, Xudong Wang, Anam Nawaz Khan, Christopher Brett, Xueping Li, Bing Yao

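AI summary: This study integrates structured and unstructured clinical data (treatment records, pathology reports, and clinician notes), combined with rule-based regular-expression extraction and a precedence-based conflict resolution strategy, and reports consistently better breast cancer recurrence prediction than single-modal inputs.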

详情
Comments
33 pages, 10 figures
AI中文摘要

乳腺癌复发是幸存者长期死亡的主要原因,需要及时准确的风险评估来指导随访护理和治疗计划。传统预测模型通常局限于结构化或非结构化数据,难以捕捉完整的临床背景。本研究探讨了整合多模态临床数据(包括治疗记录、病理报告和临床笔记)对复发预测的影响。通过结合基于规则的正则表达式提取机制和严格的基于优先级的冲突解决策略,我们的方法有效地从自由文本病理叙述中恢复确定的肿瘤特征,以增强结构化记录。我们还与先前乳腺癌研究中常用的特征集进行性能基准测试,以评估多模态整合的附加价值。单源和多模态输入在一系列机器学习模型上进行评估。结果表明,与单模态方法相比,多模态整合一致地提高了预测准确性。

英文摘要

Breast cancer recurrence, a leading cause of long-term mortality among survivors, requires timely and accurate risk assessment to guide follow-up care and treatment planning. Traditional predictive models, often limited to either structured or unstructured data alone, struggle to capture the full clinical context. This study examines the impact of integrating multi-modal clinical data, including treatment records, pathology reports, and clinician notes, on recurrence prediction. By integrating a rule-based regular expression extraction mechanism with a rigorous precedence-based conflict reconciliation strategy, our approach effectively recovers definitive tumor characteristics from free-text pathology narratives to augment structured records. We also benchmark performance against commonly used feature sets from prior breast cancer studies to assess the added value of multi-modal integration. Single-source and multi-modal inputs are evaluated across a range of machine learning models. Results show that multi-modal integration consistently improves predictive accuracy compared to single-modal methods.
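
Rule-based extraction with precedence-based reconciliation could look roughly like the sketch below. The regular expression, the field (estrogen-receptor status), the source names, and the precedence order are all assumptions made for illustration. The abstract does not specify the paper's actual rules.

```python
import re

# Hypothetical pattern for estrogen-receptor (ER) status in free-text pathology notes.
ER_PATTERN = re.compile(
    r"\b(?:ER|estrogen receptor)\b[^.\n]{0,40}?\b(positive|negative)\b",
    re.IGNORECASE,
)

# Assumed precedence: a lower number wins when sources disagree.
SOURCE_PRECEDENCE = {"pathology_report": 0, "structured_record": 1, "clinician_note": 2}

def extract_er_status(text):
    """Return 'positive' or 'negative' if the pattern matches, else None."""
    match = ER_PATTERN.search(text)
    return match.group(1).lower() if match else None

def reconcile(candidates):
    """Pick the value from the highest-precedence source that has one.

    candidates is a list of (source_name, value_or_None) pairs.
    """
    usable = [(source, value) for source, value in candidates if value is not None]
    if not usable:
        return None
    best_source, best_value = min(usable, key=lambda item: SOURCE_PRECEDENCE[item[0]])
    return best_value

status = reconcile([
    ("clinician_note", extract_er_status("ER negative per outside records.")),
    ("pathology_report", extract_er_status("Estrogen receptor: strongly positive (95%).")),
    ("structured_record", None),
])
```

In this example the pathology report overrides the conflicting clinician note, and the empty structured field is filled from free text.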

2606.02888 2026-06-03 cs.RO

Impact of a Soft Wearable Back-Support Device on Postural Stability during Trip-Like Perturbations

软性可穿戴背部支撑装置在类似绊倒扰动下对姿势稳定性的影响

Yuanhao Chen, Rohan Khatavkar, Soubhagya Nayak, Jiefeng Sun, Hyunglae Lee

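AI summary: Using perturbed-standing and perturbed-walking experiments, studies how a soft wearable back-support device enhances postural stability under trip-like perturbations. Device use increased the minimum Margin of Stability, indicating improved reactive balance control and potential for fall prevention.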

详情
Comments
6 pages, 6 figures, to be published in the proceedings of the 2026 11th IEEE RAS/EMBS International Conference for Biomedical Robotics and Biomechatronics (BioRob)
AI中文摘要

通过两种实验范式(扰动站立和扰动行走)研究了软性可穿戴背部支撑装置在类似绊倒扰动下对姿势稳定性的增强效果。健康受试者在三种不同的背部支撑条件下完成试验:无装置、低刚度装置、高刚度激活装置。使用最大不稳定点的最小稳定裕度(MOS)量化全身稳定性。结果表明,使用装置时MOS增加,表明姿势稳定性增强。在站立条件下,MOS随装置刚度显著增加;而在行走条件下,两种装置条件相比无装置均改善了MOS,但两者之间无显著差异。这些发现凸显了具有可调刚度的软性可穿戴背部支撑装置在改善对外部扰动的反应性平衡控制方面的潜力,对防跌倒具有重要意义。未来研究应探索个性化刚度优化,并评估在跌倒高风险人群中的有效性。

英文摘要

The effectiveness of a soft wearable back-support device in enhancing postural stability was investigated under trip-like perturbations using two experimental paradigms: perturbed standing and perturbed walking. Healthy subjects completed trials under three different back-support conditions: no device, device worn with low stiffness, and device activated with high stiffness. Whole-body stability was quantified using the minimum Margin of Stability (MOS) at the point of maximal instability. Results demonstrated increased MOS during device use, indicating enhanced postural stability. In standing, MOS increased significantly with device stiffness, whereas in walking, both device conditions improved MOS relative to no device but did not differ significantly from each other. These findings highlight the potential of soft wearable back-support devices with adjustable stiffness to improve reactive balance control against external perturbations, with important implications for fall prevention. Future research should explore personalized stiffness optimization and evaluate efficacy in populations at elevated risk of falls.
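
For readers unfamiliar with the metric, the Margin of Stability is commonly computed from the extrapolated center of mass of an inverted-pendulum model (Hof's formulation). The sketch below assumes that one-dimensional formulation. The abstract does not state the exact computation used in the study, and the function names and sample values are hypothetical.

```python
import math

def margin_of_stability(com_pos, com_vel, bos_edge, leg_length, g=9.81):
    """One-dimensional MOS: base-of-support edge minus extrapolated center of mass.

    The extrapolated center of mass is com_pos + com_vel / omega0, where
    omega0 = sqrt(g / leg_length) is the inverted-pendulum eigenfrequency.
    """
    omega0 = math.sqrt(g / leg_length)
    xcom = com_pos + com_vel / omega0
    return bos_edge - xcom

def minimum_mos(samples, bos_edge, leg_length, g=9.81):
    """Smallest MOS over (position, velocity) samples following a perturbation."""
    return min(margin_of_stability(p, v, bos_edge, leg_length, g) for p, v in samples)

# Hypothetical trajectory: (center-of-mass position in m, velocity in m/s).
samples = [(0.00, 0.10), (0.02, 0.30), (0.05, 0.20)]
worst_case = minimum_mos(samples, bos_edge=0.20, leg_length=0.90)
```

A larger minimum MOS means the extrapolated center of mass stayed further inside the base of support at the most unstable instant.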

2606.02887 2026-06-03 cs.LG cs.NA math.NA math.OC

A Nonmonotone Gradient-Based Algorithm for Symmetric Nonnegative Matrix Factorization and Graph Clustering

一种用于对称非负矩阵分解和图聚类的非单调梯度算法

Ryan Swart, Johannes Brust

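AI summary: Proposes the SNMPBB algorithm, the first adaptation of nonmonotone projected Barzilai-Borwein methods to symmetric nonnegative matrix factorization, extends it to graph clustering and large-scale problems, proves global convergence, and reports substantial speedups and accuracy gains in experiments.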

详情
AI中文摘要

对称非负矩阵分解(Symmetric NMF)将矩阵近似为 $WW^T$,其中 $W$ 是非负矩形因子。它在图聚类和机器学习中有广泛应用。与NMF相比,投影梯度方法在对称问题上收敛缓慢。为了解决这个问题,我们引入了SNMPBB,这是非单调投影Barzilai-Borwein方法在对称NMF上的首次应用,表明梯度算法比以前认为的有效得多。我们进一步将SNMPBB扩展到使用图拉普拉斯正则化的图聚类(Graph-SNMPBB)以及使用低秩近似的大规模问题(LAI-SNMPBB)。对于所有变体,我们证明了全局收敛到一阶稳定点,并且Barzilai-Borwein曲率信息在随机近似下得以保留。在合成数据上,SNMPBB在相似残差下比替代的SymANLS快6倍,且优势随秩增加而扩大。在六个真实世界聚类基准测试中,Graph-SNMPBB匹配或超过了SymANLS的精度。最后,LAI-SNMPBB在34个SuiteSparse矩阵上,在运行时间和残差质量方面均优于最先进的LAI-SymPGNCG。

英文摘要

Symmetric nonnegative matrix factorization (Symmetric NMF) approximates a matrix as $WW^T$ with a nonnegative rectangular factor $W$. It has broad applications in graph clustering and machine learning. In contrast to standard NMF, projected gradient methods for the symmetric problem have been associated with slow convergence. To address this, we introduce SNMPBB, the first adaptation of nonmonotone projected Barzilai-Borwein methods to Symmetric NMF, demonstrating that gradient algorithms are significantly more effective than previously understood. We further extend SNMPBB to graph clustering using graph Laplacian regularization (Graph-SNMPBB) and to large problems with low-rank approximations (LAI-SNMPBB). For all variants we prove global convergence to first-order stationary points and also that Barzilai-Borwein curvature information is preserved with randomized approximations. On synthetic data, SNMPBB achieves a 6x speedup over the alternative SymANLS for similar residuals, with advantages growing at higher ranks. Across six real-world clustering benchmarks, Graph-SNMPBB matches or exceeds SymANLS accuracy. Lastly, LAI-SNMPBB outperforms state-of-the-art LAI-SymPGNCG on 34 SuiteSparse matrices in both runtime and residual quality.
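
The general technique named in the abstract can be sketched as a generic projected Barzilai-Borwein iteration with a nonmonotone (max-of-recent-values) Armijo line search. This is an illustrative sketch and not the authors' SNMPBB: the objective scaling, memory length, safeguards, and test problem are all assumptions.

```python
import numpy as np

def objective(A, W):
    """f(W) = 1/4 * ||A - W W^T||_F^2 for a symmetric matrix A."""
    return 0.25 * np.linalg.norm(A - W @ W.T, "fro") ** 2

def gradient(A, W):
    """Gradient of the objective above; this form relies on A being symmetric."""
    return (W @ W.T - A) @ W

def projected_bb_symnmf(A, W0, iters=200, memory=5, armijo=1e-4):
    """Nonmonotone projected BB iteration; W0 must be elementwise nonnegative."""
    W, g = W0.copy(), gradient(A, W0)
    step = 1.0
    history = [objective(A, W)]
    for _ in range(iters):
        # Projected direction: step along -g, then clip to the nonnegative orthant.
        d = np.maximum(W - step * g, 0.0) - W
        slope = float(np.sum(g * d))      # non-positive for this direction
        f_ref = max(history[-memory:])    # nonmonotone reference value
        t = 1.0
        while objective(A, W + t * d) > f_ref + armijo * t * slope and t > 1e-10:
            t *= 0.5
        W_new = W + t * d
        g_new = gradient(A, W_new)
        s, y = W_new - W, g_new - g
        sy = float(np.sum(s * y))
        # BB step <s, s> / <s, y>, with a fallback when curvature is not positive.
        step = float(np.clip(np.sum(s * s) / sy, 1e-8, 1e8)) if sy > 1e-12 else 1.0
        W, g = W_new, g_new
        history.append(objective(A, W))
    return W, history

# Hypothetical test problem: an exactly factorizable symmetric matrix.
rng = np.random.default_rng(0)
W_true = rng.random((30, 3))
A = W_true @ W_true.T
W_hat, history = projected_bb_symnmf(A, rng.random((30, 3)))
```

The line search compares against the maximum of the last few objective values instead of only the most recent one, so individual iterations may increase the objective while the sequence as a whole still decreases.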

2606.02884 2026-06-03 cs.LG cs.AI

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

我们真的在倾斜吗?流模型和扩散模型中奖励引导的机制

Sanjit Dandapanthula, Nicholas M. Boffi

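AI summary: Through closed-form analysis of Gaussian mixture models with quadratic rewards, this paper shows that reward hacking in reward-guided diffusion arises from finite-particle plug-in estimation of the Doob h-function, and proposes a closed-form reward damping schedule that corrects the within-mode bias with no additional compute.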

详情
AI中文摘要

奖励引导算法在推理时将学习到的生成过程导向奖励倾斜的测度。虽然经验上强大,但这些方法容易产生奖励黑客行为:引导模型以牺牲对学习分布的保真度为代价过度优化奖励。先前的工作将其归因于神经奖励函数的复杂性或扩散训练中的隐式偏差,但其根本起源仍知之甚少。我们表明,奖励黑客行为源于大多数实际奖励引导扩散实现中的一个近似——Doob h函数的有限粒子插件估计——即使在最简单的高斯和高斯混合目标以及二次奖励的非平凡设置中也是如此。在闭式中,我们分离了插件估计器的两种不同失效模式:它导致每个模式内的奖励黑客行为,并且无法选择高奖励模式。我们提出了一种闭式奖励阻尼调度,无需额外计算即可纠正模式内偏差,并阐明了最佳-n采样在补偿模式选择失败中的作用。在高斯混合目标、2D棋盘和FLUX.1文本到图像生成上的实验证实了我们的理论见解适用于实际设置。

英文摘要

Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.
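
As a reference point for the Gaussian case, the exact reward-tilted measure is easy to write down. The sketch below assumes a one-dimensional base distribution $N(\mu, \sigma^2)$ tilted by $\exp(r(x))$ with a quadratic reward $r(x) = -\tfrac{\lambda}{2}(x - a)^2$. The parameter names are hypothetical, and it shows only the exact target, not the plug-in estimator the paper analyzes.

```python
def tilt_gaussian(mu, sigma2, target, lam):
    """Tilt N(mu, sigma2) by exp(r(x)) with r(x) = -(lam / 2) * (x - target) ** 2.

    Completing the square shows the tilted density is again Gaussian, with
    precision 1 / sigma2 + lam and a precision-weighted mean.
    """
    precision = 1.0 / sigma2 + lam
    mean = (mu / sigma2 + lam * target) / precision
    return mean, 1.0 / precision

tilted_mean, tilted_var = tilt_gaussian(mu=0.0, sigma2=1.0, target=2.0, lam=1.0)
```

A guided sampler that ends up with a mean closer to `target`, or a variance smaller than this closed form, has over-optimized the reward relative to the true tilted measure.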

2606.02883 2026-06-03 cs.HC cs.AI cs.CY cs.IR

LLM-Assisted Reranking to Operationalize Nuanced Objectives in Recommender Systems

LLM辅助重排序以在推荐系统中实现细微目标

Amir Ghasemian, Homa Hosseinmardi, Upasana Dutta, Duncan J. Watts

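AI summary: This study reranks YouTube sidebar candidates with zero-shot instruction prompting and finds that unconstrained LLM-assisted reranking amplifies extreme and conspiratorial content, while lightweight prompt-level regularization reduces extreme content and increases ideological diversity at a modest cost in relevance.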

详情
Comments
30 pages total; 11 pages, 5 figures, 2 tables (main text); 19 pages, 11 figures, 9 tables (appendix)
AI中文摘要

推荐系统已从内容组织工具发展为塑造日常行为的复杂系统。通过控制我们所看到的内容,它们塑造了我们的感知,引发了对过滤气泡、激进化、两极分化和社会不平等的担忧。大型语言模型(LLM)实现了更强大的个性化,加剧了这些动态。然而,大多数推荐系统针对参与度或有限的准确性指标进行调优,很少关注更广泛的社会影响,例如个性化如何重塑社会重要领域中的曝光度。我们研究了LLM辅助重排序在提高个性化的同时,是否无意中放大了对意识形态极端或阴谋论政治内容的曝光,这是一种在新闻推荐中理论上存在但尚未得到实证表征的风险。使用真实的新闻消费历史,我们通过零样本、基于指令的提示对YouTube侧边栏候选进行重排序。我们比较了基线提示与一个约束变体,该变体保持主题相关性并扩大意识形态曝光,同时减少阴谋论或极端内容。在没有约束的情况下,重排序加强了个性化,但增加了对历史中包含此类内容的用户的阴谋论和极端主义材料的曝光。轻量级提示级正则化减少了对极端内容的推广并增加了意识形态多样性,同时相关性损失较小。合成实验表明,LLM通过语言中的统计规律而非对意识形态的语义理解进行重排序,这解释了为什么朴素提示会放大这些模式,而正则化可以重塑它们。总之,我们的结果突显了LLM在高风险推荐中实现上下文细微差别的能力,以及评估LLM辅助个性化超越准确性并将提示设计视为有价值负载而非中性默认的必要性。

英文摘要

Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we see, they shape what we perceive, raising concerns about filter bubbles, radicalization, polarization, and social inequality. Large language models (LLMs) enable more powerful personalization, intensifying these dynamics. Yet most recommenders are tuned for engagement or limited accuracy metrics, with little attention to broader social implications, e.g., how personalization reshapes exposure in socially consequential domains. We investigate whether LLM-assisted reranking, while improving personalization, inadvertently amplifies exposure to ideologically extreme or conspiratorial political content, a risk theorized but not empirically characterized in news recommendation. Using real news-consumption histories, we rerank YouTube's sidebar candidates through zero-shot, instruction-based prompting. We compare a baseline prompt with a constrained variant that preserves topical relevance and broadens ideological exposure while reducing conspiratorial or extreme content. Without constraints, reranking strengthened personalization but increased exposure to conspiratorial and extremist material for users whose histories contained such content. Lightweight prompt-level regularization reduced promotion of extreme content and increased ideological diversity, with modest relevance loss. Synthetic experiments suggest that LLMs rerank via statistical regularities in language rather than semantic understanding of ideology, clarifying why naive prompts amplify these patterns and why regularization can reshape them. Together, our results highlight the power of LLMs to operationalize contextual nuance in high-stakes recommendation, and the need to evaluate LLM-assisted personalization beyond accuracy and to treat prompt design as value-laden rather than as a neutral default.

2606.02879 2026-06-03 cs.RO

Direct Informed Sampling on Riemannian Manifolds via Loewner Order Lower Bounds

基于Loewner序下界的黎曼流形直接知情采样

Phone Thiha Kyaw, Jonathan Kelly

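AI summary: Proposes a matrix-valued admissible heuristic that uses the Loewner order to compute the tightest constant lower bound on the metric tensor, mapping the Riemannian informed set to a standard prolate hyperspheroid in an isotropic Euclidean space. This enables direct, rejection-free sampling and accelerates the convergence of several optimal motion planners.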

详情
Comments
Submitted to IEEE Robotics and Automation Letters (RA-L)
AI中文摘要

知情采样技术通过将搜索聚焦于状态空间的有希望区域来加速基于采样的运动规划器,然而大多数现有方法依赖于欧氏启发式,这些启发式在依赖于构型的黎曼度量下变得不可容许。虽然标量特征值下界通过均匀缩放欧氏距离恢复了可容许性,但它们丢弃了度量的方向结构,产生过于保守的知情集。我们提出一种矩阵值可容许启发式,利用对称正定矩阵上的Loewner序计算度量张量最紧的常数下界,同时保留其完整的方向结构。该下界的Cholesky分解定义了一个到各向同性欧氏空间的线性映射,在该空间中黎曼知情集简化为标准的长球超椭球,从而能够使用现有算法进行直接无拒绝采样。在6自由度UR5、7自由度Franka和14自由度PR2上三种不同黎曼度量下的操作任务实验表明,我们的启发式产生的知情集始终比欧氏和标量特征值下界更紧,加速了多种最先进渐近最优规划器的收敛。

英文摘要

Informed sampling techniques accelerate sampling-based motion planners by focusing the search on promising regions of the state space, yet most existing methods rely on Euclidean heuristics that become inadmissible under configuration-dependent Riemannian metrics. While scalar eigenvalue bounds restore admissibility by uniformly scaling the Euclidean distance, they discard the directional structure of the metric, producing overly conservative informed sets. We propose a matrix-valued admissible heuristic that exploits the Loewner order on symmetric positive definite matrices to compute the tightest constant lower bound on the metric tensor while preserving its full directional structure. The Cholesky factorization of this bound defines a linear map to an isotropic Euclidean space in which the Riemannian informed set reduces to a standard prolate hyperspheroid, enabling direct, rejection-free sampling using existing algorithms. Experiments on manipulation tasks with a 6-DoF UR5, 7-DoF Franka, and 14-DoF PR2 under three distinct Riemannian metrics show that our heuristic produces consistently tighter informed sets than both the Euclidean and scalar eigenvalue bounds, accelerating convergence across multiple state-of-the-art asymptotically optimal planners.
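
The role of the Cholesky factor can be illustrated with a small numerical sketch. It assumes a constant symmetric positive definite lower bound `g_low` is already available (how the paper computes the tightest such bound is not described in the abstract) and shows only the Loewner-order check on sampled metrics and the linear map to isotropic coordinates. All names and matrices are hypothetical.

```python
import numpy as np

def is_loewner_lower_bound(g_low, metrics, tol=1e-9):
    """True if G - g_low is positive semidefinite for every sampled metric G."""
    return all(np.linalg.eigvalsh(g - g_low).min() >= -tol for g in metrics)

def to_isotropic(g_low, x):
    """Map x to y = L^T x, where g_low = L L^T, so that x^T g_low x = ||y||^2."""
    chol = np.linalg.cholesky(g_low)  # lower-triangular factor L
    return chol.T @ x

def heuristic_cost(g_low, a, b):
    """Lower bound on the Riemannian path length between a and b."""
    return float(np.linalg.norm(to_isotropic(g_low, b - a)))

def in_informed_set(g_low, start, goal, x, c_best):
    """Membership test; in the mapped coordinates this set is a prolate hyperspheroid."""
    return heuristic_cost(g_low, start, x) + heuristic_cost(g_low, x, goal) <= c_best

g_low = np.array([[2.0, 0.5], [0.5, 1.0]])
sampled_metrics = [g_low + np.diag([1.0, 0.0]), g_low + np.array([[0.5, 0.2], [0.2, 0.5]])]
start, goal = np.zeros(2), np.array([1.0, -2.0])
bound_ok = is_loewner_lower_bound(g_low, sampled_metrics)
cost = heuristic_cost(g_low, start, goal)
```

Because the set becomes a standard prolate hyperspheroid after the map, existing Euclidean informed samplers can draw `y` directly and recover `x` by solving the triangular system, with no rejection step.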

2606.02877 2026-06-03 cs.CV

Pathway-Structured Privileged Distillation for Deployable Computational Pathology

面向可部署计算病理学的通路结构特权蒸馏

Yongxin Guo, Hao Lu, Onur Koyun, Zhengjie Zhu, Muhammet Demir, Metin Gurcan

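AI summary: Proposes the MoPE framework, which uses pathway-indexed pathology experts and memory-usage alignment to recast multimodal learning as privileged distillation for histology-only inference, improving whole-slide image inference performance.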

详情
AI中文摘要

整合转录组学和组织病理学可以改善癌症风险建模,但在常规环境中RNA分析的有限可用性限制了其实用性。本文引入了通路专家混合(MoPE),这是一个知识蒸馏框架,将多模态学习重新定义为仅组织学推理的特权蒸馏。MoPE的动机来自RNA谱和全切片图像之间的部分可观测性:组织学可以捕获某些分子程序相关的形态学后果,但不能期望重建完整的转录组状态。MoPE编码RNA衍生的通路,并通过记忆使用对齐将分子监督转移到通路索引的病理专家。在各种公共基准测试和两个独立的乳腺癌队列中,与基线方法相比,MoPE持续改善了仅WSI推理性能。通路使用分析和人工审核的视觉检查提供了模型行为和候选形态学相关读数的有限检查。这些结果支持通路结构特权蒸馏作为在训练期间利用分子信息同时保持无RNA推理的有前途的途径。

英文摘要

Integrating transcriptomics and histopathology can improve cancer risk modelling, yet practical use is constrained by the limited availability of RNA profiling in routine settings. Here we introduce Mixture of Pathway Experts (MoPE), a knowledge-distillation framework that reframes multimodal learning as privileged distillation for histology-only inference. MoPE is motivated by the partial observability between RNA profiles and whole-slide images: histology can capture morphology-linked consequences of certain molecular programmes, but cannot be expected to reconstruct the full transcriptomic state. MoPE encodes RNA-derived pathways and transfers the molecular supervision to pathway-indexed pathology experts through memory-usage alignment. Across diverse public benchmarks and two independent breast cancer cohorts, MoPE consistently improved WSI-only inference performance relative to baseline methods. Pathway-usage analyses and human-audited visual inspection provide bounded inspection of model behaviour and candidate morphology-linked readouts. These results support pathway-structured privileged distillation as a promising route to using molecular information during training while preserving RNA-free inference.

2606.02876 2026-06-03 cs.LG

RRISE: Robust Radius Inference via a Surrogate Estimator

RRISE: 通过代理估计器进行鲁棒半径推断

Jong-Ik Park, Shreyas Chaudhari, Carlee Joe-Wong, José M. F. Moura

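AI summary: Proposes the RRISE framework, which replaces Monte Carlo sampling with a surrogate model for randomized smoothing certification and guarantees conservative radii through one-time conformal calibration, greatly reducing compute cost while maintaining certified accuracy.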

详情
AI中文摘要

随机平滑(RS)使用平滑分类器提供与架构无关的$\ell_2$分类鲁棒性认证,但其对每个输入的蒙特卡洛(MC)采样的依赖限制了其在实时系统中的应用。我们认为这种代价是结构性的而非根本性的,因此可以通过在部署流中共享信息来显著降低。我们引入RRISE,一个RS框架,将认证压缩为通过学习的代理进行的单次前向传播。RRISE通过软标签交叉熵损失针对预计算的MC类计数目标训练代理,并通过一次性共形校准步骤将代理预测转换为可证明保守的认证半径。得到的证书是可部署验证的:每当校准半径为正值时,代理的预测可证明与平滑分类器的预测一致,且平滑分类器在输入周围该半径的球内是常数。在图像分类基准测试中,RRISE在固定预算MC认证精度上相差0.84个百分点以内,同时将每次查询最多10^4次噪声基础模型评估替换为单次代理前向传播,在约10^5次部署查询后即可收回MC训练成本。在CIFAR-100和Tiny ImageNet上,先前唯一的离线代理方法失效,而RRISE实现了1.23到1.91倍的认证精度提升,确立了高效随机平滑作为重复部署场景中认证鲁棒性的实用路径。

英文摘要

Randomized smoothing (RS) uses a smoothed classifier to provide architecture-agnostic certificates of $\ell_2$ classification robustness, but its dependence on per-input Monte Carlo (MC) sampling undermines its use in real-time systems. We argue that this cost is structural rather than fundamental, such that it can be significantly reduced by sharing information across the deployment stream. We introduce RRISE, an RS framework that compresses certification into a single forward pass through a learned surrogate. RRISE trains the surrogate against precomputed MC class-count targets via a soft-label cross-entropy loss and converts surrogate predictions into provably conservative certified radii through a one-time conformal calibration step. The resulting certificate is deployment-verifiable: whenever the calibrated radius is positive, the surrogate's prediction provably matches the smoothed classifier's and the smoothed classifier is constant on a ball of that radius around the input. Across image classification benchmarks, RRISE matches fixed-budget MC certified accuracy within $0.84$ percentage points while replacing up to $10^4$ noisy base-model evaluations per query with a single surrogate forward pass, recouping MC training cost after $\approx 10^5$ deployment queries. On CIFAR-100 and Tiny ImageNet, where the only prior offline-surrogate method collapses, RRISE achieves $1.23$ to $1.91\times$ higher certified accuracy, establishing efficient randomized smoothing as a practical path to certified robustness in repeated-deployment settings.
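
For context, the Monte Carlo baseline that the surrogate replaces certifies an $\ell_2$ radius from a lower confidence bound on the top-class probability under Gaussian noise. The sketch below uses the standard randomized smoothing bound $R = \sigma\,\Phi^{-1}(\underline{p_A})$. It takes the lower bound as an input and does not reproduce RRISE's surrogate or its conformal calibration. The function name is hypothetical.

```python
from statistics import NormalDist

def certified_radius(p_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_lower) for a Gaussian-smoothed classifier.

    p_lower is a lower confidence bound on the probability of the top class
    under noise N(0, sigma^2 I). Returns 0.0 (abstain) when p_lower <= 0.5.
    """
    if p_lower <= 0.5:
        return 0.0
    if p_lower >= 1.0:
        raise ValueError("p_lower must be strictly below 1 for a finite radius")
    return sigma * NormalDist().inv_cdf(p_lower)

radius = certified_radius(p_lower=0.95, sigma=0.5)
```

Estimating `p_lower` tightly is what requires many noisy forward passes per input, which is the per-query cost the abstract describes replacing with a single surrogate pass.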

2606.02875 2026-06-03 cs.AI

Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

交接债务:当编码代理接管被中断任务时的重新发现成本

Dipesh KC, Anjila Budathoki

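AI summary: Introduces the concept of "handoff debt" to study the rediscovery cost that coding agents incur when resuming from a partial state after a task is interrupted, and proposes a takeover protocol to quantify how different handoff views affect successor-agent efficiency.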

详情
AI中文摘要

编码代理基准测试评估单个不间断代理能否解决仓库问题。实际软件工作更为复杂:任务会被中断、重新分配、审查,并从另一个代理或工程师留下的部分状态恢复。我们通过“交接债务”研究这一缺失维度:即前任工作不透明或不完整时施加的重新发现成本。我们的接管协议在确定性交接点中断编码代理,冻结仓库,并在四种交接视图下评估后继代理:仅仓库状态、原始轨迹、摘要笔记和结构化笔记。在75个源任务中,该协议为每个后继模型生成181个交接点任务和724次接管运行。在三个后继模型中,相对于仅仓库接管,带有上下文的交接将中位代理事件减少20-59%,累积提示令牌减少42-63%。解决率的影响较小且依赖于模型,但效率提升是一致的。这些发现表明,编码代理评估不仅应报告任务是否解决,还应报告另一个代理恢复该工作的成本。

英文摘要

Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through "handoff debt": the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff-point tasks and 724 takeover runs per successor model. Across three successor models, context-bearing handoffs reduce median agent events by 20-59% and cumulative prompt tokens by 42-63% relative to repository-only takeover. Solved-rate effects are smaller and model-dependent, but efficiency gains are consistent. These findings suggest that coding-agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.

2606.02872 2026-06-03 eess.SY cs.MA cs.RO cs.SY

Terminal Time and Angle-Constrained Nonlinear Intercept Guidance

终端时间和角度约束的非线性拦截制导

Shivam Bajpai, Abhinav Sinha

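AI summary: For the underactuated nonlinear intercept problem with a single control input, proposes a hierarchical sliding-mode guidance law that simultaneously controls terminal time and angle, and extends it to intercepting constant-velocity targets.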

详情
AI中文摘要

本文考虑使用横向加速度作为唯一控制输入,同时控制拦截器的撞击时间和撞击角度的问题。由于单一控制输入,非线性交战运动学本质上是欠驱动的,这使得制导律综合变得复杂。为了克服这一挑战,开发了一种基于分层滑模的制导律,以同时调节两个终端约束。所提出的架构包括一个两层滑模流形。第一层由分别对应撞击时间和撞击角度误差动力学的两个子滑模面组成,而第二层引入了一个组合两个单独子滑模面的复合滑模流形。然后,设计了一种变增益自适应制导律,以确保对静止目标的带时间和角度约束的拦截,并进一步扩展至拦截常速目标。针对各种交战场景进行了仿真,以证明所提出方法的有效性。

英文摘要

This paper considers the problem of simultaneously controlling an interceptor's impact time and impact angle using its lateral acceleration as the sole control input. With a single control input, the nonlinear engagement kinematics are inherently underactuated, which complicates guidance law synthesis. To overcome this challenge, a hierarchical sliding-mode-based guidance law is developed to concurrently regulate the two terminal constraints. The proposed architecture consists of a two-layer sliding manifold. The first layer comprises two sub-sliding surfaces corresponding to the impact time and impact angle error dynamics, respectively, while the second layer introduces a composite sliding manifold that combines the two individual sub-surfaces. Then, a variable-gain adaptive guidance law is designed to ensure time- and angle-constrained interception against a stationary target, which is further extended to intercept a constant-velocity target. Simulations are conducted for various engagement scenarios to attest to the efficacy of the proposed approach.

2606.02871 2026-06-03 cs.CL cs.AI

Adaptive Latent Agentic Reasoning

自适应潜在智能推理

Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen

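AI summary: Proposes ALAR, a dual-mode framework that uses compact latent reasoning for routine decision steps and switches to explicit chain-of-thought only when deeper deliberation is needed, substantially reducing generated tokens while maintaining or improving task accuracy.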

详情
AI中文摘要

大型推理模型通过生成扩展的思维链推理来提升性能,但当应用于LLM智能体时,这种行为变得低效。当前的LLM智能体通常在每一步决策中生成冗长的文本推理,并在各轮次中几乎均匀地分配推理努力,导致多轮智能体轨迹中的严重低效。我们提出自适应潜在智能推理(ALAR),一种双模式框架,在常规轮次中使用紧凑的潜在推理,并在需要更深思熟虑时选择性地升级为显式思维链。ALAR通过使用智能体的动作作为监督锚点来学习潜在推理,并进一步优化以在潜在推理足以完成任务时使用它,保留显式CoT用于更困难的决策。在智能体搜索和工具使用基准上的实验表明,ALAR在保持相当或更好任务准确率的同时,显著减少了生成的令牌数,在搜索中最多减少43.6%,在工具使用中最多减少84.6%。这些结果表明,ALAR通过减少不必要的文本推理,同时保留显式思考用于更困难的决策步骤,改善了LLM智能体的准确率-效率权衡。

英文摘要

Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

2606.02867 2026-06-03 cs.MA cs.AI q-bio.PE

The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models

Epi-LLM框架:通过流行病学基于智能体的模型探究LLM行为先验

Petra Ferenz, Ava Keeling, Tobias O'Keefe, Lorenzo Stigliano, Francesco Di Lauro, Andres Colubri, Jasmina Panovska-Griffiths

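AI summary: Proposes the Epi-LLM framework, which integrates agent-based modelling, real-life epigames, and large language models to simulate agent behaviour during outbreaks. LLM agents reduced peak infections, perceived health severity was the strongest predictor of quarantine behaviour, and LLM architecture affected epidemic dynamics.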

详情
Comments
Submitted to American Journal of Epidemiology
AI中文摘要

流行病期间的人类行为会影响传染病动态,但量化这一点仍然极具挑战性。本文介绍了Epi-LLM框架:一种新颖的集成方法,结合了基于智能体的建模、真实流行病游戏和大语言模型(LLM),其中合成智能体社会在疫情接触网络上进行推理并动态适应。将合成智能体行为与无干预的SEIR基线和来自AUIB流行病游戏研究的人类参与者数据进行比较,我们发现四种不同架构的LLM智能体减少了峰值活跃感染,在15天模拟的第6天,隔离合规率达到58-65%的峰值。二项广义线性模型显示,感知健康严重性是隔离行为的最强预测因子(β = 0.33, p = 0.002),伪R²为0.055,与人类试验中观察到的0.072相当。LLM架构是疫情动态的关键决定因素:低方差架构为测试行为规则提供了更高的内部效度,而高方差模型可能更好地代表现实世界中的决策。仅凭地理标签无法诱导文化差异化的行为;需要明确的态度参数化。这项原理验证工作为将Epi-LLM框架部署为可扩展、无风险的模拟环境用于大流行准备研究奠定了基础。

英文摘要

Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce the Epi-LLM framework: a novel integration of agent-based modelling, real-life epigames, and large language models (LLMs) in which a synthetic society of agents reasons and adapts dynamically over an outbreak contact network. Comparing synthetic agent behaviour against a no-intervention SEIR baseline and human participant data from the AUIB epigame study, we find that LLM agents across four different architectures reduced peak active infections, with quarantine compliance peaking at 58-65% on day six of the 15-day simulation. A binomial generalised linear model showed that perceived health severity was the strongest predictor of quarantine behaviour ($\beta = 0.33$, $p = 0.002$), yielding a pseudo-$R^2$ of 0.055, comparable to the 0.072 observed in the human trial. LLM architecture is a key determinant of epidemic dynamics: low-variance architectures offer greater internal validity for testing behavioural rules, while high-variance models may better represent real-world decision-making. Geographic labels alone do not induce culturally differentiated behaviour; explicit attitudinal parameterisation is required. This proof-of-principle work lays the groundwork for deploying the Epi-LLM framework as a scalable, risk-free simulation environment for pandemic preparedness research.

2606.02866 2026-06-03 cs.AI cs.CL cs.MA

When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

当帮助有害时以及如何修复:多智能体辩论用于数据清洗

Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, Shweta Medhekar

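AI summary: Studies the effect of multi-agent debate on data cleaning and finds that it degrades generation performance but improves error detection. By deriving a debate benefit condition and using a debate configuration with adversarial separation, it significantly exceeds a single agent on a generative task for the first time.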

详情
Comments
27 pages, 4 figures, 12 tables. Includes appendix with full experimental results, prompt templates, and dataset statistics
AI中文摘要

多智能体辩论何时有助于数据清洗,何时有害?在三个基准、四个模型家族和超过6000个任务-条件对中,我们发现辩论的效果会反转:通过批评引发的混淆(CIC),即生成器不加批判地接受幻觉性的批评反馈,辩论在所有四个模型上降低了生成性能(-1.6至-15.5个百分点),但提升了错误检测(F1提高27.4个百分点,d=1.0)。我们推导出一个辩论收益条件:当挽救错误输出的概率(由可修复性加权的批评者验证几率)超过破坏正确输出的概率时,辩论有帮助。一个析因实验证明对抗性分离至关重要:使用相同工具的自我验证失败,而一个独立的批评者,结合代码执行基础和证据门控生成,产生了第一个在生成任务上显著超过单智能体的辩论配置(+5.3个百分点,p<0.05)。该条件正确预测了所有九种任务类型,并在七个领域的19个已发表比较中实现了零假阳性泛化。

英文摘要

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), in which the Generator uncritically accepts hallucinated Critic feedback, yet it improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
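
One simple way to read the debate benefit condition is as an expected-accuracy calculation. The sketch below is an assumption-laden simplification, not the paper's derivation: it treats the base accuracy, the chance of rescuing a wrong output, and the chance of destroying a correct output as three independent numbers, and it compares the joint probabilities of the two events. All names and values are hypothetical.

```python
def debate_gain(base_accuracy, p_rescue, p_destroy):
    """Expected change in accuracy from adding a debate round.

    base_accuracy: probability the single-agent output is correct.
    p_rescue: probability a wrong output is fixed by the debate.
    p_destroy: probability a correct output is broken by the debate.
    """
    rescued = (1.0 - base_accuracy) * p_rescue
    destroyed = base_accuracy * p_destroy
    return rescued - destroyed

def debate_helps(base_accuracy, p_rescue, p_destroy):
    """Assumed form of the condition: rescued mass exceeds destroyed mass."""
    return debate_gain(base_accuracy, p_rescue, p_destroy) > 0.0

strong_generator = debate_gain(base_accuracy=0.8, p_rescue=0.3, p_destroy=0.1)
weak_generator = debate_gain(base_accuracy=0.5, p_rescue=0.3, p_destroy=0.1)
```

Under these assumptions the same Critic hurts the strong generator and helps the weak one, because there are more correct outputs to destroy and fewer wrong ones to rescue when base accuracy is high.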

2606.02863 2026-06-03 cs.AI

Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

不要赌博,GAMBLe:AI驱动研究系统的分析框架

Marquita Ellis, Paul Castro

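AI summary: Proposes the GAMBLe framework, which decomposes the behavior of AI-driven research systems into four parameters (generator G, assessor A, discovery mechanism M, budget B) and an effective landscape L_eff = A ∘ G. Experiments show that component choices can substantially improve performance and search efficiency.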

详情
Comments
Preprint. 21 pages (10 main, 11 appendix). 6 figures (2 in main, 4 in appendix)
AI中文摘要

AI驱动研究系统(ADRS)——将LLM与自动评估相结合以发现算法、证明和设计的系统——正在跨领域优化和采用,但分析它们的工具尚未跟上。ADRS性能取决于组件交互,这些交互难以理解、探索成本高,并且(如我们所示)标准收敛保证无法很好地捕捉。这些保证依赖于结构假设,而这些假设在我们形式化的ADRS过程中不成立。我们引入GAMBLe,一个将ADRS行为分解为四个参数(生成器$G$、评估器$\mathcal{A}$、发现机制$\mathcal{M}$、预算$B$)和一个组合对象——有效景观$L_{\text{eff}} = \mathcal{A} \circ G$的框架,该框架揭示了不同的生成器-评估器对在每个问题上诱导出结构不同的优化景观。我们在760多次重复运行(>46,000次迭代)上应用该框架,涵盖从单个LLM到动态自适应集成等生成器、从贪婪选择到协同进化元搜索等机制,以及三个NP难问题(其评估器范围从连续评分到悬崖函数)。实验表明,生成器或机制没有完全排序:前沿模型可能不如开源替代品,最简单的机制有时优于最先进的元搜索。结果显示,即使在有限预算下(每次运行60次迭代),正确的组件选择可以将性能提高13-67%,搜索效率提高6-39倍。

英文摘要

AI-Driven Research Systems (ADRS) -- systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs -- are being optimized and adopted across domains, but the tools to analyze them have not kept pace. ADRS performance depends on component interactions that are poorly understood, expensive to explore, and (as we show) not well captured by standard convergence guarantees. These guarantees rely on structural assumptions that do not hold under the ADRS process we formalize. We introduce GAMBLe, a framework that decomposes ADRS behavior into four parameters (generator $G$, assessor $\mathcal{A}$, discovery mechanism $\mathcal{M}$, budget $B$) and one compositional object, the effective landscape $L_{\text{eff}} = \mathcal{A} \circ G$, which reveals that distinct generator-assessor pairs induce structurally different per-problem optimization landscapes. We exercise the framework on 760+ replicated runs (>46,000 iterations) spanning generators from single LLMs to dynamically-adaptive ensembles, mechanisms from greedy selection to co-evolutionary meta-search, and three NP-hard problems whose assessors range from continuous scoring to cliff functions. The experiments reveal no total ordering of generators or mechanisms: frontier models can underperform open-source alternatives and the simplest mechanism sometimes outperforms state-of-the-art meta-search. Results show that even under limited budgets (60 iterations per run), the right component choices can improve performance by 13-67% and search efficiency by 6-39x.