arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04262 2026-06-04 cs.CL cs.AI

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

Maroof Kousar, Yibo Hu

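AI summary: Introduces DOSEBENCH, a benchmark that evaluates how well large language models handle temporal reasoning, constraint following, and uncertainty in over-the-counter (OTC) dosing QA.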

Comments: 16 pages, 7 figures

Abstract

Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common, safety-relevant setting, in which a correct answer requires tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories, remains underexplored in existing medical QA evaluations. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios covering adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases, and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
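
The rolling 24-hour computation named in the abstract can be sketched as follows. This is a minimal illustration, not DOSEBENCH's evaluation code: the function name `can_take_dose`, the reason strings, and the numeric limits in the usage note are hypothetical placeholders rather than product-label values, and treating a dose taken exactly 24 hours ago as outside the window is an assumption.

```python
from datetime import datetime, timedelta

def can_take_dose(history, now, dose_mg, max_mg_per_24h, min_interval_h):
    """Decide whether one more dose fits the caller-supplied limits.

    history: list of (datetime or None, mg) tuples; None marks an unknown time.
    All limits are parameters; nothing here encodes a real product label.
    """
    # An unknown dose time makes the rolling total uncomputable, so abstain.
    if any(t is None for t, _ in history):
        return False, "incomplete history"

    # Rolling window: doses taken in the 24 hours before `now`
    # (assumption: a dose exactly 24 h old has left the window).
    window_start = now - timedelta(hours=24)
    recent = [(t, mg) for t, mg in history if window_start < t <= now]

    # Minimum spacing since the most recent dose.
    if recent:
        last = max(t for t, _ in recent)
        if now - last < timedelta(hours=min_interval_h):
            return False, "too soon since last dose"

    # Cumulative intake over the window, including the proposed dose.
    if sum(mg for _, mg in recent) + dose_mg > max_mg_per_24h:
        return False, "rolling 24-hour limit exceeded"
    return True, "within limits"
```

With placeholder limits of 3000 mg per 24 hours and a 6-hour spacing, three 1000 mg doses at 21:00 the previous evening, 06:00, and 13:00 block a fourth dose at 20:00 (the 21:00 dose is still inside the window) but allow it at 21:30. A calendar-day tally would miss the first case.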

2606.04261 2026-06-04 cs.AI cs.CL cs.CV cs.ET cs.LG

Can Generalist Agents Automate Data Curation?

Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

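AI summary: Introduces Curation-Bench, a benchmark in which generalist coding agents automate the data-curation loop. Out-of-the-box agents reach strong baselines but show an execution-research gap, while agents guided by method-adaptation scaffolds autonomously compose a data-selection policy that outperforms strong baselines at one-tenth of the data budget.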

Comments: Preprint

Abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2606.04251 2026-06-04 cs.CV

SBP-Net: Learning Thin Structure Reconstruction with Sliding-Box Projections

Ofir Gilad, Andrei Sharf

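AI summary: To address the sparsity, scale variation, and complex geometry that make thin 3D structures hard to reconstruct, proposes SBP-Net, a method based on local depth projections: a sliding box generates local orthographic depth projections, a neural network reconstructs the missing thin structures, and the results are fused back into the 3D model. It outperforms existing methods on pulmonary artery and industrial pipeline reconstruction.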

Comments: Accepted to IEEE ICIP 2026, 6 pages, 4 figures

Abstract

Reconstructing thin 3D structures is challenging due to their sparsity, scale variation, and complex geometry. Such structures arise in a wide range of domains, including medical imaging of vascular systems and industrial pipe systems. While recent neural methods perform well on dense surfaces, they often fail to recover fine thin geometries. We propose a reconstruction approach based on local depth projections, which provide an efficient and informative 2D representation of thin structures. Specifically, we traverse the 3D model with a sliding box to generate local orthographic depth projections, which are processed by a neural network to reconstruct missing thin structures in 2D. The local reconstructions are subsequently fused back into the 3D model to produce a coherent and detailed shape. Experiments on pulmonary artery reconstruction from CT volumes and industrial pipeline recovery from synthetic and real scans demonstrate improved preservation of fine structural details over existing methods.
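
The sliding-box projection step described above might look roughly like this for a point-cloud input. It is a sketch under assumptions, not SBP-Net's implementation: the abstract does not give the box size, stride, projection axis, or input format, so the cubic axis-aligned box, the +z viewing direction, the nearest-point depth rule, and all function names here are illustrative.

```python
from itertools import product

def axis_starts(lo, hi, stride):
    """Box start coordinates along one axis, stepping by `stride` from lo up to hi."""
    starts, s = [], lo
    while s < hi:
        starts.append(s)
        s += stride
    return starts

def sliding_box_origins(lo, hi, stride):
    """Minimum corners of the boxes that sweep the volume between lo and hi."""
    return list(product(*(axis_starts(a, b, stride) for a, b in zip(lo, hi))))

def box_depth_projection(points, box_min, box_size, resolution):
    """Orthographic depth map (viewed along +z) of the points inside one cubic box.

    Each cell stores the smallest depth of any point that lands in it,
    or None where the box is empty at that pixel.
    """
    x0, y0, z0 = box_min
    cell = box_size / resolution
    depth = [[None] * resolution for _ in range(resolution)]
    for x, y, z in points:
        inside = (x0 <= x < x0 + box_size and
                  y0 <= y < y0 + box_size and
                  z0 <= z < z0 + box_size)
        if not inside:
            continue
        # Clamp guards against floating-point rounding at the far face.
        col = min(resolution - 1, int((x - x0) / cell))
        row = min(resolution - 1, int((y - y0) / cell))
        d = z - z0
        if depth[row][col] is None or d < depth[row][col]:
            depth[row][col] = d
    return depth
```

Each depth map would then be passed to a 2D network, and the completed maps lifted back into the box's coordinates for fusion; those stages are not sketched here.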

2606.04249 2026-06-04 cs.CV eess.IV

Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement

Lixuan Chen, Zhongnan Liu, Jesse Hamilton, James M. Balter, Jeong Joon Park, Liyue Shen

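AI summary: Proposes the PDMR framework, which learns a low-dimensional latent manifold of motion fields offline and uses a tri-plane representation for efficient encoding, enabling high-fidelity, temporally consistent prospective dynamic 3D MRI reconstruction from a single measurement.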

Abstract

Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.

2606.04248 2026-06-04 cs.RO

RSC: Decentralized Rigid Formation Flocking for Large-Scale Swarms via Hybrid Predictive Control and Online Reconfiguration

Ganyu Zou, Linhan Wang, Chen Dai, Siji Chen, Chang-Tien Lu

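AI summary: Proposes RSC, a decentralized control framework that combines finite-horizon trajectory prediction with a reactive artificial potential field safety controller and adds an online leader-follower reconfiguration mechanism; in cluttered environments with 25 UAVs it achieves an 83% success rate on formation maintenance, obstacle avoidance, and target tracking.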

Comments: 8 pages, 4 figures, two-column format

Abstract

Decentralized rigid formation flocking requires a swarm of autonomous agents to maintain a predetermined geometric configuration while moving, relying solely on local sensing and communication. However, existing decentralized control methods struggle to maintain strict inter-agent distance constraints in cluttered environments, often suffering from local-minima deadlocks, high-frequency control oscillations, or limited flexibility during obstacle navigation, resulting in low success rates. To address these limitations, we propose Rigid Swarm Control (RSC), a decentralized control framework for large-scale rigid formation flocking. To escape local minima via robust long-term planning while ensuring short-term safety, RSC integrates finite-horizon trajectory predictions with a reactive artificial potential field (APF) safety controller within a hybrid architecture. Furthermore, to accelerate formation reassembly after obstacle traversal without interrupting task execution, RSC introduces an online leader-follower reconfiguration mechanism based on stable role exchange. Extensive evaluations in challenging cluttered environments with 25 UAVs demonstrate that RSC reliably unifies rigid formation maintenance, obstacle avoidance, and target tracking. Under strict success criteria (collision-free operation with a maximum relative edge-length error below 10%), RSC achieves an 83% success rate, significantly outperforming existing heuristic and learning-based baselines, which fall below 5%.
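
The success criterion stated above (no collision, maximum relative edge-length error below 10%) can be checked with a few lines. This is a sketch under assumptions: the abstract does not say whether the error is measured at every time step or only at the end, so checking the whole trajectory, the edge dictionary format, and the function names are illustrative.

```python
import math

def max_relative_edge_error(positions, desired_edges):
    """Largest |actual - desired| / desired over all formation edges.

    positions: {agent_id: (x, y) or (x, y, z)}
    desired_edges: {(i, j): desired_length}
    """
    worst = 0.0
    for (i, j), desired in desired_edges.items():
        actual = math.dist(positions[i], positions[j])
        worst = max(worst, abs(actual - desired) / desired)
    return worst

def run_succeeds(trajectory, desired_edges, collided, tol=0.10):
    """True if the run is collision-free and every snapshot stays under `tol`."""
    if collided:
        return False
    return all(max_relative_edge_error(p, desired_edges) < tol for p in trajectory)
```

For a three-agent formation with unit-length edges, an agent displaced by 5% along one edge still passes, while a 20% displacement fails.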

2606.04246 2026-06-04 cs.AI cs.AR cs.CL

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee

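AI summary: Proposes the StepPRM-RTL framework, which combines stepwise trajectory modeling, a process reward model, and retrieval-augmented fine-tuning, using dense feedback and Monte Carlo Tree Search over reasoning paths to improve the functional correctness and reasoning fidelity of LLM-generated RTL code, with gains of over 10% over prior methods on benchmark datasets.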

Comments: 6 pages, 2 figures, DAC'2026

Abstract

Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

2606.04244 2026-06-04 cs.AI cs.CL cs.CV cs.LG

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

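AI summary: Introduces the VAMPS benchmark, which uses 1,168 bilingual multiple-choice questions to evaluate multimodal large models on mathematical reasoning with plotting tools, and finds that direct analytical solving outperforms tool-assisted visual solving.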

Abstract

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2606.04240 2026-06-04 cs.CV cs.AI cs.CL

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Jingbiao Mei

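AI summary: Describes the design, datasets, evaluation protocol, and final standings of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1), and analyzes the top three winning systems, all of which build on decoder-based multimodal LLM embedders from the Qwen2-VL family.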

Comments: MDR Challenge Report at WWW2025

Abstract

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
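
The ranking metric described above can be illustrated as follows. This is one plausible reading, not the organizers' scoring script: it assumes Recall@k is the fraction of a query's relevant items found in the top k, that the three cutoffs are weighted equally within a task, and that the two tasks are weighted equally; the function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    top = set(ranked_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)

def task_score(queries, ks=(1, 3, 5)):
    """Mean over cutoffs of the per-query mean Recall@k for one task.

    queries: list of (ranked_ids, relevant_ids) pairs.
    """
    per_k = [
        sum(recall_at_k(ranked, relevant, k) for ranked, relevant in queries) / len(queries)
        for k in ks
    ]
    return sum(per_k) / len(per_k)

def challenge_score(mmdocir_queries, m2kr_queries):
    """Macro-average: each task contributes equally, regardless of query count."""
    return (task_score(mmdocir_queries) + task_score(m2kr_queries)) / 2
```

Macro-averaging means a task with few queries counts as much as a task with many, so a system cannot win by excelling on only the larger task.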

2606.04238 2026-06-04 cs.LG cs.AI

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Devleena Das, Rajeev Patwari, Elliott Delaye, Ashish Sirasao

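AI summary: To address the severe accuracy loss that aggressive 2-bit quantization causes in large language models, proposes Recover-LoRA, which combines a selective mixed-precision strategy (only the MLP gate and up layers are quantized to 2 bits) with low-rank adapter training via distillation on synthetic data; on Qwen3-4B, 10k synthetic samples recover 80-95% of accuracy on 9 of 12 benchmarks.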

Abstract

Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA, a lightweight, data-free accuracy recovery method originally developed for general model weight corruption, to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B-20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers 7.5-23.3% TPS improvement over uniform W4 depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA (training low-rank adapters on the quantized layers via logit distillation with synthetic data) to recover accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80-95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
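
The logit-distillation objective mentioned above can be illustrated with a per-token KL divergence between the full-precision (teacher) and quantized-plus-adapter (student) output distributions. This is a sketch under assumptions: the abstract does not state the loss form, so the KL direction, the `temperature` parameter, and the function names are illustrative choices, not the authors' implementation.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position.

    No labels are needed: the teacher's distribution is the target,
    which is why unlabeled synthetic prompts suffice as training input.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

In training, this loss would be averaged over token positions and minimized with respect to the adapter weights only, leaving the quantized base weights frozen.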

2606.04236 2026-06-04 cs.CL cs.AI cs.LG

Supportive Token Revealing for Fast Diffusion Language Model Decoding

Giries Abu Ayoub, Mario Barbara, Lluís Pastor-Pérez, Tanja Bien, Aneesh Barthakur, Alaa Maalouf, Loay Mualem

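AI summary: Proposes AXON, a module that improves the quality-latency trade-off of parallel decoding in diffusion language models by selecting anchor tokens using attention, uncertainty, and confidence signals.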

Abstract

Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism introduces a quality-latency trade-off. Aggressive decoding may commit mutually dependent tokens too early, while conservative decoding requires many denoising steps. Existing methods address this tension by deciding which tokens are safe to reveal using confidence or dependency criteria. However, avoiding unsafe commits does not necessarily make the remaining masked sequence easy to decode, since uncertain tokens may depend on masked tokens, creating a bottleneck for denoising steps. We propose AXON, a training-free module that can be added on top of existing parallel decoding strategies for diffusion language models. Rather than replacing the base decoder, AXON monitors the remaining uncertain masked tokens and intervenes only when their current state suggests that additional context is needed. It then shifts the criterion from which tokens are safest to reveal to which confident reveals would best support later denoising. AXON selects anchors, confident masked tokens that uncertain positions attend to, using attention, uncertainty, and confidence signals. Experiments on reasoning and code-generation benchmarks across multiple diffusion language models show that AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.

2606.04233 2026-06-04 cs.RO

What Are We Actually Benchmarking in Robot Manipulation?

Tianchong Jiang, Xiangshan Tan, Samuel Wheeler, Luzhe Sun, Tewodros W. Ayalew, Matthew Walter

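AI summary: Identifies four benchmark failure modes (shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence), proposes a diagnostic for each, and audits LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0; most of the benchmarks show flaws, and the diagnostic tools are released.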

Comments: 31 pages, 6 figures

Abstract

A robotics benchmark score measures success under one fixed evaluation setup, yet is routinely treated as evidence of general manipulation capability. We identify four failure modes, each of which weakens or invalidates a benchmark's role as a valid proxy for that capability: shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence. We propose one diagnostic per failure mode. We audit LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0 under these diagnostics. LIBERO and CALVIN fail multiple diagnostics. RoboCasa and RoboTwin 2.0 fail fewer, despite appearing far less often in recent progress claims. On LIBERO, a 0.09B probe with no language encoder scores at or near reported SOTA, and most reported gains are not provably statistically significant. On CALVIN, randomizing block poses within the training range drops performance for every tested policy. We release the four diagnostics with reference implementations for authors and reviewers to apply before treating a benchmark score as evidence of progress. Code and artifacts are available at https://ripl.github.io/manipulation_benchmark_audit/.
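
To make the statistical-significance concern concrete, the sketch below computes a Wilson score interval for a success rate measured over a finite number of evaluation episodes. It is a generic illustration, not one of the paper's released diagnostics: the choice of the Wilson interval, the 95% level (`z=1.96`), and the overlap check are assumptions, and the episode counts in the usage note are made up.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

With 50 episodes per policy, 45 successes gives roughly (0.79, 0.96) and 48 successes gives roughly (0.87, 0.99). The intervals overlap, so a 90% versus 96% headline gap at this sample size is weak evidence on its own; overlap is a conservative screen rather than a formal test.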

2606.04231 2026-06-04 cs.CL cs.AI

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

Hanoz Bhathena, Parin Rajesh Jhaveri, Rohan Mittal, Prateek Singh, Aymen Kallala, Rachneet Kaur, Yiqiao Jin, Zhen Zeng, Adwait Ratnaparkhi, Denis Kochedykov

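AI summary: Proposes the MM-BizRAG framework, which combines document structure-aware splitting and layout-aware parsing with a unified LLM-driven artifact transformation pipeline and inference-time multimodal assembly to improve enterprise document QA without fine-tuning, outperforming baselines by up to 32 percentage points on a heterogeneous enterprise dataset and two public benchmarks.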

Comments: Accepted at ACL 2026 (Industry Track)

Abstract

Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without requiring any fine-tuning. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment.

2606.04226 2026-06-04 cs.RO cs.AI

PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification

Charlie Gauthier, Sacha Morin, Liam Paull

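AI summary: Proposes PerceptTwin, an automatic pipeline that builds interactive simulations from the semantic scene representations produced by robot perception and adds an LLM judge that verifies plan correctness and alignment with human preferences, improving plan success by about 39%.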

Comments: Accepted at ICRA 2026 (Vienna); published on arXiv for archival purposes. See also https://percept-twin.github.io/

Abstract

Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, creating a simulation has been onerous, and building a bespoke simulation for every environment a robot might operate in has been infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot's perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of approximately 39% for GPT5, GPT5Mini, and GPT5Nano planners. PerceptTwin also improves human plan verification by up to 18% on average for plans that fail due to unmet skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.

2606.04223 2026-06-04 cs.AI

Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Michał Wawer, Jarosław A. Chudziak

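AI summary: Argues that in value-laden tasks disagreement may reflect normative uncertainty rather than error, and abstracts reasoning traces and decisions into symbolic disagreement states, building a knowledge-representation layer that supports defeasible strategic routing and bridges sub-symbolic LLM deliberation and symbolic knowledge representation.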

Comments: Accepted to LAMAS&SR workshop at FLoC 2026 (KR + ICPL + LICS + CP + FSCD)

Abstract

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
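
The four disagreement states above follow from two binary tests, which the sketch below spells out for a pair of agents. The similarity score is assumed to come from some external comparison of the two reasoning traces, and the `threshold` value and function name are hypothetical; the abstract does not specify how similarity is measured or what the routing rules are.

```python
def disagreement_state(reasoning_similarity, decision_a, decision_b, threshold=0.5):
    """Map one agent pair to a symbolic state.

    reasoning_similarity: score in [0, 1] from an external trace comparison.
    decision_a, decision_b: the agents' binary decisions.
    """
    reasoning = "convergent" if reasoning_similarity >= threshold else "divergent"
    conclusion = "agreement" if decision_a == decision_b else "disagreement"
    return f"{reasoning} {conclusion}"
```

A routing layer could then key its rules on the returned state, for example treating "divergent agreement" differently from "convergent agreement" even though a plain vote would see the same outcome in both.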

2606.04222 2026-06-04 cs.RO

Towards Estimating Normal and Shear Interface Pressures in Prosthetic Sockets via Least Squares and Mechanics Modeling

Axel González Cornejo, Tianhao Yu, Chi Hwan Lee, Edgar Bolívar-Nieto

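AI summary: To address the missing shear component and sensor crosstalk in prosthetic socket interface pressure measurement, proposes a quasi-static spring-mass contact model based on sparse sensing and least squares, and validates it against global force/moment and local pressure data.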

Abstract

Prosthetic socket fitting remains largely manual and iterative, and objective fit metrics are still limited. Part of the challenge is the lack of long-term real-life pressure data at the residual limb-socket interface. Traditional pressure sensors are prone to drift over time, and capture only normal pressures at sparse locations within the socket, missing a critical component for biomechanical analysis: shear. Although some sensors can report both normal and shear interface stresses, these components are often difficult to decouple because of measurement crosstalk. One potential path forward is to develop models that can augment available measurements. This work introduces a testbed to evaluate model performance under sparse pressure sensing using two complementary validation signals: (i) the global wrench (i.e., total forces and moments expressed in an orthonormal frame) transmitted through the socket by an artificial residual limb, and (ii) local interface loads (i.e., decoupled normal and shear pressure components in a right-hand-rule orthogonal frame attached to each instrumented location) measured by sparse sensing clusters, each composed of four capacitance-sensing channels. Rather than presenting full-field pressure estimates, the focus is on an analysis sequence that quantifies how well candidate mechanical models explain both global and local measurements under controlled conditions. A quasi-static spring-mass contact model is evaluated, and its parameters are identified via a two-stage convex least-squares problem. Validation under static loading shows that estimating constant bias terms reduces steady offsets in the wrench channels and improves agreement with local measurements. A Pareto-front sensitivity analysis further illustrates how the trade-off between global and local objectives changes when bias terms are included.
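
The effect of adding a constant bias term to a least-squares fit, as reported above, can be seen in a scalar toy problem. This is only an illustration of the general idea: the paper's two-stage convex problem and spring-mass model are not reproduced here, and the single-gain model, function names, and data are hypothetical.

```python
import math

def fit_gain_only(xs, ys):
    """Least-squares fit of y = k * x (no bias term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_gain_and_bias(xs, ys):
    """Least-squares fit of y = k * x + b with a constant bias b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    k = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return k, mean_y - k * mean_x

def rms_residual(xs, ys, k, b=0.0):
    """Root-mean-square error of the fitted line on the data."""
    return math.sqrt(sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs))
```

When the measurements carry a steady sensor offset, the gain-only fit absorbs it into a distorted gain and leaves a systematic residual, whereas the fit with a bias term recovers both the gain and the offset.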

2606.04221 2026-06-04 cs.SD cs.AR eess.AS

Feasibility of Time-Domain DNN-Based Speech Enhancement on Embedded FPGA for Hearing Aid

基于时域DNN的助听器嵌入式FPGA语音增强可行性研究

Feyisayo Olalere, Umut Altin, Kiki van der Heijden, Marcel van Gerven

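AI summary: Deploys a lightweight SuDoRM-RF++ model on the AMD-Xilinx Kria KV260 and evaluates speech separation and denoising at FP32 and 16-bit fixed-point precision. Data movement is found to be the primary bottleneck; the fixed-point denoising accelerator reaches a first-sample latency of 9.7 ms, meeting the 10 ms clinical threshold.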

详情
Comments
13 pages
AI中文摘要

助听器对延迟和功耗有严格限制,当前基于DNN的语音增强系统在嵌入式硬件上难以满足这些要求。我们通过在AMD-Xilinx Kria KV260上部署轻量级SuDoRM-RF++架构进行语音分离和降噪,对每个任务评估了FP32和16位定点精度。在这些配置中,首样本延迟与片上参数缓存相关而非算术吞吐量,表明数据移动是主要瓶颈。精度降低使模型内存占用减半而不损害客观语音质量。定点降噪加速器达到9.7毫秒的首样本延迟,满足10毫秒的临床阈值,而语音分离达到16.0毫秒。这些测量结果为嵌入式DNN语音增强建立了具体的资源需求,并量化了与助听器部署之间的剩余差距。

英文摘要

Hearing aids impose strict latency and power constraints that current DNN-based speech enhancement systems struggle to meet on embedded hardware. We characterize this gap by deploying both speech separation and denoising using the lightweight SuDoRM-RF++ architecture on the AMD-Xilinx Kria KV260, evaluated at FP32 and 16-bit fixed-point precision for each task. Across these configurations, first-sample latency tracks on-chip parameter caching rather than arithmetic throughput, identifying data movement as the primary bottleneck. Precision reduction halves the model memory footprint without compromising objective speech quality. The fixed-point denoising accelerator achieves a first-sample latency of 9.7 ms, meeting the 10 ms clinical threshold, while speech separation reaches 16.0 ms. These measurements establish concrete resource requirements for embedded DNN-based speech enhancement and quantify the remaining gap to hearing aid deployment.
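The memory halving from FP32 to 16-bit fixed point can be sketched as follows. The abstract does not state the fixed-point format, so the Q1.15 layout (one sign bit, 15 fractional bits) and the helper names `to_q15` and `from_q15` are assumptions made for illustration.

```python
def to_q15(x):
    """Quantize a float in [-1, 1) to a signed 16-bit Q1.15 integer, saturating at the limits."""
    scaled = round(x * 32768)
    return max(-32768, min(32767, scaled))


def from_q15(q):
    """Convert a Q1.15 integer back to a float."""
    return q / 32768


# Hypothetical weights, chosen only to exercise rounding and saturation
weights = [0.5, -0.25, 0.123456, 0.999999]
quantized = [to_q15(w) for w in weights]
max_error = max(abs(w - from_q15(q)) for w, q in zip(weights, quantized))

fp32_bytes = 4 * len(weights)  # 32 bits per parameter
q15_bytes = 2 * len(weights)   # 16 bits per parameter
```

Each parameter drops from 4 bytes to 2, and for in-range values the round-trip error stays within one quantization step (1/32768).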

2606.04210 2026-06-04 eess.AS cs.LG cs.SD

Representation Matters in Randomized Smoothing for Audio Classification

表示在音频分类的随机平滑中至关重要

Jong-Ik Park, Shreyas Chaudhari, José M. F. Moura, Carlee Joe-Wong

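AI summary: Studies the representation problem in randomized smoothing for audio classification, shows experimentally how preprocessing and representation choices affect certified robustness, and proposes reporting guidelines.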

详情
AI中文摘要

随机平滑(RS)在添加高斯噪声的向量空间中认证鲁棒性。在音频分类中,该空间通常不是唯一确定的,因为标准流程会对波形进行归一化、范围控制,并将其转换为log-mel或其他频谱特征。我们表明,除非认证对象和预处理策略明确,否则直接RS是欠定义的。在两个音频基准(关键词识别和环境声音分类)上,我们研究了波形、特征空间和后处理平滑。我们的诊断显示了为什么表示感知的报告是必要的:在相同的平滑水平$σ=0.0025$下,两个数据集共享相同的中位数原始半径$.007996$,但不同的波形能量产生不同的SNR等效尺度($83.98$ vs. $90.97$ dB);log-mel平滑在环境声音上给出更高的正半径认证准确率($68.42\%$ vs. $65.53\%$),认证了更多具有非零半径的样本,但基于特征而非波形;裁剪或峰值归一化将有效扰动范数改变约$230$--$351\times$。因此,我们建议音频RS研究选择并报告任务特定的认证对象和扰动模型,包括扰动位置、增益策略、原始半径以及任何噪声后的几何变化。

英文摘要

Randomized smoothing (RS) certifies robustness in the vector space where Gaussian noise is added. In audio classification, this space is often not uniquely defined as standard pipelines normalize, range-control, and transform waveforms into log-mel or other spectral features. We show that direct RS is therefore under-specified unless the certified object and preprocessing policy are explicit. On two audio benchmarks, keyword spotting and environmental-sound classification, we study waveform, feature-space, and post-processed smoothing. Our diagnostics show why representation-aware reporting is necessary: at the same smoothing level $\sigma=0.0025$, the two datasets share the same median raw radius $0.007996$, but different waveform energies yield different SNR-equivalent scales ($83.98$ vs. $90.97$ dB); log-mel smoothing gives higher positive-radius certified accuracy on environmental sounds ($68.42\%$ vs. $65.53\%$), certifying more examples with nonzero radius but over features rather than waveforms; and clipping or peak normalization changes the effective perturbation norm by roughly $230$--$351\times$. We therefore recommend that audio RS studies choose and report the task-specific certified object and perturbation model, including the perturbation location, gain policy, raw radius, and any post-noise geometry changes.
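Why the certified object matters can be seen from the standard L2 radius formula for Gaussian randomized smoothing, R = σ·Φ⁻¹(p), where p is a lower bound on the top-class probability. The sketch below assumes this common binary-bound form (the abstract does not say which variant the authors use), and `radius_in_raw_units` is a hypothetical helper showing how a gain applied before noise injection rescales the radius in original waveform units.

```python
from statistics import NormalDist


def certified_radius(sigma, p_lower):
    """L2 radius sigma * Phi^{-1}(p_lower); zero when the bound does not exceed 1/2."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def radius_in_raw_units(radius, gain):
    """If the waveform is multiplied by `gain` before noise is added,
    a radius certified in the scaled space shrinks by 1/gain in raw units."""
    return radius / gain


r = certified_radius(sigma=0.0025, p_lower=0.99)
r_raw = radius_in_raw_units(r, gain=200.0)  # gain value is illustrative only
```

The same σ therefore certifies very different perturbation sizes depending on where in the pipeline the noise is injected, which is the reporting issue the abstract raises.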

2606.04209 2026-06-04 cs.LG

A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support

反事实行为的几何视角:边界接近度与局部支持的交互作用

Ioanna Gemou, Matteo Gamba, Randall Balestriero, Ritambhara Singh

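AI summary: Studies counterfactual behavior from a geometric perspective. The interaction of decision-boundary proximity and local data support determines whether counterfactuals are feasible, and counterfactual behavior is a dimension distinct from predictive performance that can be altered without changing accuracy.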

详情
AI中文摘要

反事实解释寻求对输入进行小的、语义上有意义的改变,以改变模型的预测,并广泛用于解释和审计机器学习系统。在现代视觉、语言和多模态系统中,预训练编码器将输入映射到表示空间,下游分类器头在这些空间内施加决策边界。因此,附近反事实的可行性和距离取决于边界相对于数据的位置。然而,具有相似预测性能的模型在是否能够实现此类改变以及表示必须移动多远方面可能存在显著差异。本文通过使用标准化局部搜索探针,在多个预训练编码器和线性分类器头上检验了这种变化。结果表明,尽管预测性能相似,但模型在反事实行为上存在显著差异。在固定表示下,仅改变分类器头就会改变反事实结果,而预测性能基本保持不变。这种变化由决策边界接近度和局部数据支持的交互作用解释,两者共同决定了预测变化是否可行且位于数据支持的区域内,并且还可以改进固定模型内的反事实搜索。总之,这些发现将反事实行为识别为超越预测性能的独立维度,并表明可以在不改变准确率的情况下改变它,这对模型选择、鲁棒性和反事实方法的可靠性具有启示意义。

英文摘要

Counterfactual explanations seek small, semantically meaningful changes to an input that alter a model's prediction, and are widely used to interpret and audit machine learning systems. In modern vision, language, and multimodal systems, pretrained encoders map inputs to representation spaces, and downstream classifier heads impose decision boundaries within those spaces. As a result, the feasibility and distance of nearby counterfactuals depend on boundary placement relative to the data. Yet models with similar predictive performance can differ substantially in whether such changes are achievable and how far representations must move. This work examines this variation using a standardized local search probe across several pretrained encoders and linear classifier heads. Results show that despite similar predictive performance, models differ substantially in their counterfactual behavior. Under fixed representations, varying only the classifier head alters counterfactual outcomes while leaving predictive performance largely unchanged. This variation is explained by the interaction of decision-boundary proximity and local data support, which jointly determine whether prediction changes are both feasible and lie in regions supported by the data, and can also improve counterfactual search within fixed models. Together, these findings identify counterfactual behavior as a distinct dimension beyond predictive performance and show that it can be altered without changing accuracy, with implications for model selection, robustness, and the reliability of counterfactual methods.
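For a linear classifier head, boundary proximity has a closed form, which makes the "same prediction, different counterfactual distance" point concrete. This is a minimal sketch under that assumption; it covers only boundary distance, not the local data support term, and the function names and the two example heads are hypothetical.

```python
import math


def margin(w, b, z):
    """Signed score of a linear head w.z + b on representation z."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def boundary_distance(w, b, z):
    """Euclidean distance from z to the hyperplane w.z + b = 0."""
    return abs(margin(w, b, z)) / math.sqrt(sum(wi * wi for wi in w))


def nearest_counterfactual(w, b, z, overshoot=1e-3):
    """Project z across the boundary along w, overshooting slightly to flip the sign."""
    norm_sq = sum(wi * wi for wi in w)
    step = (1.0 + overshoot) * margin(w, b, z) / norm_sq
    return [zi - step * wi for zi, wi in zip(z, w)]


z = [2.0, 1.0]                # fixed representation from a frozen encoder
head_a = ([1.0, 0.0], -1.0)   # boundary far from z
head_b = ([1.0, 0.0], -1.9)   # boundary close to z

dist_a = boundary_distance(*head_a, z)
dist_b = boundary_distance(*head_b, z)
cf_a = nearest_counterfactual(*head_a, z)
```

Both heads assign `z` to the positive class, yet the representation must move ten times farther to flip the prediction under `head_a` than under `head_b`.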

2606.04206 2026-06-04 cs.RO

DLO-Lab: Benchmarking Deformable Linear Object Manipulations with Differentiable Physics

DLO-Lab: 基于可微物理的可变形线性物体操作基准测试

Junyi Cao, Yian Wang, Ziyan Xiong, Chunru Lin, Zhehuan Chen, Chuang Gan

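AI summary: To address the challenge of robotic manipulation of deformable linear objects (DLOs) such as ropes and cables, proposes a differentiable simulator supporting a range of material properties, builds a benchmark task suite, pairs it with a specialized DLO agent, evaluates policy-learning algorithms, and validates sim-to-real transfer.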

详情
Comments
ICML 2026. Project page: https://dlo-lab-26.github.io/
AI中文摘要

我们解决了使机器人能够操作可变形线性物体(DLO),如绳索、电缆和橡皮筋的挑战。先前的工作主要集中于狭窄的、任务特定的问题,通常依赖于真实世界的演示或手工制作的启发式方法。然而,这些方法难以扩展到实践中遇到的各种材料和任务,并且收集足够多样化的真实世界数据通常是不切实际的。此外,现有的仿真环境对可泛化DLO操作所需的广泛材料行为支持有限。为了克服这些限制,我们引入了一个明确设计用于多功能DLO操作的可微模拟器。我们的模拟器模拟了广泛的材料属性——包括(不可)延伸性、弹性、弯曲塑性以及与其他物体的复杂交互——为学习和评估操作技能提供了坚实的基础。基于此模拟器,我们提出了一个代表性任务的基准套件,突出了DLO操作的独特挑战。这些任务的成功执行通常受到DLO固有的拓扑复杂性和抓取敏感性的阻碍。因此,我们引入了一个专门的DLO智能体,通过提出战略性抓取点并将长视界任务分解以最大化控制权,明确管理这些挑战。最后,我们使用我们的框架评估了各种策略学习算法,并进行了仿真到现实的迁移实验,展示了我们平台在推进DLO操作方面的潜力。

英文摘要

We address the challenge of enabling robots to manipulate deformable linear objects (DLOs), such as ropes, cables, and rubber bands. Prior work has primarily focused on narrow, task-specific problems, often relying on real-world demonstrations or handcrafted heuristics. Such approaches, however, struggle to scale to the wide variety of materials and tasks encountered in practice, and collecting sufficiently diverse real-world data is often impractical. Additionally, existing simulation environments offer limited support for the broad spectrum of material behaviors necessary for generalizable DLO manipulation. To overcome these limitations, we introduce a differentiable simulator explicitly designed for versatile DLO manipulation. Our simulator models a wide range of material properties, including (in)extensibility, elasticity, bending plasticity, and complex interactions with other objects, providing a robust foundation for learning and evaluating manipulation skills. Building on this simulator, we propose a benchmark suite of representative tasks that highlight the unique challenges of DLO manipulation. The successful execution of these tasks is often hindered by the topological complexity and grasp sensitivity inherent to DLOs. Therefore, we introduce a specialized DLO agent that explicitly manages these challenges by proposing strategic grasping points and decomposing long-horizon tasks to maximize control authority. Finally, we evaluate various policy-learning algorithms using our framework, alongside sim-to-real transfer experiments, demonstrating our platform's potential to advance DLO manipulation.

2606.04205 2026-06-04 cs.MM cs.AI cs.CL cs.CV cs.LG cs.SD

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

DetectZoo:一个用于跨文本、音频和图像模态的AI生成内容检测的统一工具包

Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

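AI summary: Introduces DetectZoo, a first-of-its-kind unified toolkit for multimodal AI-generated content detection that standardizes data preprocessing and evaluation and integrates 61 detectors and 22 benchmark datasets for fair, reproducible benchmarking.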

详情
AI中文摘要

生成模型的日益普及和能力提升模糊了人类与机器生成内容之间的界限,推动了跨文本、图像和音频检测领域的大量研究。大多数现有的检测器要么是商业软件,要么是开源但带有不兼容的代码库、定制化的预处理、评估协议和评估指标,这使得它们的采用、公平比较和复现变得相当困难。为了解决这一关键差距,我们引入了DetectZoo,这是首个可扩展的工具包,旨在为跨文本、音频和图像模态的AI生成内容检测提供统一接口。DetectZoo标准化了从数据摄取和预处理到模型评估的完整实证流程,为研究人员提供了一个统一的框架来系统地基准测试最先进的检测器。通过将多样的公共数据集和基线检测算法集成到单一的统一API下,我们的工具包促进了严格且可重复的评估。DetectZoo提供了61个检测器的参考实现、22个基准数据集的原生加载器,以及一个标准化的评估流程,通过通用接口报告多个指标。每个检测器都是自包含的,但可通过同一接口访问,自动缓存预训练权重,并复现原始发表的结果。DetectZoo降低了多模态AI取证的入门门槛,使研究人员能够识别跨领域的性能差距,并加速开发鲁棒、可泛化的检测技术。开源仓库和全面文档可在https://github.com/sadjadeb/DetectZoo 获取,且可通过pip install detectzoo安装该包。

英文摘要

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2606.04202 2026-06-04 cs.AI

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

SMAC-Talk: 面向大型语言模型的星际争霸多智能体挑战的自然语言扩展

Joel Sol, Homayoun Najjaran

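AI summary: Introduces the SMAC-Talk environment, which uses a natural-language communication channel to evaluate coordination and trust among LLM agents in cooperative multi-agent scenarios, and constructs evaluation scenarios that include a deceptive communicator.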

详情
Comments
8 pages, 1 figure
AI中文摘要

随着LLM的广泛部署,它们越来越需要与其他AI智能体协同工作而非孤立运行。在这些场景中,有效协调要求智能体进行通信、共享信息并在不确定性下做出决策。我们提出了SMAC-Talk,这是星际争霸多智能体挑战的自然语言扩展,用于评估基于LLM的智能体在合作多智能体环境中的表现。该环境具有分散控制、部分可观测性和长周期决策等关键特征。SMAC-Talk包含一个自然语言通信通道,用于探测智能体的协调与信任。我们利用该通信通道构建了不同的评估场景,包括嵌入欺骗性通信者的设置,该通信者试图仅通过通信来干扰和欺骗盟友。我们提供了三个基准测试智能体,使用Qwen3.5系列的4个模型,并研究了推理结构、记忆和模型规模如何影响智能体间的协调。我们将SMAC-Talk作为开放基准发布,以支持研究社区在合作多智能体场景中开发和评估LLM智能体。

英文摘要

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

2606.04199 2026-06-04 cs.CL cs.LG

Cross-Prompt Generalization in Detecting AI-Generated Fake News Using Interpretable Linguistic Features

使用可解释语言特征检测AI生成假新闻的跨提示泛化

Aya Vera-Jimenez, Samuel Jaeger, Calvin Ibenye, Dhrubajyoti Ghosh

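AI summary: Extracts lexical-diversity, readability, and emotion features and uses a random forest classifier in a cross-prompt framework to detect AI-generated fake news. Performance stays stable across prompts (AUC 0.988-1.000), indicating that these features generalize.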

详情
AI中文摘要

大型语言模型的日益普及引发了对AI生成假新闻传播的担忧,尤其是在不同的提示策略下。大多数现有的检测模型是在单一生成设置下训练和评估的,其跨未见提示的泛化能力尚不清楚。在本研究中,我们使用三个在不同提示下生成的AI文章数据集以及真实新闻文章,研究了假新闻检测中的跨提示泛化。我们提取了捕捉词汇多样性、可读性和情感特征的可解释语言特征,并在跨提示框架下评估了随机森林分类器,其中在一个提示上训练的模型在另一个提示上进行测试。在所有六个训练-测试组合中,性能始终保持较高,AUC值在0.988到1.000之间。特征分布分析显示,与整体数据集相比,AI生成文本表现出更高的词汇多样性、更低的可读性和显著较低的情感强度,且不同提示间存在差异。尽管存在这些分布变化,分类器仍保持强劲性能,表明这些特征捕捉了AI生成文本的稳定属性,这些属性可跨提示策略泛化。这些发现表明,基于特征的方法可以在提示变化下提供对AI生成假新闻的稳健检测。

英文摘要

The increasing use of large language models has raised concerns about the spread of AI-generated fake news, particularly under varying prompting strategies. Most existing detection models are trained and evaluated under a single generation setting, leaving their ability to generalize across unseen prompts unclear. In this study, we investigate cross-prompt generalization in fake news detection using three datasets of AI-generated articles produced under distinct prompts, combined with real news articles. We extract interpretable linguistic features capturing lexical diversity, readability, and emotion-based characteristics and evaluate a random forest classifier under a cross-prompt framework, where models trained on one prompt are tested on another. Across all six train-test combinations, performance remains consistently high, with AUC values ranging from 0.988 to 1.000. Analysis of feature distributions shows that AI-generated text exhibits increased lexical diversity, reduced readability, and substantially lower emotional intensity compared to the overall dataset, with variations across prompts. Despite these distributional shifts, the classifier maintains strong performance, indicating that these features capture stable properties of AI-generated text that generalize across prompting strategies. These findings suggest that feature-based approaches can provide robust detection of AI-generated fake news under prompt variability.

2606.04198 2026-06-04 cs.CV

Spatial Artifact Coherence Determines Codec Robustness in Patch-Based rPPG

空间伪影相干性决定基于补丁的rPPG中的编解码鲁棒性

Achraf Ben Ahmed

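AI summary: Proposes the Spatial Artifact Coherence (SAC) metric to explain why patch-based rPPG methods outperform global-projection methods under codec compression, and designs the PatchPCA algorithm family. Experiments show that SAC explains 93.8% of the variance in PCA advantage.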

详情
AI中文摘要

远程光电容积描记法(rPPG)在未压缩基准上实现了低心率误差,但在远程医疗、新生儿ICU和驾驶员疲劳应用中通过压缩视频通道部署。先前没有工作确定在编解码压缩下空间分解优于全局投影方法的物理量。我们提出空间伪影相干性(SAC),定义为4x4块间绿色通道协方差矩阵(带通0.75-2.5 Hz)的非对角能量与对角能量之比,以及PatchPCA算法族(四种编解码感知的rPPG算法)。我们在三个公共数据集上评估了280名受试者、11种编解码退化变体(MPEG-4、H.265、H.264、JPEG、色度子采样)和13种算法,通过Wilcoxon检验(BH-FDR,q < 0.05,904次检验)。SAC解释了PCA优势中93.8%的变体间方差(r = +0.969),编解码族之间零重叠:非MPEG-4变体聚集在SAC 0.10-0.18,PCA胜率84-90%;而MPEG-4变体聚集在SAC 0.48-0.59,胜率61%,平均改进降低5.8倍。在受试者内部,78%确认了预期模式(p < 10^-22,dz = 0.73)。变体内部受试者水平SAC相关性为r = +0.099,确认SAC分类编解码族而非预测个体结果。MPEG-4的影响是结构性的(宏块DCT几何,而非噪声幅度),由源编解码状态而非分辨率决定。P-Hybrid被确定为最部署鲁棒的算法。建立了PatchPCA优势的两个必要操作条件:SAC < 0.30和低到中等运动,直接排除了原始到MPEG-4转码流水线。SAC为临床远程监测系统中编解码感知的rPPG算法选择提供了物理基础度量。

英文摘要

Remote photoplethysmography (rPPG) achieves low heart-rate error on uncompressed benchmarks yet is deployed over compressed video channels in telehealth, neonatal ICU, and driver fatigue applications. No prior work identifies the physical quantity determining when spatial decomposition outperforms global-projection methods under codec compression. We propose Spatial Artifact Coherence (SAC), defined as the ratio of off-diagonal to diagonal energy in the 4x4 inter-patch Green-channel covariance matrix (bandpass 0.75-2.5 Hz), and the PatchPCA algorithm family (four codec-aware rPPG algorithms). We evaluate 280 subjects across three public datasets, 11 codec degradation variants (MPEG-4, H.265, H.264, JPEG, chroma subsampling), and 13 algorithms via Wilcoxon tests (BH-FDR, q < 0.05, 904 tests). SAC explains 93.8% of between-variant variance in PCA advantage (r = +0.969), with zero overlap between codec families: non-MPEG-4 variants cluster at SAC 0.10-0.18 with 84-90% PCA win rates, while MPEG-4 variants cluster at SAC 0.48-0.59 with 61% win rate and a 5.8x reduction in mean improvement. Within subjects, 78% confirm the expected pattern (p < 10^-22, dz = 0.73). Within-variant subject-level SAC correlation is r = +0.099, confirming SAC classifies codec families rather than predicting individual outcomes. MPEG-4's effect is structural (macroblock DCT geometry, not noise amplitude), governed by source codec state, not resolution. P-Hybrid is identified as the most deployment-robust algorithm. Two necessary operating conditions for PatchPCA advantage are established: SAC < 0.30 and low-to-moderate motion, directly ruling out raw-to-MPEG-4 transcoding pipelines. SAC provides a physically grounded metric for codec-aware rPPG algorithm selection in clinical remote monitoring systems.
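The SAC definition (off-diagonal versus diagonal energy of an inter-patch covariance matrix) can be sketched as below. Several details are assumptions because the abstract does not specify them: "energy" is taken as the sum of squared entries, the 0.75-2.5 Hz bandpass step is omitted, and the inputs are toy signals, so the resulting values are not comparable to the SAC ranges reported in the abstract.

```python
def covariance_matrix(signals):
    """Sample covariance between equally long per-patch time series."""
    n = len(signals[0])
    means = [sum(s) / n for s in signals]
    centered = [[v - m for v in s] for s, m in zip(signals, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]


def spatial_artifact_coherence(signals):
    """Ratio of off-diagonal to diagonal energy (sum of squared entries, assumed)."""
    cov = covariance_matrix(signals)
    size = len(cov)
    diagonal = sum(cov[i][i] ** 2 for i in range(size))
    off_diagonal = sum(
        cov[i][j] ** 2 for i in range(size) for j in range(size) if i != j
    )
    return off_diagonal / diagonal


# Two uncorrelated patch signals: artifacts are not shared across patches
independent = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
# Two identical patch signals: the same artifact appears in every patch
coherent = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0]]

sac_independent = spatial_artifact_coherence(independent)
sac_coherent = spatial_artifact_coherence(coherent)
```

Under this normalization the ratio is zero when patches fluctuate independently and grows as the same disturbance appears across patches, which is the regime where a patch-wise decomposition has less to separate.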

2606.04197 2026-06-04 cs.MA cs.CL cs.SI physics.soc-ph

Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions

探索共识的拓扑与记忆:LLM智能体在形成惯例时如何达成一致、分裂或稳定

Aliakbar Mehdizadeh, Martin Hilbert

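AI summary: Studies the interaction between memory depth and communication topology in LLM multi-agent systems. The sign of memory's effect on coordination flips with the degree of network centralization, and a memory-mediated speed-unity trade-off is revealed.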

详情
Comments
Submitted to the Journal of Artificial Societies and Social Simulation (JASSS)
AI中文摘要

一个LLM智能体应该记住多少,以及多智能体系统在试图达成共识时应该如何连接?我们展示了这两个设计选择以某种方式交互,使得记忆对协调的影响符号发生翻转。通过对八个固定的16智能体拓扑上的网络化命名游戏进行432次模拟运行,我们改变了记忆深度和网络结构。更长的记忆在去中心化网络中减缓了达到稳态的时间,但在中心化网络中加速了这一过程;相同的参数根据拓扑将系统推向相反的方向。关键的是,中心化网络中的“更快稳定”意味着更快地锁定到一个碎片化的平台,而不是达到系统范围的共识,这可以用来产生分歧的意见。我们进一步记录了一种记忆介导的速度-统一性权衡:中心化网络始终比去中心化网络保留更多竞争性惯例,但它们的稳定速度严重依赖于记忆。在智能体层面,网络内分析表明,高中介性的桥梁遭受中介惩罚,而局部聚类邻域中的智能体实现更高的协调成功。最后,为了寻找可解析的生成机制,我们发现智能体的选择被虚拟博弈很好地捕捉,表明是基于信念而非基于奖励的适应。实际意义:记忆深度和通信拓扑应共同设计,而不是孤立优化。

英文摘要

How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16-agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, "faster settling" in centralized networks means locking in to a fragmented plateau more quickly, not reaching system-wide consensus, which can be used to generate diverging opinions. We further document a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within-network analyses show that high-betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents' choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication: memory depth and communication topology should be co-designed, not optimized in isolation.
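A belief-based choice rule in the spirit of Fictitious Play, combined with a bounded memory, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' fitted model: the agent best-responds to the empirical frequency of names it has observed within its last `memory_depth` interactions, and the tie-breaking rule (prefer the most recently seen name) is hypothetical.

```python
from collections import Counter


def fictitious_play_choice(observed_names, memory_depth):
    """Pick the most frequent name among the last `memory_depth` observations.

    In a pure coordination game, the best response to the empirical
    distribution of partners' past choices is its most frequent element.
    """
    if memory_depth < 1:
        raise ValueError("memory_depth must be at least 1")
    recent = observed_names[-memory_depth:]
    if not recent:
        raise ValueError("no observations to respond to")
    counts = Counter(recent)
    top = max(counts.values())
    # Tie-break (assumed): prefer the name seen most recently
    for name in reversed(recent):
        if counts[name] == top:
            return name


history = ["A", "A", "A", "B", "B"]
deep_choice = fictitious_play_choice(history, memory_depth=5)
shallow_choice = fictitious_play_choice(history, memory_depth=2)
```

With the full history the agent sticks to the older majority name, while a short memory makes it follow the most recent trend; this is one simple way memory depth can change how quickly conventions shift.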

2606.04194 2026-06-04 cs.LG cs.CL cs.IR

Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval

免训练的词汇-稠密融合用于对话记忆检索

Christian Lysenstøen

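AI summary: Proposes a training-free, CPU-only retrieval method that fuses maximum query-turn similarity (late interaction) with BM25 at the score level, substantially improving hit rate for multi-session conversational-memory retrieval, and analyzes the effect of different encoders and pooling strategies.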

详情
Comments
9 pages, 3 figures, 10 tables. Code, data, and per-table receipts: https://github.com/Chrislysen/opsem
AI中文摘要

在跨长多会话历史中检索回答新查询的过去几轮是长期对话记忆(LoCoMo, LongMemEval)背后的检索瓶颈。最近的并行工作Nano-Memory表明,通过最大查询-轮次相似度(后期交互,“轮次隔离检索”)对会话进行评分优于均值池化的会话嵌入。我们不声称该效果;我们复现它并询问一个免训练、仅CPU的检索阶段应在其周围添加什么。我们报告四个发现。(1)融合:在单个留一对话权重下,后期交互稠密分数与BM25的分数级融合,在六个编码器上比单独后期交互增加+8.8到+17.2个LoCoMo Hit@1点(所有p<1e-4),达到Hit@1 0.752 / NDCG@5 0.829(e5-large-v2),比BM25高+11.2个百分点。(2)一个现成的网络搜索交叉编码器重排序器在融合的前10个结果上效果不佳,将Hit@1降低6.9个百分点(一个重排序器,一种配置)。(3)池化算子消融显示top-k后期交互匹配最大相似度,但朴素的平滑最大值(log-sum-exp)对一半编码器失效。(4)所有六个编码器的后期减早期差距很大,且较大的编码器差距往往更大,而边际融合增益缩小;在LongMemEval-S上,一个BM25饱和的词汇机制中,相对于BM25的净融合增益很小且不显著。按类别分析将增益视为分工:稠密后期交互在多跳和时间问题上帮助最大,但在对抗性问题上落后于BM25。贡献是对一个强大的免训练检索方案的可控、可复现的描述,而非后期交互检索器本身(Nano-Memory的)。我们不声称完整的记忆架构;这是一个检索阶段的研究。

英文摘要

Retrieving the few past turns that answer a new query across long multi-session histories is the retrieval bottleneck behind long-term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano-Memory, shows that scoring a session by the maximum query-turn similarity (late interaction, "Turn Isolation Retrieval") beats mean-pooled session embeddings. We do not claim that effect; we replicate it and ask what a training-free, CPU-only retrieval stage should add around it. We report four findings. (1) Fuse: score-level fusion of the late-interaction dense score with BM25, under a single leave-one-conversation-out weight, adds +8.8 to +17.2 points of LoCoMo Hit@1 over late interaction alone across six encoders (all p<1e-4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5-large-v2), +11.2 pp over BM25. (2) An off-the-shelf web-search cross-encoder reranker over the fused top-10 hurts here, degrading Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling-operator ablation shows top-k late interaction matches max-similarity, but a naive smooth-max (log-sum-exp) collapses for half the encoders. (4) The late-minus-early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval-S, a lexical regime where BM25 saturates, the net fusion gain over BM25 is small and not significant. A per-category analysis frames the gain as a division of labor: dense late interaction helps most on multi-hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training-free retrieval recipe, not the late-interaction retriever itself (Nano-Memory's). We make no claim to a complete memory architecture; this is a retrieval-stage study.
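The two ingredients, max-similarity session scoring and score-level fusion with BM25, can be sketched as below. The min-max normalization and the convex combination are assumptions chosen for illustration (the abstract only says a single fusion weight is selected by leave-one-conversation-out), the BM25 scores are taken as given inputs, and all vectors and numbers are toy values.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def late_interaction_score(query_vec, turn_vecs):
    """Score a session by its single best-matching turn (max query-turn similarity)."""
    return max(cosine(query_vec, t) for t in turn_vecs)


def mean_pooled_score(query_vec, turn_vecs):
    """Baseline: score a session by the average of its turn embeddings."""
    dim = len(query_vec)
    mean_vec = [sum(t[i] for t in turn_vecs) / len(turn_vecs) for i in range(dim)]
    return cosine(query_vec, mean_vec)


def min_max(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense_scores, bm25_scores, weight):
    """Convex combination of normalized dense and BM25 scores (assumed form)."""
    dense = min_max(dense_scores)
    lexical = min_max(bm25_scores)
    return [weight * d + (1.0 - weight) * b for d, b in zip(dense, lexical)]


query = [1.0, 0.0]
session_turns = [[0.0, 1.0], [1.0, 0.0]]  # one off-topic turn, one exact match
late = late_interaction_score(query, session_turns)
pooled = mean_pooled_score(query, session_turns)

fused = fuse(dense_scores=[0.2, 0.9, 0.5], bm25_scores=[12.0, 3.0, 10.0], weight=0.5)
best_session = fused.index(max(fused))
```

In the toy ranking, the dense score alone prefers session 1 and BM25 alone prefers session 0, while the fused score prefers session 2, which does reasonably well on both signals.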

2606.04193 2026-06-04 cs.CR cs.AI cs.DC

Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

公证代理:面向AI代理行为的接收方认证保密收据

Juan Figuera

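AI summary: To address the trust flaw of AI agents auditing their own logs, proposes Sello, a receiver-signed receipt protocol that provides a tamper-evident trail through HPKE encryption, JWS binding, and a Merkle log.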

详情
Comments
22 pages. Reference implementation at https://github.com/juanfiguera/sello
AI中文摘要

当前AI代理的可观测性在结构上存在缺陷:生成活动日志的实体与被记录活动的实体是同一个。被攻破或有缺陷的代理可以省略、篡改或伪造自身的追踪记录,而运行代理的操作员没有独立的方法检测篡改。我们提出了一类协议来解决这个问题,通过反转信任边界:接收代理调用的服务使用自己的密钥对观察到的内容签名收据,将收据加密给代理的所有者,并将其发布到公共透明度日志中。所有者可以在不信任代理或其操作员的情况下重建防篡改追踪。我们将该类协议实例化为Sello,一种结合了当前任何系统都不具备的四个属性的协议:(P1)接收方签名,(P2)通过JWS将HPKE加密绑定到所有者公钥的授权令牌,(P3)发布到见证人共同签名的Merkle日志,以及(P4)通过令牌引用进行所有者端发现。我们描述了该协议,分析了在攻击者控制代理及其操作员的情况下的安全性,给出了加密操作的微基准测试,并将Sello与相邻的收据协议工作(Signet、AgentROA、Agent Passport System、draft-farley-acta、SCITT)进行了比较。我们讨论了已知的限制,包括抑制攻击、服务合谋和采用激励问题。

英文摘要

Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.
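The tamper-evidence property of a Merkle log (P3) can be sketched with the standard library alone. The abstract does not specify Sello's hashing scheme, so this example assumes the RFC 6962 tree-hash conventions (0x00 leaf prefix, 0x01 node prefix, split at the largest power of two below the entry count). Receipt signing (P1), HPKE encryption (P2), and witness cosigning are omitted, and the receipt strings are hypothetical placeholders.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """Hash of a log entry, domain-separated from interior nodes."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash of an interior node over its two children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries):
    """Root hash over an ordered list of byte-string entries."""
    if not entries:
        return hashlib.sha256(b"").digest()
    if len(entries) == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly less than len(entries)
    split = 1
    while split * 2 < len(entries):
        split *= 2
    return node_hash(merkle_root(entries[:split]), merkle_root(entries[split:]))


# Placeholder receipts; in the protocol these would be encrypted, signed blobs
receipts = [b"receipt-1", b"receipt-2", b"receipt-3"]
root = merkle_root(receipts)

tampered = [b"receipt-1", b"receipt-X", b"receipt-3"]
tampered_root = merkle_root(tampered)
```

Any change to a published receipt changes the root, so an owner who tracks the cosigned root can detect alteration or replacement of entries without trusting the agent that triggered them.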

2606.04191 2026-06-04 cs.LG cs.AI

Metric-Aware Hybrid Forecasting for the CTF4Science Lorenz Challenge

CTF4Science Lorenz挑战的度量感知混合预测

Cen Lu

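AI summary: For the CTF4Science Lorenz challenge, proposes a metric-aware hybrid system that assigns a dedicated predictor (denoiser, ODE fitting, histogram substitution) to each metric family and achieves high scores across nine task pairs.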

详情
AI中文摘要

我们描述了针对CTF4Science Lorenz挑战的方法,该基准混合了短时预测、长时间分布匹配和轨迹重建,涵盖九项任务对。关键发现是,没有单一模型族在所有度量上占优。相反,我们构建了一个度量感知混合系统,为每个度量族分配不同的预测器:(1)用于全轨迹重建的合成预训练去噪器,(2)用于前20个预测步的Lorenz ODE拟合和轨迹射击,以及(3)使用合成Lorenz库的直方图尾部替换用于长时间评估。该系统中一个具有代表性的成熟提交在公共排行榜上得分为83.83551,而采用相同思想的小型后续堆栈达到了83.85529。我们专注于更干净的中间系统,因为它捕获了完整方法,同时足够简单以重现和分析,而最终提交可以理解为同一骨干的保守扩展。

英文摘要

We describe our approach to the CTF4Science Lorenz challenge, a benchmark that mixes short-horizon forecasting, long-time distribution matching, and trajectory reconstruction across nine task pairs. The key discovery is that no single model family dominated all metrics. Instead, we built a metric-aware hybrid system that assigned a different predictor to each metric family: (1) synthetic-pretrained denoisers for full-trajectory reconstruction, (2) Lorenz ODE fitting and trajectory shooting for the first 20 forecast steps, and (3) histogram-tail substitution using synthetic Lorenz libraries for long-time evaluation. A representative mature submission from this system family scored 83.83551 on the public leaderboard, and a small follow-up stack of the same ideas reached 83.85529. We focus on the cleaner intermediate system because it captures the full method while remaining simple enough to reproduce and analyze; the final submission can be understood as a conservative extension of the same backbone.
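The short-horizon component, rolling a Lorenz ODE forward for 20 steps, can be sketched with a fixed-step RK4 integrator. The classic parameters (σ = 10, ρ = 28, β = 8/3), the step size, and the initial state are assumptions; in the described system the parameters would come from the fitting stage, which is not shown here.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt, **params):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = lorenz(state, **params)
    k2 = lorenz(shifted(state, k1, dt / 2), **params)
    k3 = lorenz(shifted(state, k2, dt / 2), **params)
    k4 = lorenz(shifted(state, k3, dt), **params)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def rollout(state, dt, steps, **params):
    """Return the initial state followed by `steps` integrated states."""
    trajectory = [tuple(state)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], dt, **params))
    return trajectory


# 20-step forecast from an assumed initial condition and step size
forecast = rollout((1.0, 1.0, 1.0), dt=0.01, steps=20)
```

Because the system is chaotic, small errors in the fitted parameters or initial state grow quickly, which is consistent with using ODE shooting only for the first forecast steps and a distribution-level method for long horizons.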

2606.04189 2026-06-04 cs.CL

ACAT: A Collaborative Platform for Efficient Aspect-Based Sentiment Dataset Annotation

ACAT:一种用于高效方面级情感数据集标注的协作平台

Ana-Maria Luisa Mocanu, Ciprian-Octavian Truica, Elena-Simona Apostol

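AI summary: Introduces the ACAT platform, which uses an automated ETL pipeline and native support for four ABSA workflows to handle multi-annotator data consolidation and agreement computation, enabling efficient annotation and direct export of training-ready datasets.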

详情
Comments
Accepted at The 28th International Conference on Big Data Analytics and Knowledge Discovery (DaWak 2026)
AI中文摘要

方面级情感分析(ABSA)需要高质量的数据集来训练可靠的模型。然而,现有的标注工具将输出视为平面文件,使得研究人员不得不通过自定义脚本手动整合多标注者数据、重建关系结构并计算可靠性指标。本文介绍了ACAT(基于方面的情感分析协作标注工具),这是一个基于Web的平台,原生支持四种ABSA工作流:(1)方面类别情感分析,(2)子句级分割,(3)具有字符级位置跟踪的方面术语情感分析,以及(4)具有双跨度偏移保留的方面情感三元组提取。其核心贡献是一个自动化的提取、转换、加载(ETL)管道,该管道在导出时直接对齐协作标注并计算标注者间一致性(IAA)指标,生成训练就绪的数据集。在1002条餐厅评论的初步验证中,由两名不同专业水平的标注者进行标注,ACAT的中位标注时间为31.58秒,所有任务的原始IAA在0.78到0.86之间。

英文摘要

Aspect-Based Sentiment Analysis (ABSA) requires high-quality datasets to train reliable models. However, existing annotation tools treat output as flat files, leaving researchers to manually consolidate multi-annotator data, reconstruct relational structures, and compute reliability metrics through custom scripts. This paper introduces ACAT (Aspect-based sentiment analysis Collaborative Annotation Tool), a web-based platform natively supporting four ABSA workflows: (1) Aspect-Category Sentiment Analysis, (2) Clause-Level Segmentation, (3) Aspect-Term Sentiment Analysis with character-level position tracking, and (4) Aspect Sentiment Triplet Extraction with dual span offset preservation. Its core contribution is an automated Extract, Transform, Load (ETL) pipeline that aligns collaborative annotations and computes Inter-Annotator Agreement (IAA) metrics directly at export, yielding training-ready datasets. In a preliminary validation on 1,002 restaurant reviews with two annotators of differing expertise, ACAT achieves a median annotation time of 31.58 seconds and a raw IAA ranging from 0.78 to 0.86 across all tasks.
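The kind of agreement computation such a pipeline performs at export can be sketched for two annotators. The abstract reports "raw IAA" without defining it; this example assumes it means the fraction of items with identical labels, and adds Cohen's kappa as a common chance-corrected alternative. The labels are made-up examples.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement; undefined when expected agreement is 1."""
    n = len(labels_a)
    observed = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentiment labels for five aspect mentions
annotator_1 = ["pos", "pos", "neg", "neu", "pos"]
annotator_2 = ["pos", "neg", "neg", "neu", "pos"]

agreement = raw_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

Raw agreement is easy to interpret but does not account for agreement expected by chance, so kappa comes out lower on the same data (0.6875 versus 0.8 here).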

2606.04188 2026-06-04 cs.LG cs.AI cs.RO

Dual Advantage Fields

双优势场

Alexey Zemtsov, Maxim Bobrin, Alexander Nikulin, Dmitry V. Dylov, Fakhri Karray, Vladislav Kurenkov, Martin Takáč, Arip Asadulaev

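AI summary: Proposes Dual Advantage Fields (DAF), which turns a bilinear dual value model into a local advantage signal. An action-effect model predicts the discounted feature displacement and scores actions by its alignment with the goal direction, enabling policy extraction in offline goal-conditioned reinforcement learning.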

详情
Comments
Accepted by ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning
AI中文摘要

离线目标条件强化学习需要长期可达性估计和局部动作比较。双目标表示提供捕获全局目标可达性的值场,但它们不直接指定在给定状态下应优先选择哪个动作。我们提出双优势场(DAF),一种策略提取方法,将双线性对偶值模型转化为局部优势信号。在双线性对偶参数化下,目标嵌入是值场关于状态表示的梯度。DAF学习一个动作-效应模型,预测由动作引起的折扣特征位移,并通过该位移与目标方向的对齐程度对动作进行评分。在可实现的情况下,该分数等于目标条件Bellman优势,从而提供标准的局部策略改进保证。在OGBench的 locomotion、manipulation 和 puzzle 任务上,DAF改进了聚合RLiable指标,并在局部正确动作与直接朝向最终目标移动不同的设置中表现强劲。

英文摘要

Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal representations provide value fields that capture global goal reachability, but they do not directly specify which action should be preferred at a given state. We propose Dual Advantage Fields, a policy-extraction method that turns a bilinear dual value model into a local advantage signal. Under bilinear dual parameterization, the goal embedding is the gradient of the value field with respect to the state representation. DAF learns an action-effect model that predicts the discounted feature displacement induced by an action and scores actions by the alignment between this displacement and the goal direction. In the realizable case, this score equals the goal-conditioned Bellman advantage, yielding a standard local policy-improvement guarantee. On OGBench locomotion, manipulation, and puzzle tasks, DAF improves aggregate RLiable metrics and performs strongly in settings where locally correct actions differ from direct movement toward the final goal.
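The scoring rule can be sketched for the bilinear case V(s, g) = φ(s)·ψ(g), where the gradient of V with respect to φ(s) is the goal embedding ψ(g). In this minimal sketch the predicted displacements are given as fixed vectors rather than produced by a learned action-effect model, "alignment" is taken to be a plain dot product, and all names and numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def value(state_features, goal_embedding):
    """Bilinear value V(s, g) = phi(s) . psi(g)."""
    return dot(state_features, goal_embedding)


def action_score(predicted_displacement, goal_embedding):
    """Alignment of an action's predicted feature displacement with the goal direction."""
    return dot(predicted_displacement, goal_embedding)


def best_action(displacements, goal_embedding):
    """Action whose predicted displacement aligns best with psi(g)."""
    return max(displacements, key=lambda a: action_score(displacements[a], goal_embedding))


phi = [0.5, -1.0]   # state representation phi(s), toy values
psi = [0.2, 1.0]    # goal embedding psi(g), toy values
displacements = {   # stand-in for the action-effect model's predictions
    "left": [-1.0, 0.0],
    "up": [0.0, 1.0],
    "right": [1.0, 0.0],
}

chosen = best_action(displacements, psi)
moved = [p + d for p, d in zip(phi, displacements[chosen])]
value_gain = value(moved, psi) - value(phi, psi)
```

Because the value is linear in φ(s), the score of an action equals the change in value produced by its displacement, which is why ranking by alignment with ψ(g) acts as a local advantage comparison in this toy setting.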

2606.04185 2026-06-04 cs.RO

Distribution-Free Risk-Aware Planning and Control Under Uncertainty Using Conformal Spectral Risk Control

基于共形谱风险控制的免分布风险感知规划与控制

Junsik Eom, Tulga Ersal

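AI summary: Proposes a distribution-free, risk-aware model predictive control framework that extends conformal risk control to spectral risk measures, generating prediction sets that keep risk below a user-specified threshold under uncertainty. Vehicle obstacle-avoidance simulations show improved safety and efficiency.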

详情
Comments
Submitted to IEEE Robotics and Automation Letters
AI中文摘要

在动态和不确定环境中的安全导航通常依赖于对真实潜在不确定性的准确估计或假设。然而,由于数据有限或信息不完善,准确描述真实不确定性分布往往很困难。即使在高风险规避水平下,对不确定性及其相关风险的错误理解也可能导致危险决策。为了解决这个问题,我们提出了一种风险感知模型预测控制(RA-MPC)框架,该框架结合预测集来保证风险控制在用户指定阈值以下,而无需对潜在不确定性分布做出假设。为了生成预测集,我们开发了一种免分布的风险量化框架,将共形风险控制(CRC)扩展到一般谱风险度量。然后,我们证明将预测集纳入MPC框架即使在不确定性错误指定的情况下也能提供关于谱风险约束满足的统计安全保证。我们在模拟的车辆避障场景中验证了所提出的框架,与基线RA-MPC框架相比,展示了更高的安全性和更短的求解时间。

英文摘要

Safe navigation in dynamic and uncertain environments often relies on accurate estimation of, or assumptions about, the true underlying uncertainty. However, accurately characterizing the true uncertainty distribution is often difficult due to limited data or imperfect information. An incorrect understanding of the uncertainty and its associated risk may lead to dangerous decisions even under high levels of risk aversion. To address this issue, we propose a risk-aware model predictive control (RA-MPC) framework that incorporates prediction sets to guarantee risk control below a user-specified threshold without requiring assumptions about the underlying uncertainty distribution. To generate the prediction sets, we develop a distribution-free risk quantification framework that extends conformal risk control (CRC) to general spectral risk measures. We then show that incorporating the prediction sets into the MPC framework provides statistical safety guarantees in terms of spectral risk constraint satisfaction even under uncertainty misspecification. We validate the proposed framework in simulated vehicle obstacle avoidance scenarios, demonstrating improved safety and reduced solve time compared to a baseline RA-MPC framework.
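A spectral risk measure weights the ordered losses with a non-negative, non-decreasing weight profile, so worse outcomes count at least as much as better ones. The sketch below shows only this empirical quantity, with the mean and a CVaR-style tail average as special cases; it does not implement the paper's conformal calibration or prediction sets, and the function names and loss values are hypothetical.

```python
def spectral_risk(losses, weights):
    """Weighted sum of sorted losses with non-negative, non-decreasing weights summing to 1."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if any(later < earlier for earlier, later in zip(weights, weights[1:])):
        raise ValueError("weights must be non-decreasing")
    return sum(w * loss for w, loss in zip(weights, sorted(losses)))


def tail_average_weights(n, tail_count):
    """CVaR-style profile: zero weight on the best outcomes, uniform on the worst `tail_count`."""
    return [0.0] * (n - tail_count) + [1.0 / tail_count] * tail_count


losses = [4.0, 1.0, 3.0, 2.0]  # e.g. sampled constraint-violation costs (toy values)
risk_neutral = spectral_risk(losses, [0.25] * 4)
risk_averse = spectral_risk(losses, tail_average_weights(4, 2))
```

Shifting weight toward the largest losses raises the risk estimate (3.5 versus 2.5 here), which is how the choice of spectrum encodes the level of risk aversion that a risk constraint then has to satisfy.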