arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.11502 2026-06-11 cs.CL cs.AI New submission

When Roleplaying, Do Models Believe What They Say?

Benjamin Sturgeon, David Africa, Sid Black

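Affiliations: MATS

AI summary: Uses linear truth probes to study how role-play affects the internal representations of LLMs, finding that role-play mainly changes outputs rather than internal truth representations, whereas emergent misalignment shifts internal representations more markedly.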

English abstract

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

2606.11499 2026-06-11 cs.CL cs.AI New submission

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

Vedant Badoni, Danqi Chen, Xinyi Wang

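Affiliations: Princeton Language and Intelligence; Princeton University

AI summary: Proposes WebGraphMix, a framework that uses structural centrality scores from the Common Crawl host-level web graph to adjust the ratio of central to peripheral documents in pretraining data, with no model training or labeled data required. At the 400M and 1B parameter scales, a 1:1 mixture averages 41.4%, versus 39.8% for uniform sampling.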

Comments
10 pages

English abstract

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
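
The central-versus-peripheral mixing described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the helper name `mix_by_centrality`, the median cutoff between "central" and "peripheral", and the toy scores are all assumptions made for illustration. The abstract does not say how the split is drawn.

```python
import random
from statistics import median


def mix_by_centrality(docs, n_total, central_ratio=0.5, seed=0):
    """Sample n_total document ids with a fixed central/peripheral ratio.

    docs: list of (doc_id, host_centrality) pairs.
    Assumption: hosts at or above the median score count as "central".
    """
    cutoff = median(score for _, score in docs)
    central = [doc_id for doc_id, score in docs if score >= cutoff]
    peripheral = [doc_id for doc_id, score in docs if score < cutoff]
    rng = random.Random(seed)
    n_central = round(n_total * central_ratio)
    n_peripheral = n_total - n_central
    return rng.sample(central, n_central) + rng.sample(peripheral, n_peripheral)


# Toy corpus: ten documents with centrality scores 0.0 .. 0.9.
docs = [(f"doc{i}", i / 10) for i in range(10)]
sample = mix_by_centrality(docs, n_total=4)  # 1:1 mixture by default
```

Changing `central_ratio` moves the mixture toward hubs or fringes; the 1:1 setting corresponds to the ratio the abstract reports.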

2606.11489 2026-06-11 cs.RO New submission

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

Satyajeet Das, Darren Chiu, Shashank Hegde, Gaurav S. Sukhatme

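Affiliations: University of Southern California

AI summary: Proposes CLAE, a framework that steers multi-robot behavior at inference time by editing the intermediate activations of a frozen policy, without fine-tuning or retraining, and validates new behaviors on a multi-quadrotor navigation task: velocity control, formation keeping, and surveillance avoidance.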

English abstract

Real-world robots need to adapt their behavior beyond the envelope of their pre-trained policy. Policy finetuning or retraining are options, but they risk catastrophic forgetting, degrading the pretrained policy's base performance. To combat this, we introduce CLAE: Closed-Loop Affine Activation Editing, an inference-time framework for steering the behavior of a frozen policy by editing intermediate activations while keeping the base policy weights and downstream action head untouched. CLAE approaches behavior steering as a closed-loop problem whose outputs edit policy activations that adapt online to the robot state, environment, target behavior, and multi-robot context. It trains a sparse autoencoder over frozen-policy activations, selects behavior-relevant latent features via post-hoc probing, and learns a lightweight RL-based steering policy that applies state-dependent affine edits to selected latents during inference. We validate CLAE on a frozen multi-quadrotor navigation policy trained to perform a single task: navigating robots to a set of goal locations while avoiding obstacles. Through extensive simulations and physical tests, we show that while navigating to their goal positions, CLAE can 1. steer individual robot behavior by controlling each robot's velocity profile; 2. coordinate multirobot behavior by preserving a desired formation; and 3. produce entirely new behavior wherein robots are required to reduce their exposure to surveillance cameras in the environment.
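
The affine edit on selected latents mentioned in this abstract has the form z' = a·z + b. The sketch below shows that operation only. In CLAE the coefficients would come from the learned steering policy at each step; here they are fixed numbers, and the function name `affine_edit` and the dict-based interface are assumptions for illustration.

```python
def affine_edit(latents, scale, shift):
    """Apply z' = scale * z + shift to the selected latent indices only.

    scale and shift are dicts keyed by latent index (hypothetical interface);
    latents that are not listed pass through unchanged.
    """
    edited = list(latents)  # copy, so the frozen activations are not mutated
    for idx in scale:
        edited[idx] = scale[idx] * latents[idx] + shift.get(idx, 0.0)
    return edited


latents = [1.0, 2.0, 3.0]
steered = affine_edit(latents, scale={1: 2.0}, shift={1: -1.0})
```

Only index 1 is edited in this example; the other latents and the input list are left as they were.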

2606.11480 2026-06-11 cs.LG New submission

Accurate and Resource-Efficient Federated Continual Learning

Jebacyril Arockiaraj, Dhruv Parikh, Jayashree Adivarahan, Rajgopal Kannan, Viktor Prasanna

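Affiliations: University of Southern California; DEVCOM Army Research Office

AI summary: Proposes FedRAN, a framework that replaces gradient updates with compact random-feature statistics, uses truncated SVD to cut communication overhead, and handles label scarcity with prototype-based pseudo-labeling, improving accuracy on several datasets while sharply reducing resource consumption.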

Comments
Technical Report

English abstract

Federated continual learning (FCL) must learn from distributed task streams under limited resources, such as communication, computation, memory, and label availability. Existing FCL methods often rely on repeated local optimization, replay, and full supervision. Analytic alternatives avoid iterative training and replay, but using high-dimensional random features to improve accuracy requires a second-order feature statistic, the Gram matrix, which has a quadratic communication cost in the random feature size $M$. We propose FedRAN, a resource-aware analytic FCL framework that replaces gradient-based updates with compact random feature statistics. Each client transmits a truncated-SVD summary of its Gram matrix, reducing the dominant second-order upload from quadratic to linear in $M$ for fixed rank. The server performs a two-level QR-SVD subspace merge, spatially across clients and temporally across tasks, and solves a ridge classifier in closed form. FedRAN further supports label scarcity through prototype-based pseudo-labeling. Across CIFAR-100, ImageNet-R, and VTAB datasets, FedRAN improves average accuracy by up to 4.8 percentage points over the strongest baseline, uses 30.6-121.8$\times$ less per-client communication than optimization-based FCL, and is 190.3$\times$ faster on average than gradient-based baselines; with only 20% labels, pseudo-labeling improves average accuracy by up to 6.61 points. These results show that FedRAN enables accurate and resource-efficient FCL under communication, computation, and label constraints. The source code is available at this https URL.
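
The quadratic-to-linear reduction in upload size claimed in this abstract can be checked with simple counting. The sketch assumes a rank-r summary is sent as r vectors of length M plus r singular values; that format, the function names, and the values of M and r are illustrative assumptions, not figures from the paper.

```python
def gram_upload_floats(m):
    """Floats needed to send a full M x M Gram matrix."""
    return m * m


def truncated_upload_floats(m, rank):
    """Floats for a rank-r summary: r vectors of length M plus r singular values."""
    return rank * (m + 1)


m, rank = 10_000, 64  # illustrative sizes only
full = gram_upload_floats(m)
compact = truncated_upload_floats(m, rank)
ratio = full / compact  # grows linearly with M when rank is fixed
```

With the rank held fixed, doubling M doubles the compact upload but quadruples the full one, which is the scaling behavior the abstract describes.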

2606.11477 2026-06-11 cs.CV cs.AI New submission

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

Hartwig Grabowski

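Affiliations: Institute for Machine Learning and Analytics (IMLA), Offenburg University

AI summary: Uses vision-language foundation models (VLMs) to recognize handwritten answers, reaching 98.4% accuracy on 61 exams (3141 answer positions). A lightweight prompt lowers the false-negative rate to 0.58%, supporting fair, fully automated grading.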

Comments
11 pages, 2 figures, 3 tables

English abstract

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%-91% recognition, which is too low, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
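
The false-negative versus false-positive distinction in this abstract can be made concrete with a short sketch. The denominators are an assumption: here the false-negative rate is taken over answers the student actually got right, and the false-positive rate over answers the student got wrong. The paper may normalize differently, and the function name and toy data are hypothetical.

```python
def grading_error_rates(true_answers, recognized, reference):
    """Return (false_negative_rate, false_positive_rate) for letter answers.

    false negative: the student wrote the correct letter but it was read as wrong.
    false positive: the student wrote a wrong letter but it was read as correct.
    """
    fn = fp = n_correct = n_wrong = 0
    for truth, read, ref in zip(true_answers, recognized, reference):
        if truth == ref:
            n_correct += 1
            if read != ref:
                fn += 1
        else:
            n_wrong += 1
            if read == ref:
                fp += 1
    fnr = fn / n_correct if n_correct else 0.0
    fpr = fp / n_wrong if n_wrong else 0.0
    return fnr, fpr


reference = ["A", "C", "B", "D"]   # official solution
true_ans = ["A", "C", "D", "D"]    # what the student actually wrote
recognized = ["A", "G", "B", "D"]  # what the recognizer read
fnr, fpr = grading_error_rates(true_ans, recognized, reference)
```

In the toy data, one of three correct answers is misread (a false negative) and the single wrong answer is read as correct (a false positive).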

2606.11473 2026-06-11 cs.LG cs.AI stat.ML New submission

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

Jamie Heredge, Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Niraj Kumar

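Affiliations: Global Technology Applied Research, JPMorganChase

AI summary: Proposes CRUMB, which clusters queries, selects training subsets by minimizing the maximum mean discrepancy, and then runs exact inference, speeding up prior-fitted network inference without retraining and outperforming comparable methods on 51 datasets.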

Comments
26 pages, 13 figures

English abstract

Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.
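
Stage (ii) of this abstract, greedy selection of a training subset that minimizes MMD against a query cluster, can be sketched in a few lines. The sketch assumes an RBF kernel and the biased MMD² estimator, neither of which is specified in the abstract. It omits the clustering and PFN inference stages, and all function names are hypothetical.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length tuples (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two point sets."""
    def mean_k(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) - 2 * mean_k(xs, ys) + mean_k(ys, ys)


def greedy_mmd_subset(train, queries, k, gamma=1.0):
    """Pick k training indices, each step adding the point that most lowers MMD^2."""
    chosen = []
    for _ in range(k):
        best = min(
            (i for i in range(len(train)) if i not in chosen),
            key=lambda i: mmd2([train[j] for j in chosen + [i]], queries, gamma),
        )
        chosen.append(best)
    return chosen


train = [(0.0,), (0.1,), (5.0,), (5.2,), (10.0,)]
queries = [(4.8,), (5.0,), (5.1,)]  # one query cluster near x = 5
picked = greedy_mmd_subset(train, queries, k=2)
```

On this toy data the two training points near the query cluster are selected, so the reduced context matches the distribution of the queries rather than that of the full training set.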

2606.11470 2026-06-11 cs.CL New submission

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah

Affiliations: Singapore Institute of Technology; Nvidia AI Center (SNAIC); MIDAS Lab, IIIT Delhi; MIDAS Lab, IIT Mandi; Owl Autonomous Imaging, Inc.; College of Computing & Data Science, NTU Singapore; NVIDIA AI Technology Centre, Singapore; Department of Computer Science and Engineering, IIT Kanpur

AI summary: Systematically surveys more than 300 papers, proposes a structured taxonomy of LLM reasoning research covering multiple reasoning paradigms, analyzes methodological trends, and summarizes common limitations and failure modes, aiming to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems.

English abstract

Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.

2606.11466 2026-06-11 cs.CV New submission

PT-WNO: Point Transformer with Wavelet Neural Operator for 3D Point Cloud Semantic Segmentation

Nhut Le, Maryam Rahnemoonfar

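Affiliations: Lehigh University

AI summary: To address insufficient global context in point cloud semantic segmentation, proposes PT-WNO, which integrates a learnable Wavelet Neural Operator branch alongside the skip connections to capture multi-scale global spectral context. The abstract reports gains over the baseline on S3DIS and DALES and a competitive result on ScanNet v2.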

English abstract

Point cloud semantic segmentation requires architectures that capture both fine-grained local geometry and broad global scene structure. Transformer-based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder-decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene-level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT-WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder-decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi-scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large-scale 3D point cloud benchmarks demonstrate the effectiveness of PT-WNO. On S3DIS (Area 5), PT-WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT-WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).

2606.11464 2026-06-11 cs.RO New submission

Bridging the sim2real gap in the table tennis robot with a transformer-based ball states predictor

Yin Bi, Christian Conti, Bilan Yang, Alexander Sigrist, Peter Dürr, Naoya Takahashi

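Affiliations: Sony AI, Zürich, Switzerland; Sony AI, Tokyo, Japan

AI summary: Proposes a transformer-based framework for table tennis ball state prediction that uses attention to model long-range temporal dependencies and is trained on a large-scale real-world dataset, and introduces the SPAD strategy, which swaps out the simulator at deployment to narrow the sim-to-real gap without retraining.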

English abstract

Robotic table tennis is a representative benchmark for high-speed, closed-loop robotic control in dynamic environments, where accurate and fast prediction of ball states is critical for reliable planning and control. Physics-based approaches rely heavily on accurate parameter identification and precise initial state, while learning-based methods often struggle to capture long-range temporal dependencies and are typically trained on limited or simulated data. We propose a transformer-based framework for table tennis ball state prediction that leverages attention mechanisms to model long-range temporal correlations directly from historical observations, without relying on explicit flight or bounce models. To support robust learning and generalization, we collected a large-scale real-world dataset from players of varying skill levels and diverse ball cannon configurations. The combination of a high-capacity transformer architecture and extensive real-world data enables accurate long-horizon forecasting. Building on this capability, we introduce a plug-and-play sim-to-real transfer strategy, Swap Predictor at Deployment (SPAD), which replaces the physics-based simulator used during training with the proposed real-world-trained predictor at deployment, improving the sim-to-real transferability of the policy without requiring retraining. We demonstrate that this simple substitution effectively narrows the sim-to-real gap while preserving the efficiency and scalability of simulation-based training.

2606.11463 2026-06-11 cs.LG cs.AI New submission

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

Thomas Mbrice, Shashwat Panigrahi

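Affiliations: Stony Brook University

AI summary: Because climate change undermines traditional actuarial methods, proposes using LSTM neural networks to detect structural breaks. On Florida and Louisiana data, the authors hypothesize a 15-20% improvement in reserve accuracy for catastrophe years and provide theoretical guarantees.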

Comments
15 pages, 0 figures, whitepaper YC

English abstract

Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short-Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than the Chain Ladder, Bornhuetter-Ferguson, and Cape Cod methods. Using 15+ years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15-20% in reserve accuracy for catastrophe-exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

2606.11459 2026-06-11 cs.CL cs.AI cs.LG New submission

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon

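Affiliations: Google; UCLA

AI summary: Proposes APEX, a framework that dynamically stratifies data into Easy, Hard, and Mixed tiers and prioritizes high-leverage subsets, improving prompt optimization efficiency under a fixed budget. Across three benchmarks, it improves on the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B.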

English abstract

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
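
The Easy/Hard/Mixed stratification described in this abstract can be sketched as follows. The tiering rule used here is an assumed reading of the abstract, not the authors' definition: an example is Easy if every prompt candidate evaluated so far passed it, Hard if every candidate failed it, and Mixed otherwise. The function name and data layout are hypothetical.

```python
def stratify(results):
    """Group example ids into tiers from per-candidate pass/fail outcomes.

    results: {example_id: [bool, ...]}, one outcome per prompt candidate
    evaluated so far in the optimization lineage.
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    for example_id, outcomes in results.items():
        if all(outcomes):
            tiers["easy"].append(example_id)
        elif not any(outcomes):
            tiers["hard"].append(example_id)
        else:
            tiers["mixed"].append(example_id)
    return tiers


results = {
    "q1": [True, True, True],     # every candidate passes
    "q2": [False, False, False],  # every candidate fails
    "q3": [True, False, True],    # candidates disagree
}
tiers = stratify(results)
```

Under this reading, only the Mixed tier separates one candidate prompt from another, which is consistent with the abstract's choice to spend the evaluation budget there.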

2606.11456 2026-06-11 cs.CL cs.AI cs.CY New submission

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci

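Affiliations: University of Oxford; University of Zurich; Technical University of Munich

AI summary: Studies the methodological diversity and interpretive vulnerability of LLM agents in scientific analysis. Across 20 independent runs, agents match or exceed human diversity at the design layer but are susceptible to prompting at the verdict layer, so bias arises in interpretation rather than estimation.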

English abstract

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy question against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.

2606.11450 2026-06-11 cs.CV New submission

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

Shengkai Sun, Zhiyong Cheng, Zefan Zhang, Jianfeng Dong, Zhihui Li, Meng Wang

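Affiliations: Hefei University of Technology; Jilin University; Zhejiang Gongshang University; University of Science and Technology of China

AI summary: Proposes the Adaptive Masked Reconstruction (AMR) framework, which decouples the decoder from the encoder and adds an adaptive guidance module, accelerating pre-training and improving downstream action recognition accuracy, surpassing existing methods on several datasets.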

Comments
Accepted by CVPR2026. The code is available at this https URL

English abstract

Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.

2606.11447 2026-06-11 cs.CL New submission

AI Coding Agents Can Reproduce Social Science Findings

Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi, Atoosa Kasirzadeh, Joshua Tucker

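Affiliations: University of Oxford; University of Zurich; Carnegie Mellon University; New York University

AI summary: Builds the SocSci-Repro-Bench benchmark and evaluates two frontier coding agents, Claude Code and Codex, on 221 social science reproduction tasks, finding that both reproduce a large share of results, that Claude Code performs better, and that prompt framing can nudge agents toward confirmatory specification search.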

英文摘要

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across the social sciences remains limited. Existing evaluation benchmarks are insufficient: they are either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.

2606.11446 2026-06-11 cs.CV cs.GR 新提交

3D-CBM: A Framework for Concept-Based Interpretability in Generative 3D Modeling

3D-CBM:生成式3D建模中基于概念可解释性的框架

Ahmad Al-Kabbany

发表机构 * Yubree Labs Multimedia Interaction and Communication Lab, Arab Academy for Science and Technology(阿拉伯科学技术学院多媒体交互与通信实验室)

AI总结 提出将概念瓶颈模型(CBM)融入3D生成架构,通过多层级可解释原语和功能属性映射,实现语义可操控的3D生成,实验验证了高概念预测精度和交互式纠错能力。

详情
AI中文摘要

本研究引入了一个将概念瓶颈模型(CBM)融入3D生成架构的框架,以解决深度几何学习中固有的“语义鸿沟”。随着深度模型成为3D内容创建的核心,可解释性从边缘特性转变为医疗和制造等安全关键领域中信任和问责的基本要求。CBM通过约束潜在表示与人类定义的概念对齐,提供了一种内在的可解释性解决方案,但其在非结构化3D数据上的应用仍基本未被探索。我们设计、实现并验证了一个正式的3D-CBM架构,将原始几何输入(包括点云和网格)映射到可解释基元和功能属性的多层级分类中。该框架进一步确定了专门用于基于概念监督的战略性数据集,如PartNet和ShapeNet。来自3D部件操作概念验证实验的结果证明了该框架的有效性,实现了88.8%的概念预测准确率和0.0115的Chamfer距离。关键的是,该模型支持精确的测试时干预,允许交互式纠正结构错误。这项工作为语义可操控的3D生成奠定了基础,并邀请进一步探索协作式人在回路设计系统。

英文摘要

This research introduces a framework for incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures to address the inherent 'semantic gap' in deep geometric learning. As deep models become central to 3D content creation, explainability shifts from a peripheral feature to a fundamental requirement for trust and accountability in safety-critical domains such as healthcare and manufacturing. CBMs provide an intrinsic interpretability solution by constraining latent representations to align with human-defined concepts, yet their application to unstructured 3D data remains largely unexplored. We design, implement, and validate a formal 3D-CBM architecture that maps raw geometric inputs, including point clouds and meshes, into a multi-tiered taxonomy of interpretable primitives and functional attributes. The framework further identifies strategic datasets, such as PartNet and ShapeNet, specialized for concept-based supervision. Experimental results from a 3D part-manipulation proof-of-concept experiment demonstrate the framework's efficacy, achieving a concept prediction accuracy of 88.8% and a Chamfer Distance of 0.0115. Critically, the model enables precise test-time intervention, allowing for the interactive correction of structural errors. This work establishes a foundation for semantically steerable 3D generation and invites further exploration into collaborative human-in-the-loop design systems.
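
For readers unfamiliar with the metric, the Chamfer Distance mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes one common convention (mean squared nearest-neighbour distance, summed over both directions); the abstract does not say which variant the authors use. The function name `chamfer_distance` and the toy point sets are hypothetical.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance between two point sets.

    Assumed convention: for each point, take the squared distance to its
    nearest neighbour in the other set, average per direction, then add
    the two directions. Brute force, O(len(a) * len(b)).
    """
    a_to_b = sum(min(_sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(64)]
shifted = [(x + 0.1, y, z) for x, y, z in cloud]  # toy "reconstruction"

print(chamfer_distance(cloud, cloud))    # identical sets give 0.0
print(chamfer_distance(cloud, shifted))  # positive for a shifted copy
```

Because conventions differ (squared or unsquared distances, sum or mean), a reported value such as 0.0115 is only comparable across papers that use the same definition and the same point-cloud normalization.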

2606.11445 2026-06-11 cs.AI 新提交

Forecasting Future Behavior as a Learning Task

将未来行为预测作为学习任务

Mosh Levy, Yoav Goldberg, Asa Cooper Stickland

发表机构 * Bar-Ilan University(巴伊兰大学) Allen Institute for AI(艾伦人工智能研究所) UK AI Security Institute(英国人工智能安全研究所)

AI总结 提出将AI行为预测作为可学习任务,训练行为预测器从推理轨迹中预测未来行为,无需解释步骤,在两项任务上优于GPT-5.4和Claude Opus-4.6。

详情
AI中文摘要

对AI系统的信任通常基于对其工作原理的解释,人们利用这些解释来预测系统在新输入上的行为。对于大型推理模型(LRM),这条常规路径尤其难以遵循:针对单个token生成的解释方法无法自然推广到长轨迹,而轨迹本身在作为自然语言阅读时往往不忠实。我们提出一种绕过解释步骤的替代方案:将行为预测视为可学习任务,训练行为预测器(Behavior Forecasters)在单个推理轨迹上运行,以做出通常从解释中寻求的相同预测。预测器的训练数据通过查询LRM获得,无需人工标注,其推理在单次前向传播中完成。我们在两个任务上实例化该方法:LRM在重新运行时重复其答案的可能性,以及移除输入部分如何改变其答案。我们在三个不同的推理数据集上对这两个任务进行了评估,发现训练后的行为预测器比作为朴素读者阅读相同轨迹的GPT-5.4和Claude Opus-4.6更准确,而推理成本仅为其一小部分。我们发现,端到端微调骨干网络并从目标LRM初始化对于强性能都是必要的。这些结果表明,推理轨迹携带了关于LRM未来行为的信息,超出了朴素阅读所能传达的范围。

英文摘要

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single-token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.
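
The first forecasting task (how likely the LRM is to repeat its answer on re-runs) suggests a training label that can be computed without human annotation. The sketch below is one plausible way to build such a label, assuming exact string matching between answers; the function name `repeat_rate` and the matching rule are hypothetical and not taken from the paper.

```python
from typing import Sequence


def repeat_rate(original_answer: str, rerun_answers: Sequence[str]) -> float:
    """Fraction of re-runs whose final answer matches the original answer.

    Assumption: answers are compared by exact string equality after
    stripping whitespace. A real pipeline may need answer normalization.
    """
    if not rerun_answers:
        raise ValueError("need at least one re-run to estimate a repeat rate")
    target = original_answer.strip()
    matches = sum(1 for a in rerun_answers if a.strip() == target)
    return matches / len(rerun_answers)


# One trajectory answered "42"; four re-runs of the same prompt.
label = repeat_rate("42", ["42", "42 ", "41", "42"])
print(label)  # 0.75
```

A forecaster would then be trained to predict this number from the single original trajectory, so that no re-runs are needed at inference time.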

2606.11440 2026-06-11 cs.AI 新提交

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

INFRAMIND: 基础设施感知的多智能体编排

Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou

发表机构 * University of Central Florida(中佛罗里达大学)

AI总结 提出INFRAMIND框架,通过强化学习将基础设施状态(队列深度、KV缓存压力等)融入多智能体LLM编排的规划、路由和调度决策,在共享GPU集群上实现质量与延迟的平衡,相比基线提升最高7.6%准确率并降低7倍延迟。

详情
Comments
Preprint
AI中文摘要

现有的多智能体LLM编排方法,从暴力集成到学习型路由器,基于任务和模型特征选择模型和拓扑。然而,这些方法不考虑服务基础设施的运行时状态。在共享GPU集群上并发负载下,这种基础设施盲区导致系统性的资源利用不足:首选模型积累深度请求队列,而同等能力的替代模型闲置。在多智能体流水线中,每个查询触发多个顺序模型调用,这些延迟会进一步累积到每个下游步骤。弥补这一差距具有挑战性,因为相关基础设施信号(队列深度、KV缓存压力、延迟)是动态且嘈杂的,并且它们必须驱动三个不同的决策:规划、逐步骤路由和调度。我们引入INFRAMIND,一个使整个多智能体堆栈具备基础设施感知的框架。一个基础设施感知的规划器根据实时系统负载和剩余预算调节拓扑和角色选择,在拥塞时偏向简单图,在低负载时偏向丰富图。然后,一个基础设施感知的执行器在每个智能体步骤观察每个模型的队列深度、缓存利用率和响应延迟,以决定调用哪个模型以及推理深度;一个预算感知的调度器进一步重新排序每个模型的队列,使紧急请求优先得到服务。将其建模为分层约束MDP并通过强化学习端到端求解,系统自动学习平衡质量与延迟。在五个基准测试中,INFRAMIND在低负载下相比先前基线准确率提升高达7.6个百分点,延迟降低7倍,在高负载下维持高达99.9%的SLO合规性,而所有基线均降至50%以下。

英文摘要

Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
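
The scheduling idea in this abstract (reordering a model's queue so that urgent requests are served first) can be illustrated with a plain priority queue. This is a hand-written heuristic sketch, not the learned scheduler described in the paper: the `Request` class, the `remaining_budget_s` field, and the rule "least remaining budget first" are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(order=True)
class Request:
    # Hypothetical urgency signal: seconds left before the request
    # would violate its latency budget. Smaller means more urgent.
    remaining_budget_s: float
    request_id: str = field(compare=False)


def serve_order(requests: Iterable[Request]) -> List[str]:
    """Return request ids in the order a budget-aware queue would serve them."""
    heap = list(requests)
    heapq.heapify(heap)  # min-heap keyed on remaining_budget_s
    order = []
    while heap:
        order.append(heapq.heappop(heap).request_id)
    return order


queue = [Request(4.0, "a"), Request(0.5, "b"), Request(2.0, "c")]
print(serve_order(queue))  # ['b', 'c', 'a']
```

A first-in-first-out queue would serve `a` first even though `b` is closest to missing its budget; the reordering is what keeps urgent requests from waiting behind relaxed ones.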

2606.11435 2026-06-11 cs.CL 新提交

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

智能体技能评估与进化:框架与基准

Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas

发表机构 * Rutgers University(罗格斯大学) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

AI总结 本文系统综述了智能体技能从孤立创建到自动化评估驱动进化的范式转变,分类了四种进化范式并分析了六个技能基准类别,指出了覆盖缺口和开放方向。

详情
AI中文摘要

智能体技能的增长已经改变了智能体系统的构建、评估和部署方式。随着技能库的持续扩展,严格的评估对于确保其在现实应用中的效用、质量和安全性变得至关重要。因此,该领域正在经历从孤立技能创建到自动化、评估驱动的技能进化的新兴范式转变。在本综述中,我们系统地考察了超越基础技能创建的技能进化与评估的格局。我们将进化分为四种不同的范式,涵盖执行反馈、轨迹蒸馏、压缩和强化学习,展示了每种元素如何有助于提高技能效用和可靠性。我们还对六个以技能为中心的基准类别进行了分析,识别了基准覆盖范围、权衡和度量丰富性方面的结构性差距,以推动技能研究。最后,我们指出了构建可泛化、高效且可验证安全的技能生态系统的开放方向。项目网址为:此 https URL

英文摘要

The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is this https URL

2606.11431 2026-06-11 cs.LG 新提交

Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity

超越欧几里得稳定性的镜像下降:初始化敏感性的指数级分离

Shira Vansover-Hager, Matan Schliserman, Ofir Schlisselberg, Tomer Koren

发表机构 * Blavatnik School of Computer Science and AI, Tel Aviv University(特拉维夫大学布拉瓦特尼克计算机科学与人工智能学院) Google Research(谷歌研究院)

AI总结 本文证明非二次正则化的镜像下降(MD)在凸光滑目标上对初始化的敏感性可呈指数级增长,与梯度下降(GD)形成鲜明对比,并提出基于锚点的Bregman正则化可缓解不稳定性。

详情
AI中文摘要

镜像下降(MD)将梯度下降(GD)扩展到欧几里得几何之外,最近重新成为强化学习和LLM后训练中KL正则化策略优化的视角。这引发了一个基本的鲁棒性问题,对可重复性和可靠性至关重要:MD动力学对其输入的敏感性如何?我们关注初始化,它本身通常是预训练或先前对齐的模型。众所周知,二次正则化的MD(包括GD和马氏几何)对于凸光滑目标是稳定的。我们展示了一个鲜明的对比:一旦正则化器是非二次的,即使正则化器在欧几里得范数下是良条件的,MD对初始化的敏感性也可能比GD高指数级。我们给出了一个三维构造,其中目标函数是凸光滑的,正则化器是强凸、光滑且良条件的,初始$\varepsilon$扰动在$T$次步长为$\eta$的MD迭代后迅速放大到$\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$。对于单纯形上的典型KL正则化MD,我们证明即使线性目标在高维或近边界区域也能指数级放大初始$\varepsilon$扰动。最后,我们展示向锚点添加Bregman正则化项可以在很大程度上保持优化保证的同时稳定动力学,并且锚点的选择至关重要:在初始化处锚定仅部分缓解不稳定性,而在固定点锚定则产生更稳定的机制。

英文摘要

Mirror Descent (MD) extends Gradient Descent (GD) beyond Euclidean geometry and has recently reappeared as a lens for KL-regularized policy optimization in reinforcement learning and LLM post-training. This raises a basic robustness question, crucial to reproducibility and reliability: how sensitive are MD dynamics to their inputs? We focus on initialization, often itself a pretrained or previously aligned model. Quadratic-regularized MD, including GD and Mahalanobis geometries, is well-known to be stable for convex smooth objectives. We show a sharp contrast: once the regularizer is non-quadratic, MD can be exponentially more sensitive to initialization than GD, even with a well-conditioned regularizer in Euclidean norm. We give a three-dimensional construction with a convex, smooth objective and a strongly convex, smooth, well-conditioned regularizer where an initial $\varepsilon$ perturbation is quickly amplified to $\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$ after $T$ iterations of MD with step size $\eta$. For canonical KL-regularized MD on the simplex, we show that even linear objectives can amplify an initial $\varepsilon$ perturbation exponentially fast in high-dimensional or near-boundary regimes. Finally, we show that adding a Bregman regularization term toward an anchor point can stabilize the dynamics while largely preserving the optimization guarantees, and that the choice of anchor is crucial: anchoring at the initialization only partially mitigates the instability, whereas anchoring at a fixed point yields a more stable mechanism.
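
The near-boundary sensitivity of KL-regularized MD on the simplex can be seen in a two-coordinate toy problem. The sketch below uses the standard exponentiated-gradient form of the update, x_i ← x_i · exp(−η g_i) followed by normalization, on a linear objective. The specific instance (cost vector, starting point, perturbation size, step size) is chosen here for illustration and is not the paper's construction.

```python
import math
from typing import List


def md_kl_step(x: List[float], grad: List[float], eta: float) -> List[float]:
    """One KL-regularized mirror-descent step on the probability simplex."""
    w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, grad)]
    z = sum(w)
    return [wi / z for wi in w]


def run_md(x0: List[float], cost: List[float], eta: float, steps: int) -> List[float]:
    """Minimize the linear objective <cost, x>; its gradient is just cost."""
    x = list(x0)
    for _ in range(steps):
        x = md_kl_step(x, cost, eta)
    return x


def l1(u: List[float], v: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(u, v))


cost = [1.0, 0.0]            # mass should flow to the second coordinate
delta, eps = 1e-8, 1e-9      # start near the boundary, perturb slightly
x0 = [1.0 - delta, delta]
y0 = [1.0 - (delta + eps), delta + eps]
eta, steps = 0.5, 20

gap_0 = l1(x0, y0)
gap_T = l1(run_md(x0, cost, eta, steps), run_md(y0, cost, eta, steps))
print(gap_0, gap_T, gap_T / gap_0)
```

In this instance the second coordinate grows roughly like δ·e^{ηT} while it is still small, so the gap between the two runs grows by a factor on the order of e^{ηT} before saturating. Euclidean projected gradient descent on the same convex objective is non-expansive, which is the contrast the abstract draws.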

2606.11424 2026-06-11 cs.CL 新提交

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

SOMA-SQL: 通过合成日志和执行探测解决NL-to-SQL中的多源歧义

Sai Ashish Somayajula, Marianne Menglin Liu, Chuan Lei, Fjona Parllaku, Daniel Garcia, Rongguang Wang, Syed Fahad Allam Shah, Ankan Bansal, Sujeeth Bharadwaj, Tao Sheng, Sujith Ravi, Dan Roth

发表机构 * Oracle AI(甲骨文人工智能实验室)

AI总结 提出SOMA-SQL框架,通过合成查询日志和歧义驱动探测自动解决自然语言到SQL中的多源歧义,在6个基准上平均执行准确率提升13.0%。

详情
Comments
34 pages, 1 figure, 7 tables. Preprint
AI中文摘要

自然语言数据库接口旨在将用户问题转换为可执行的SQL,但在现实环境中,问题表述不明确且模式庞大且模糊时仍然脆弱。用户问题、数据库模式和模型解释之间的歧义是NL2SQL中的主要失败模式,导致意图不匹配、模式接地错误和SQL生成错误。现有方法依赖人工澄清或将歧义视为模式表示问题,但这些方法无法扩展也无法自主解决歧义。我们提出SOMA-SQL,通过目标合成查询日志和歧义驱动探测自动解决歧义。SOMA-SQL构建合成查询日志以接地模式解释并指导候选SQL生成;然后执行目标探测查询,由结构化歧义分类和候选不一致驱动,为最终SQL选择和修复生成消歧证据。这种主动的歧义发现和解决方法无需人工参与即可泛化到未见过的模式和查询分布。在六个公开基准上的实验表明,SOMA-SQL相比最先进的基线平均执行准确率提升13.0%,在歧义问题上提升高达16.7%。

英文摘要

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguities across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these neither scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via a targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs a synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without a human in the loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

2606.11420 2026-06-11 cs.CL cs.SI 新提交

Context-Aware Multimodal Claim Verification in Spoken Dialogues

口语对话中的上下文感知多模态声明验证

Chaewan Chun, Delvin Ce Zhang, Dongwon Lee

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) University of Sheffield(谢菲尔德大学)

AI总结 提出MAD2基准和上下文感知多模态融合方法,验证对话音频中的声明,发现对话结构比虚假信息框架对验证更重要。

详情
AI中文摘要

每天,数百万人从播客和流媒体中吸收声明,而这些声明从未被事实核查员看到。口语错误信息是通过对话构建的,其中可信度不仅来自事实本身,还来自声明如何在对话轮次中被构建、强化或未被质疑。然而,事实核查一直专注于孤立的文本,对话音频研究不足。我们引入了MAD2,一个新的用于口语声明验证的多轮音频对话基准,包含1,000个双说话者对话,3,368个值得核查的声明和约10小时的音频,并提出了上下文感知音频编码器和对话感知文本模型的校准多模态融合。在各种设置下,添加对话上下文改善了验证,但收益取决于场景类型。仅使用前文上下文通常与离线性能相当,支持实时审核设置,而当基于转录的模型因额外上下文而变得不稳定时,音频贡献最大。总体而言,对话结构对验证的影响比错误信息框架更大。

英文摘要

Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.

2606.11419 2026-06-11 cs.RO 新提交

A Modular Dual-Camera Pipeline for Micro-Inspection Using Aerial Robots

一种用于空中机器人微检测的模块化双相机流水线

S.H. Mirtajadini, N. Rublein, R.M. Ramakrishnan, G. ter Maat, M. Aldibaja, A.Y. Mersha

发表机构 * Netherlands Organization for Scientific Research (NWO)(荷兰科学研究组织) Saxion University of Applied Sciences(萨克逊应用科学大学)

AI总结 提出一种模块化双相机空中微检测流水线,通过变焦云台相机和广角立体导航相机协同工作,结合视觉反馈回路,实现对树木和温室粘虫板等非结构目标的鲁棒微检测。

详情
AI中文摘要

现有大多数基于无人机的检测系统要求无人机危险地靠近目标飞行或遵循复杂飞行路径以捕获小细节。此外,无人机飞行受干扰和定位不准确的影响,当视野狭窄时可能导致无人机丢失目标。此外,轨迹规划通常需要目标几何、位置和方向的先验信息,这对于非结构目标(如树木、车辆或人员)并不总是可用。为解决这些挑战,本文提出了 aerial_micro_inspection,一种适用于不同用例的通用空中微检测流水线。该流水线假设一架搭载PX4的无人机配备两个摄像头:(i) 一个变焦云台检测摄像头,无需无人机飞得离目标很近即可捕获精细细节;(ii) 一个宽视场立体导航摄像头,现场获取目标表面,估计其距离,并将其分割成较小的检测区域。此外,当检测摄像头访问较大表面的小分区时,基于视觉的反馈回路补偿无人机运动。我们在仿真和真实实验中评估了该流水线,主要在两种用例场景中:用于检测橡树行军虫及其卵的树木检测,以及用于检测粉虱的温室粘虫板检测。结果显示,在仿真中,无人机干扰下的覆盖鲁棒性得到改善,在真实实验中,有效检测了幼虫和卵,并对昆虫进行了高细节成像。该流水线是开源的,基于ROS 2开发,可通过替换表面分割和微目标检测检查点来适应新应用。代码见:this https URL

英文摘要

Most existing drone-based inspection systems require the drone to fly dangerously close to the target or follow complex flight paths to capture small details. In addition, drone flight is affected by disturbances and localization inaccuracies, which can cause the drone to lose sight of its supposed target when it has a narrow view. Furthermore, trajectory planning often requires prior information about the target's geometry, position, and orientation, which is not always available for non-structural targets such as trees, vehicles, or people. To address these challenges, this paper presents aerial_micro_inspection, a generic pipeline for aerial micro-inspection across different use cases. The pipeline assumes a PX4-powered drone equipped with two cameras: (i) a zoomed, gimbal-mounted inspection camera that captures fine details without requiring the drone to fly very close to the target, and (ii) a wide-field-of-view stereo navigation camera that acquires the target surface on site, estimates its range, and partitions it into smaller inspection regions. In addition, a vision-based feedback loop compensates for drone motion while the inspection camera visits small partitions of a larger surface. We evaluate the pipeline in simulation and real-world experiments, mainly in two use-case scenarios: tree inspection for detecting oak processionary caterpillars and their eggs, and greenhouse inspection of sticky traps for detecting whiteflies. The results show improved coverage robustness under drone disturbances in simulation, as well as effective detection of caterpillars and eggs and high-detail imaging of insects in real-world experiments. The pipeline is open-source, developed in ROS 2, and can be adapted to new applications by replacing the surface-segmentation and micro-target detection checkpoints. The code is available at: this https URL

2606.11409 2026-06-11 cs.LG cs.AI cs.CR 新提交

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

压力下的风险:语言模型对抗鲁棒性的计算感知评估

Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, Colin Raffel

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) Hugging Face

AI总结 提出基于计算压力(累积FLOPs)的对抗鲁棒性评估框架,通过风险-计算曲线和两个新指标,揭示不同攻击策略的计算成本差异,并在10个模型上验证了对齐训练、模型规模等因素对计算空间鲁棒性的非单调影响。

详情
AI中文摘要

大型语言模型(LLMs)的对抗鲁棒性评估通常报告固定查询预算下的攻击成功率(ASR),隐含地认为所有攻击成本相同。实际上,不同攻击策略的计算开销可能相差几个数量级。因此,固定预算下的ASR可能掩盖破解模型所需的真实努力,从而难以判断攻击成本是否值得。我们提出一个基于计算压力的计算感知评估框架,以累积浮点运算次数(FLOPs)作为对抗努力的代理。我们引入风险-计算曲线,将计算预算映射到攻击风险,并推导出两个指标,总结给定攻击成功所需的平均压力。在跨越三个模型家族和语言模型训练与对齐的四个不同阶段的十个模型上,使用三种攻击策略(基于梯度、迭代细化和基于模板)在两个破解鲁棒性基准上评估,我们发现:(1)对齐训练对计算空间鲁棒性具有非单调影响;(2)扩大模型规模降低了基于梯度的攻击有效性,但对更便宜的基于模板的攻击影响有限;(3)在代理模型上优化的基于梯度的攻击可以迁移到独立的目标模型,从而降低攻击者成本;(4)在单个模型内,不同危害类别的计算成本差异高达约5倍;(5)安全对齐的RL增加了总成本,同时使某些类别不成比例地易于攻击。我们发布框架以实现计算感知的风险评估和评价。

英文摘要

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to ${\approx}5{\times}$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
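
A risk-compute curve of the kind described in this abstract can be sketched as an empirical success fraction per compute budget. The definition used here (risk at budget B is the share of attack runs that first succeed within B cumulative FLOPs) is an assumption made for illustration; the abstract does not give the paper's exact definition, and its two summary metrics are not reproduced. The function name and the sample numbers are hypothetical.

```python
from typing import List, Optional, Sequence


def risk_compute_curve(
    flops_to_success: Sequence[Optional[float]],
    budgets: Sequence[float],
) -> List[float]:
    """Empirical attack risk at each compute budget.

    flops_to_success[i] is the cumulative FLOPs spent when attack run i
    first succeeded, or None if it never succeeded.
    """
    n = len(flops_to_success)
    if n == 0:
        raise ValueError("need at least one attack run")
    return [
        sum(1 for f in flops_to_success if f is not None and f <= b) / n
        for b in budgets
    ]


# Five hypothetical attack runs against one model; one never succeeds.
runs = [2e12, 5e13, None, 8e12, 3e14]
budgets = [1e12, 1e13, 1e14, 1e15]
print(risk_compute_curve(runs, budgets))  # [0.0, 0.4, 0.6, 0.8]
```

Reading the curve at a single point recovers a fixed-budget ASR, which is why two attacks with the same ASR can still differ sharply in how much compute they needed to get there.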

2606.11408 2026-06-11 cs.RO 新提交

Dynamic Execution Horizon Prediction for Chunk-based Robot Policies

基于分块的机器人策略的动态执行视界预测

Yuchi Zhao, Miroslav Bogdanovic, Arjun Sohal, Liyu Tao, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, Animesh Garg

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所) Acceleration Consortium(加速联盟) Canadian Institute for Advanced Research (CIFAR)(加拿大高等研究院) Georgia Institute of Technology(佐治亚理工学院) NVIDIA(英伟达)

AI总结 提出DEHP方法,通过在线强化学习训练轻量级执行视界预测分支,在冻结预训练分块策略的情况下动态调整执行步数,显著提升高精度和长时域操作任务的成功率。

详情
AI中文摘要

动作分块已成为现代机器人策略中的标准设计,从扩散/流策略到视觉-语言-动作模型,策略预测一系列动作并执行固定数量的动作,而不是一次一步地行动。然而,这种范式依赖于一个关键假设:固定的执行视界。在分块执行期间,策略以开环方式运行,这对于需要频繁重新规划的精细操作任务尤其成问题。在实践中,执行视界通常通过经验调整来选择,并且高度依赖于任务。为此,我们提出了动态执行视界预测(DEHP),一种有效的方法,它使用在线强化学习训练一个轻量级的执行视界预测分支,同时完全冻结预训练的分块策略。这使得该方法与黑盒分块策略兼容,并将适应执行视界的效果与底层动作生成器的变化隔离开来。在我们的评估中,DEHP大幅提高了不同高精度和长时域操作任务的成功率。我们的定性分析进一步表明,DEHP在任务的精细阶段预测较短的执行视界,在自由空间运动中预测较长的视界。通过这种方式,DEHP平衡了开环分块执行的效率与闭环单步控制的反应性。项目页面:此 https URL

英文摘要

Action chunking has become a standard design in modern robot policies, from diffusion/flow policies to vision-language-action models, where the policy predicts a sequence of actions and executes a fixed number of them instead of acting one step at a time. However, this paradigm relies on a key assumption: a fixed execution horizon. During chunk execution, the policy operates open-loop, which is particularly problematic for fine-grained manipulation tasks that require frequent replanning. In practice, the execution horizon is typically chosen through empirical tuning and is highly task-dependent. To this end, we propose Dynamic Execution Horizon Prediction (DEHP), an effective method that trains a lightweight execution-horizon prediction branch using online reinforcement learning while keeping the pretrained chunk policy completely frozen. This makes the method compatible with black-box chunk policies and isolates the effect of adapting the execution horizon from changes to the underlying action generator. Across our evaluations, DEHP improves the success rate of different high-precision and long-horizon manipulation tasks by a large margin. Our qualitative analysis further shows that DEHP predicts shorter execution horizons during fine-grained stages of the task and longer horizons during free-space motion. In this way, DEHP balances the efficiency of open-loop chunk execution with the reactivity of closed-loop single-step control. Project page: this https URL
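
The control loop implied by this abstract (predict a chunk, choose how many of its actions to execute, then replan) can be sketched in a one-dimensional toy setting. Everything below is hypothetical: `chunk_policy` stands in for a frozen pretrained chunk policy, and `horizon_predictor` is a hand-written rule standing in for the branch that the paper trains with online reinforcement learning.

```python
from typing import List, Tuple

CHUNK_LEN = 8    # actions per predicted chunk
MAX_STEP = 0.5   # largest move per action


def chunk_policy(obs: float) -> List[float]:
    """Frozen stand-in policy: plan CHUNK_LEN moves toward position 0."""
    plan, pos = [], obs
    for _ in range(CHUNK_LEN):
        action = max(-MAX_STEP, min(MAX_STEP, -pos))
        plan.append(action)
        pos += action
    return plan


def horizon_predictor(obs: float, chunk: List[float]) -> int:
    """Hypothetical rule: full chunk in free space, single step near the goal."""
    return len(chunk) if abs(obs) > 1.0 else 1


def rollout(obs: float, max_steps: int) -> Tuple[float, int]:
    """Execute with a dynamic horizon; return final position and replan count."""
    executed = 0
    replans = 0
    while executed < max_steps:
        chunk = chunk_policy(obs)  # the policy itself is never modified
        k = max(1, min(len(chunk), horizon_predictor(obs, chunk)))
        for action in chunk[:k]:
            obs += action          # deterministic toy dynamics
            executed += 1
            if executed >= max_steps:
                break
        replans += 1
    return obs, replans


final_obs, replans = rollout(5.0, max_steps=20)
print(final_obs, replans)
```

Only the choice of `k` changes between a fixed-horizon baseline and this loop, which mirrors the abstract's point that the method can wrap a black-box chunk policy without touching its action generator.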

2606.11396 2026-06-11 cs.RO 新提交

PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation

PLUME: 多指操作的概率潜在统一世界建模与参数估计

Abhinav Kumar, Soshi Iba, Rana Soltani Zarrin, Dmitry Berenson

发表机构 * University of Michigan(密歇根大学) Honda Research Institute USA(本田美国研究所)

AI总结 提出PLUME世界模型,联合学习参数信念演化与条件动力学,通过在线参数推断实现零样本迁移,在螺丝刀旋转等任务中优于现有方法。

详情
Comments
16 pages, 5 figures
AI中文摘要

多指手的灵巧操作可能对物理参数(如物体形状、姿态和摩擦系数)敏感。虽然仿真能够利用已知参数值进行大规模数据收集,但基于仿真训练的策略在部署时仍需处理不确定性,此时真实参数及由此决定的真实动力学是未知的。对于螺丝刀旋转等精确任务,标准域随机化策略可能不足,因为操作策略可能需要根据特定参数值而变化。为解决这一问题,我们提出了概率潜在统一世界建模与参数估计(PLUME),这是一种世界模型,它联合学习对参数值的信念演化以及以这些参数为条件的系统动力学。我们学习一个潜在空间,以联合表示多个性质不同的物理参数以及奖励(奖励本身是部分可观测变量的函数),从而为规划提供信息。我们的新颖学习框架通过在线参数推断(而非重新训练或微调)实现了世界模型与真实动力学的高效对齐。我们在模拟的螺丝刀旋转、阀门旋转、桶提升和圆盘弹射任务以及硬件螺丝刀旋转任务上评估了我们的方法,在这些任务中,我们实现了仿真训练策略的成功零样本迁移,并超越了最先进的离线强化学习和世界模型增强行为克隆基线。视频请见我们的网站:此 https URL。

英文摘要

Dexterous manipulation with multi-finger hands can be sensitive to physical parameters such as object shape, pose, and friction coefficients. While simulation enables large-scale data collection with known parameter values, simulation-trained policies must still handle uncertainty at deployment, where the true parameters and therefore the true dynamics are unknown. Standard domain randomization strategies may be insufficient for precise tasks like screwdriver turning, as manipulation strategies may need to change depending on specific parameter values. To address this, we propose Probabilistic Latent Unified world Modeling and parameter Estimation (PLUME), a world model that jointly learns to evolve a belief over parameter values as well as the system dynamics conditioned on those parameters. We learn a latent space to jointly represent multiple qualitatively different physical parameters along with rewards, themselves functions of partially-observable variables, to inform planning. Our novel learning framework leads to efficient alignment of the world model to true dynamics through online parameter inference as opposed to re-training or fine-tuning. We evaluate our method on simulated screwdriver turning, valve turning, bucket lifting, and disk flicking tasks, as well as a hardware screwdriver turning task, where we achieve successful zero-shot transfer of our simulation-trained policy and outperform state-of-the-art offline reinforcement learning and world-model-augmented behavior cloning baselines. Please see our website at this https URL for videos.

2606.11391 2026-06-11 cs.LG 新提交

Recursive Binding on a Budget: Subspace Carving in Order-p Tensor Memories

预算上的递归绑定:阶-p张量记忆中的子空间雕刻

Travis Pence, Daisuke Yamada, Vikas Singh

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 提出正交子空间雕刻(OSC)方法,通过将填充符投影到角色基的零空间来绑定到角色,固定阶张量记忆实现深度递归绑定,在恒定内存下提升高叠加场景的效率。

详情
Comments
24 pages, 12 figures, 7 tables
AI中文摘要

张量积表示为模型中的符号推理提供了所需的结构保真度,但在编码深层递归结构时会遭受指数级维度增长。相反,向量符号架构保持恒定维度,但由于通过叠加的噪声压缩而牺牲了容量和保真度。在这项工作中,我们提出了正交子空间雕刻(OSC),一种内存架构,通过将填充符投影到角色基的零空间上,然后聚合到固定的阶-p张量中,从而将填充符绑定到角色。OSC 使用投影来强制静态记忆痕迹中绑定结构之间的几何正交性。我们表明,这种机制将张量阶与结构深度解耦,从而在恒定内存占用内实现深度递归绑定。通过识别进行检索,这种构造允许分量向量比记忆张量小几个数量级,从而在涉及高叠加的场景中提供卓越的内存效率。我们还表明,TPR 是 Clifford 代数中绑定的一个特例,并给出了 OSC 的 Clifford 公式。

英文摘要

Tensor Product Representations provide the structural fidelity required for symbolic reasoning in models but suffer from exponential dimensionality growth when encoding deep recursive structures. Conversely, Vector Symbolic Architectures maintain constant dimensionality but sacrifice capacity and fidelity due to noisy compression via superposition. In this work, we propose Orthogonal Subspace Carving (OSC), a memory architecture that binds fillers to roles by projecting onto the null space of the role basis before aggregating into a fixed order-p tensor. OSC uses projections to enforce geometric orthogonality between bound structures within a static memory trace. We show that this mechanism decouples the tensor order from the structural depth, enabling deep recursive binding within a constant memory footprint. By performing retrieval via recognition, this construction allows for component vectors that are orders of magnitude smaller than the memory tensor, giving superior memory efficiency in settings involving high superposition. We also show that TPR is a special case of binding in Clifford algebra, and give a Clifford formulation of OSC.
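
The geometric step named in this abstract, projecting a vector onto the null space of a role basis, can be sketched with plain Gram-Schmidt. This covers only that projection: how OSC then binds and aggregates into an order-p tensor is not specified in the abstract and is not attempted here. The function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors: Sequence[Sequence[float]], tol: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis: List[Vector] = []
    for v in vectors:
        w = [float(x) for x in v]
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(x: Sequence[float], role_basis: Sequence[Sequence[float]]) -> Vector:
    """Remove from x every component lying in the span of the role vectors."""
    out = [float(xi) for xi in x]
    for b in orthonormalize(role_basis):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


roles = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
filler = [3.0, 2.0, 5.0, -1.0]
carved = project_to_null_space(filler, roles)
print(carved)                               # [0.0, 0.0, 5.0, -1.0]
print([dot(carved, r) for r in roles])      # [0.0, 0.0]
```

The projected vector is orthogonal to every role vector, which is the kind of geometric orthogonality the abstract says OSC enforces between bound structures.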

2606.11390 2026-06-11 cs.CV cs.DC cs.GR cs.LG 新提交

A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting

一种可扩展的多GPU高斯泼溅PyTorch抽象

Matthew Cong, Francis Williams, Jonathan Swartz, Mark Harris, Sanja Fidler, Ken Museth

发表机构 * NVIDIA(英伟达) University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 提出一种多GPU高斯泼溅方法,通过CUDA统一内存和NVLink在算子级别分布参数,实现大规模场景重建,支持超过10亿高斯泼溅。

详情
Comments
14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary material
AI中文摘要

高斯泼溅方法在真实世界的神经重建中越来越受欢迎。然而,由于计算和内存限制,它们在规模和分辨率上常常受限。我们提出了一种多GPU高斯泼溅方法,将重建扩展到更高的分辨率和更大的场景,同时抽象掉了通常与模型分布相关的代码复杂性。为实现这一目标,我们提出一个PyTorch后端,通过CUDA统一内存和NVLink在GPU之间分布高斯参数和泼溅算子。由于分布发生在算子级别,模型代码不需要显式的跨设备通信。更广泛地说,该后端将多个GPU暴露为一个聚合的PyTorch设备,并支持其他PyTorch算子。我们展示了包含超过10亿个高斯泼溅的城市规模重建,具有街道级细节,数量是当前最先进方法的25倍以上。

英文摘要

Gaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.

2606.11387 2026-06-11 cs.CL cs.AI cs.LG 新提交

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

小实验,更经济的决策:微预训练中分阶段提升的案例研究

Felipe Chavarro Polania

发表机构 * Hewlett Packard Enterprise(慧与科技公司)

AI总结 研究微预训练中分阶段提升协议,通过固定预算筛选配置,在Windows A100和Linux L40S上验证,发现早期排名不稳定,但最终协议以144 GPU小时找到最优配置,成本低于全量筛选。

详情
Comments
14 pages, 5 figures; 12-hour dual-host micro-pretraining promotion study; source package includes curated ancillary artifacts
AI-generated Chinese abstract

短预训练运行可以降低实验成本,但它们也可能过度推广那些仅在小预算下表现良好的配置。我们针对固定微预训练运行器在两个异构主机块(Windows A100和Linux L40S)上研究了一种可审计的分阶段提升协议。从12个预先筛选的配置开始,我们使用2分钟、5分钟、10分钟、60分钟和12小时的分阶段预算,并在昂贵的延续之前设置固定的提升规则。早期筛选被有意视为不稳定:5分钟和10分钟的排名对主机敏感,而最终的12小时排名最优条件并非复制10分钟门控下的平均最佳条件。由于不同阶段的种子范围不同,这些变化是操作性的提升证据,而非种子内曲线。复制60分钟门控将分阶段因子筛选桥接参考保留在提升集中,它在所有四个60分钟主机-种子单元中排名第一。在最终的12小时确认包中,桥接条件在两个种子的所有四个主机-种子单元中排名第一;贪婪比较器未满足固定的0.010 val_bpb近似等价规则;更便宜的d8/ar48(深度8,宽高比48)哨兵未满足固定的0.020平均差距规则。执行的12小时分支花费144 GPU小时,完整的分阶段协议记录169.2训练GPU小时(包括筛选阶段)。继续所有四个60分钟候选将花费192 GPU小时,而继续所有九个复制10分钟候选将花费432 GPU小时。后者是未运行延续的会计反事实,并非表明跳过的候选不可能超越参考。结果是一个有界成本分配发现,而非全局最优性、容量归一化优越性或优于自适应超参数优化方法的声明。

English abstract

Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
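The GPU-hour figures in the abstract can be reproduced with a short calculation. This is a minimal sketch under two assumptions that the abstract does not state outright: each host-seed cell occupies one GPU for the full 12-hour budget, and the executed branch continues three conditions (the bridge reference, the greedy comparator, and the d8/ar48 sentinel). The function name `continuation_gpu_hours` is hypothetical.

```python
# Assumption (not stated in the abstract): one GPU per host-seed cell.
HOSTS = 2   # Windows A100 and Linux L40S
SEEDS = 2   # two seeds in the 12-hour confirmation package
HOURS = 12  # final-stage budget per run


def continuation_gpu_hours(num_candidates, hosts=HOSTS, seeds=SEEDS, hours=HOURS):
    """GPU-hours needed to continue every candidate in every host-seed cell."""
    return num_candidates * hosts * seeds * hours


# Assumption: three conditions were continued in the executed branch.
executed = continuation_gpu_hours(3)   # 144, matches the executed 12-hour branch
all_60min = continuation_gpu_hours(4)  # 192, all four 60-minute candidates
all_10min = continuation_gpu_hours(9)  # 432, all nine replicated 10-minute candidates

# Screening overhead implied by the reported 169.2 training GPU-hours total.
screening = round(169.2 - executed, 1)  # 25.2

print(executed, all_60min, all_10min, screening)
```

Under these assumptions the three totals match the abstract exactly, and the screening stages account for roughly 25.2 GPU-hours of the 169.2 recorded.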

2606.11386 2026-06-11 cs.CL cs.AI eess.AS New submission

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

通过激活引导克服全双工口语语言模型中的状态惯性

Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass

Affiliations: MIT CSAIL

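AI summary: Addresses the delayed response of full-duplex spoken language models to user interruptions by proposing activation steering with a perception vector, which substantially improves interruption comprehension without fine-tuning.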

AI-generated Chinese abstract

全双工口语语言模型(FD-SLMs)通过允许模型同时听和说实现无缝语音交互,但其协调听与说的内部机制尚未充分探索。我们分析了FD-SLM隐藏表示中编码的预测行为,发现它们表现出特定流的预测模式:在听时,它们优先预测传入的用户流;而在说时,它们优先预测模型输出流。基于这一观察,我们表明FD-SLMs动态调节其内部预测焦点在两个状态之间:与模型输出生成一致的生成状态和与传入用户输入一致的感知状态。然而,这种调节可能滞后于对话上下文的突然变化。在用户打断期间,模型在过渡到感知状态之前短暂地偏向生成状态,导致其错过传入输入的开头。我们将这种延迟的内部过渡称为状态惯性。为了量化其下游影响,我们引入了零缓冲基准(ZBB),这是一个用于评估当用户语音突然开始时即时中断理解能力的诊断基准。我们使用响应正确性和初始词出现率(IWOR)来评估这一设置。最后,我们通过使用感知向量的激活引导来缓解状态惯性,这是一种无需训练且计算开销很小的干预措施。在多个最先进的FD-SLMs上,激活引导显著改善了中断处理;例如,在PersonaPlex上,它将正确性从28%提高到45%,将IWOR从40%提高到72%,而无需任何微调。

English abstract

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
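Activation steering in general adds a fixed direction to a hidden state at inference time. The sketch below illustrates that idea with plain Python lists. The abstract does not say how the perception vector is extracted or at which layer it is applied, so the difference-of-means construction, the function names, and the scaling factor `alpha` are all assumptions made for illustration, not the authors' method.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def perception_vector(perceptive_acts, generative_acts):
    """Assumed construction: mean perceptive-state activation minus
    mean generative-state activation (difference of means)."""
    perceptive_mean = mean_vector(perceptive_acts)
    generative_mean = mean_vector(generative_acts)
    return [p - g for p, g in zip(perceptive_mean, generative_mean)]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering vector; no weights are updated."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations (hypothetical values, not from the paper).
perceptive_acts = [[1.0, 0.0], [3.0, 2.0]]
generative_acts = [[0.0, 1.0], [2.0, 3.0]]

vector = perception_vector(perceptive_acts, generative_acts)
steered = steer([0.5, 0.5], vector, alpha=2.0)
print(vector, steered)
```

Because the intervention is a single vector addition per steered hidden state, it is consistent with the abstract's description of a training-free method with little extra computation.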

2606.11385 2026-06-11 cs.CV New submission

DeceptionX: Explainable Deception Detection with Multimodal Large Language Models

DeceptionX: 基于多模态大语言模型的可解释欺骗检测

Jiayu Zhang, Shuo Ye, Jiajian Huang, Yawen Cui, Taorui Wang, Wei Xia, Zeheng Wang, Haowen Tang, Hui Ma, Zitong Yu

Affiliations: Great Bay University, Hong Kong Polytechnic University

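AI summary: Proposes DeceptionX, a framework that recasts deception detection from black-box classification into an interpretable Observe-Think-Summarize reasoning process. Using the DeceptChain dataset and a three-stage training pipeline, it outperforms existing methods on standard benchmarks while providing expert-level, interpretable reasoning paths.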

AI-generated Chinese abstract

欺骗检测是情感计算和行为分析中一项关键且极具挑战性的任务。现有的深度学习方法通常将此任务视为简单的分类问题;然而,这种黑箱方法缺乏可解释性,无法捕捉人类专家在识别谎言时使用的复杂逻辑推理过程。尽管多模态大语言模型(MLLM)已展现出潜力,但有效应用它们需要在低层视听线索与高层逻辑推理之间建立桥梁。在本文中,我们提出DeceptionX,一种新颖的MLLM框架,将欺骗检测的范式从黑箱分类转变为可解释的观察-思考-总结推理过程。为解决高质量推理数据稀缺的问题,我们首先构建了DeceptChain,这是一个通过人机循环过程开发的高质量数据集。该数据集将细粒度的视觉和听觉证据(如微表情和声音颤抖)综合为结构化的思维链推理数据。此外,我们提出了一个三阶段训练管道和一种针对DeceptionX的差异感知冗余消除(DARE)策略,以进一步增强模型的泛化能力。大量实验表明,DeceptionX不仅在标准真实世界基准上优于现有的MLLM基线和最先进方法,而且提供了透明的、专家级的推理路径,弥合了多模态欺骗检测中准确性与可解释性之间的关键差距。

English abstract

Deception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination (DARE) strategy for DeceptionX to further enhance the model's generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.