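arXivDaily daily academic digest: synchronized with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.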

2606.00532 2026-06-02 cs.AI

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

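Jayant Parashar, Suchendra M. Bhandarkar

Affiliation: School of Computing, University of Georgia

AI summary: Proposes KACE, which uses a difficulty- and domain-stratified knowledge base together with tiered self-consistency to address context bloat in mathematical reasoning, reaching 62.2% accuracy on AIME 2025.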

Comments: 9 pages, 1 figure, 6 tables

Abstract

Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of learned guidance that can be used. Existing methods often conflate storage, what is learned across runs, with usage, what is included for a particular problem, and therefore inherit this prompt-size ceiling. We introduce Knowledge-Adaptive Context Engineering (KACE), which separates storage from usage through difficulty- and domain-based organization. Offline, a self-reflective learning loop distills training traces into an epistemic tree: a knowledge base of typed cards stratified by problem difficulty and epistemic domain. Each card is assigned to the difficulty-domain node corresponding to the failure from which it originated. At evaluation time, tiered self-consistency with per-tier agreement gates dynamically classifies each problem as easy, medium, or hard. Easy problems exit without retrieved cards, while harder problems retrieve only the matching branch of the tree. This tiered scheme matches or exceeds Best-of-N while using comparable compute, and it classifies problem difficulty with 78 percent pairwise concordance. The main empirical contribution is the construction and use of a difficulty- and domain-stratified knowledge base enabled by tiered self-consistency. On AIME 2025, KACE achieves 62.2 percent accuracy, a 10.4-point absolute gain over fixed Best-of-5 self-consistency at a comparable solver-call budget and a 5.6-point gain over the strongest learned-context baseline, Tiered + GEPA. We also observe consistent gains on MATH-HARD and the verifiable subset of OlymMATH.
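
The tiered scheme described in this abstract can be pictured with a short sketch. Everything below is an assumption made for illustration: the `solve` and `retrieve` callables, the three samples per tier, and the unanimous-agreement gate are hypothetical, since the abstract does not state KACE's actual budgets or thresholds.

```python
from collections import Counter
from typing import Callable, List, Tuple

Solver = Callable[[str, List[str]], str]      # (problem, cards) -> answer
Retriever = Callable[[str, str], List[str]]   # (problem, tier) -> cards


def agreement(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that voted for it."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def tiered_self_consistency(
    solve: Solver,
    retrieve: Retriever,
    problem: str,
    samples_per_tier: int = 3,   # hypothetical per-tier sampling budget
    gate: float = 1.0,           # hypothetical agreement threshold
) -> Tuple[str, str]:
    """Escalate easy -> medium -> hard until the samples agree strongly enough."""
    answer = ""
    for tier in ("easy", "medium", "hard"):
        # Easy problems are attempted without cards; harder tiers retrieve
        # only the branch of the knowledge base that matches the tier.
        cards = [] if tier == "easy" else retrieve(problem, tier)
        answers = [solve(problem, cards) for _ in range(samples_per_tier)]
        answer, frac = agreement(answers)
        if frac >= gate:
            return answer, tier
    # No gate passed: fall back to the majority answer of the hard tier.
    return answer, "hard"
```

The tier at which a problem exits doubles as its difficulty label, which is how a scheme of this shape can classify difficulty without a separate classifier.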

2606.00523 2026-06-02 cs.CL

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

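Junlong Tong, Yao Zhang, Anhao Zhao, Yingqi Fan, Yunpu Ma, Xiaoyu Shen

Affiliation: University of Science and Technology of China

AI summary: Proposes ProactiveLLM, which uses the model's endogenous states to guide interaction decisions and achieves active interaction through mask-based streaming modeling and synchronized privileged self-distillation, reducing latency while maintaining quality.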

Comments: ICML 2026

Abstract

Standard Large Language Models (LLMs) follow a read-then-generate paradigm, causing unnecessary latency and computation. Streaming LLMs alleviate this issue by generating while receiving inputs, but still struggle to decide when to interact with the stream. Existing methods either hard-code interaction timing or rely on costly external alignment signals, such as timing labels, reasoning trajectories, or stronger teachers. In this paper, we propose ProactiveLLM, which achieves active interaction by leveraging the model's endogenous states to guide interaction decisions. The model first learns to perceive semantic sufficiency from partial inputs through two complementary training mechanisms: mask-based streaming modeling and synchronized privileged self-distillation (SPSD). The former applies monotonic random masking to the input during training, simulating progressively revealed streaming inputs and enabling the model to learn local semantic dependencies from partial-input views. The latter aligns the partial-context student view with a full-context teacher view generated by the same evolving model, allowing privileged full-context evidence to guide the student's understanding under incomplete observations. Together, these mechanisms induce endogenous sufficiency cues without requiring external teachers or annotations, providing a versatile foundation for the plug-and-play integration of diverse decision heads. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction. Code is publicly available at https://github.com/EIT-NLP/StreamingLLM/tree/main/ProactiveLLM.
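
One way to picture the monotonic random masking mentioned in this abstract is as a sequence of random prefix masks whose visible length never shrinks. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name, the use of prefix masks, and the choice to reveal the full input at the last step are all hypothetical.

```python
import random
from typing import List


def monotonic_prefix_masks(
    num_inputs: int, num_steps: int, rng: random.Random
) -> List[List[bool]]:
    """Return one visibility mask per output step.

    Each mask exposes a random prefix of the input, and the prefix length is
    non-decreasing across steps, imitating a stream that is revealed over time.
    """
    cuts = sorted(rng.randint(1, num_inputs) for _ in range(num_steps))
    cuts[-1] = num_inputs  # assumption: the final step sees the whole input
    return [[i < cut for i in range(num_inputs)] for cut in cuts]


if __name__ == "__main__":
    for mask in monotonic_prefix_masks(6, 4, random.Random(0)):
        print("".join("x" if visible else "." for visible in mask))
```

Sorting the sampled cut points is what makes the masks monotonic: a token that was visible at one step stays visible at every later step.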

2606.00519 2026-06-02 cs.RO

DriveAnchor: Progressive Anchor-based Flow Learning for Autonomous Driving Planning

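Limin Yan, Haoyun Tang, Yutao Qiu, Hongqing Liu, Haoyu Xu

Affiliations: Meituan Autonomous Driving; Xi'an Jiaotong University; Beijing Institute of Technology

AI summary: Proposes DriveAnchor, a three-stage framework that achieves behavioral diversity, controllability, and safety through demonstration flow pretraining, guided flow post-training, and reward-refined flow fine-tuning; across about 2 million scenarios it reduces the near-range collision rate by 89% and improves mean reward by 32%.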

Abstract

We present DriveAnchor, a three-stage framework for autonomous driving planning that achieves behavioral diversity, controllability, and safety in a composable pipeline. Demonstration Flow Pretraining replaces the unstructured Gaussian prior with a vocabulary of 2,398 trajectory shapes constructed by farthest-point sampling, structurally grounding behavioral diversity in vocabulary coverage. Guided Flow Post-training jointly post-trains an Energy Field module with flow matching (FM), conditioning the Energy Field on static road geometry alone, to relocate anchors toward user-specified corridor polygons before flow generation, adding controllability without differentiable guidance; after Stage 2, new corridor presets require only Energy Field updates, not FM retraining. Reward-Refined Flow Fine-tuning applies zeroth-order reinforcement learning to align each anchor's output with collision-avoidance objectives: because the flow-matching model is a deterministic feedforward network in single-step mode, each anchor uniquely determines the output trajectory, reducing reward optimization to a direction search in anchor space without log-likelihood computation or ODE-to-SDE conversion. Evaluated on approximately 2 million held-out driving scenarios, DriveAnchor reduces near-range collision rates by 89% and improves mean reward by 32% without degradation in imitation accuracy, with 2.06 ms inference on NVIDIA Drive Orin. DriveAnchor has been validated through real-world vehicle testing, confirming its practicality for production deployment.
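
Farthest-point sampling, which this abstract names as the way the trajectory vocabulary is built, is a standard greedy procedure. The sketch below shows it on generic points; treating each trajectory as a flat coordinate vector and using Euclidean distance are assumptions, since the abstract does not specify the metric.

```python
import math
from typing import List, Sequence


def farthest_point_sampling(
    points: Sequence[Sequence[float]], k: int, start: int = 0
) -> List[int]:
    """Greedily pick k indices so each new pick is farthest from the chosen set."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(nearest, points)
        ]
    return chosen
```

Because each pick maximizes the distance to everything already selected, the resulting subset spreads over the data rather than clustering where samples are dense, which is the coverage property the abstract relies on.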

2606.00518 2026-06-02 cs.AI

Acting with AI: An Interaction-Based Framework for Agentic Tort Liability

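Yiheng Yao

Affiliation: Yiheng Yao

AI summary: Drawing on Bratman's planning theory and the common law's treatment of human-human concerted action, this paper proposes an interaction taxonomy (autonomous drift, pure tool use, collaborative planning) for allocating tort liability for agentic AI systems, and introduces a "Reasonable Agent" standard.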

Abstract

Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman's planning theory and on the common law's treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines; collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation; autonomous drift maps onto frolic and detour under respondeat superior and strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a "Reasonable Agent" standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.

2606.00516 2026-06-02 cs.AI

Threshold-Based Exclusive Batching for LLM Inference

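Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma, Shining Wu

AI summary: Addresses the rise in marginal cost caused by prefill-decode interference in mixed batching by deriving an EB-MB performance crossover condition and optimal switching thresholds based on GPU memory bandwidth, model size, and workload; optimized exclusive batching achieves up to 41.9% higher throughput on bandwidth-constrained GPUs.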

Comments: 37 pages, 12 figures. Accepted at ICML 2026

Abstract

Mixed batching (MB)--interleaving prefill and decode in a single batch--has become the standard scheduling strategy for large language model (LLM) inference due to its efficiency in maximizing compute and memory utilization. However, through controlled experiments, we find that prefill-decode interference inflates MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; however, on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), this threshold plummets to just 20%. Consequently, the optimal choice between MB and exclusive batching (EB) fundamentally depends on GPU memory bandwidth, model size, and workload composition. We derive a closed-form condition for this EB-MB performance crossover, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. Our hybrid scheduler EB+ applies this condition online to dynamically switch between EB and MB without manual intervention. Under non-stationary traffic with distribution or concurrency shifts, EB+ attains the highest or near-highest throughput in every setting, outperforming MB by up to 36.4%.

2606.00515 2026-06-02 cs.RO cs.AI cs.SY eess.SY

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

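Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li

Affiliations: Southwest Jiaotong University; University of Leeds

AI summary: Proposes PaCo-VLA, a framework in which a passivity shield treats VLA model outputs as task-level compliance proposals and uses energy-tank accounting and boundary checks to prevent invalid predictions from bypassing low-level contact physics, enabling safe and precise contact-rich manipulation.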

Comments: Under review, code will be available soon

Abstract

Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.

2606.00514 2026-06-02 cs.LG cs.CV

Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation

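Hugues Van Assel, Edward De Brouwer, Saeed Saremi, Gabriele Scalia, Aviv Regev

Affiliation: Genentech

AI summary: Studies the role of self-supervised learning (SSL) features in one-step generative models, proposes distribution matching with the Sinkhorn divergence in a semantic feature space, substantially reduces ImageNet FID, and reveals a potential conflict between the features used by the evaluation metric and those used for training.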

Comments: 26 pages, 4 figures

Abstract

Generative modeling and self-supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one-step generation, where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population-level discrepancy approximated by Fréchet-style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39× reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at https://github.com/Genentech/semantic-transport-generation.
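
For readers unfamiliar with the Sinkhorn divergence used in this abstract, a minimal pure-Python sketch follows. It assumes the features have already been extracted into small lists of vectors, uses uniform weights and a squared Euclidean cost, and fixes the regularization `eps` and iteration count arbitrarily; none of these choices come from the paper, and a practical version would run on batched tensors.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _ot_eps(x: Sequence[Point], y: Sequence[Point], eps: float, iters: int) -> float:
    """Entropic OT value between uniform point clouds (log-domain Sinkhorn)."""
    n, m = len(x), len(y)
    cost = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in y] for p in x]
    f, g = [0.0] * n, [0.0] * m  # dual potentials
    for _ in range(iters):
        # Alternating soft-min updates of the two dual potentials.
        f = [
            -eps * math.log(sum(math.exp((g[j] - cost[i][j]) / eps) for j in range(m)) / m)
            for i in range(n)
        ]
        g = [
            -eps * math.log(sum(math.exp((f[i] - cost[i][j]) / eps) for i in range(n)) / n)
            for j in range(m)
        ]
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(
    x: Sequence[Point], y: Sequence[Point], eps: float = 1.0, iters: int = 200
) -> float:
    """Debiased divergence: OT(x, y) - 0.5 * OT(x, x) - 0.5 * OT(y, y)."""
    return (
        _ot_eps(x, y, eps, iters)
        - 0.5 * _ot_eps(x, x, eps, iters)
        - 0.5 * _ot_eps(y, y, eps, iters)
    )
```

Subtracting the two self-terms removes the entropic bias, so the divergence is zero when the two sample sets coincide and grows as they move apart.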

2606.00512 2026-06-02 cs.LG cs.IT math.IT stat.ML

Semi-Supervised Learning with Noisy Proxy Covariates: Generalization Bounds and Distribution Regression

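Kwangho Kim, Jisu Kim

AI summary: For semi-supervised regression with noisy proxy covariates, proposes a two-stage estimator that learns kernel eigenfeatures from all proxy covariates and fits ridge regression on the labeled data. Theory shows that it recovers the fast labeled-sample rate when the proxy perturbation is controlled and unlabeled proxy covariates are plentiful, and experiments show that it outperforms supervised and semi-supervised baselines at low label rates.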

2606.00511 2026-06-02 cs.LG cs.CV

Saliency-Aware Model Merging

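Jungin Park, Jiyoung Lee, Kwanghoon Sohn

AI summary: Proposes SA-Merging, which applies connectivity-based saliency from structural pruning (e.g., SynFlow) to data-free model merging, reduces task interference through task-vector saliency scores and merge-aware modulation, and validates the approach on vision and language tasks.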

Comments: ICML 2026 Camera-ready

Abstract

Model merging aims to consolidate multiple task-specific models fine-tuned on different datasets into a unified architecture that achieves cross-domain proficiency. Current data-free model merging methods often struggle to scale because they rely on simple parameter-level heuristics that ignore inter-layer dependencies and the non-uniform distribution of expertise. This work proposes SA-Merging, which is built upon connectivity-based saliency formulations from structural pruning (e.g., SynFlow) and extends them to the data-free model merging setting. We define a saliency score over task vectors relative to a shared base model, and further introduce merge-aware modulation that incorporates agreement across experts to mitigate task interference. Based on this formulation, an iterative saliency-aware merging procedure progressively removes non-informative updates while preserving end-to-end connectivity. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRAs without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.
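
The notions of a task vector and of agreement across experts can be illustrated with a deliberately simplified stand-in. The sketch below is not SA-Merging and does not compute SynFlow-style connectivity saliency; it swaps in a plain per-parameter sign vote, and the function name and flat-list parameter layout are hypothetical.

```python
from typing import List


def merge_with_sign_agreement(
    base: List[float], experts: List[List[float]]
) -> List[float]:
    """Merge fine-tuned experts into the base using only agreeing updates."""
    # A task vector is the difference between an expert and the shared base.
    task_vectors = [[w - b for w, b in zip(expert, base)] for expert in experts]
    merged = []
    for i, b in enumerate(base):
        deltas = [tv[i] for tv in task_vectors]
        # Elect the dominant update direction for this parameter.
        sign = 1.0 if sum(deltas) >= 0 else -1.0
        # Keep only updates that agree with it; conflicting ones are dropped.
        agreeing = [d for d in deltas if d * sign > 0]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + update)
    return merged
```

Dropping updates that disagree with the elected direction is one simple way to reduce interference between tasks; a saliency-based method would additionally weigh how much each update matters to the network's end-to-end connectivity.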

2606.00510 2026-06-02 cs.CL cs.AI

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang, Yi Yang, Cong Qin, Yangen Hu, Lu Pan, Ke Zeng

Affiliations: Meituan; Fudan University; Shanghai Jiao Tong University; Nanjing University; Peking University

AI summary: Proposes SelSkill, a framework that learns selective skill invocation through dual-granularity preference learning and substantially improves task success and execution precision on ALFWorld and BFCL.

Comments: 18 pages, 4 figures, 10 tables

Abstract

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.

2606.00509 2026-06-02 cs.CV

Structure-Aware Consistency Priors for Shape from Polarization in Complex Media

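Kaimin Yu, Puyun Wang, Huayang He, Xianyu Wu

AI summary: To handle the nonlinear mapping between polarization observations and surface normals in complex media (with ice as the example), proposes a structure-aware polarization prior based on autocorrelation functions and a dual-branch network, IceSfP, that estimates normals accurately through cross-modal attention and multi-scale feature fusion, reaching a mean angular error of 16.01° on the first real-world ice SfP dataset.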

Journal ref: ICML 2026

Abstract

Recovering surface normals from single-view polarization images in complex media remains challenging. This paper focuses on ice as a representative complex medium, where intricate light-matter interactions lead to a nonlinear mapping between polarization observations and surface normals. To address this, a structure-aware polarization prior based on autocorrelation functions is proposed to capture the local spatial consistency of AoLP. Building on this, a dual-branch network (IceSfP) is designed to integrate raw polarization features with priors via cross-modal attention and multi-scale feature fusion, enabling accurate surface normal estimation under complex media conditions. To evaluate the method, the first real-world ice SfP dataset is constructed. Experimental results show that the method outperforms existing approaches across all metrics, achieving an MAE of 16.01°, which is 2.74° lower than the second-best method. The framework provides a generalizable solution for high-precision geometric perception in complex media.

2606.00508 2026-06-02 cs.CV cs.AI

V-LynX: Token Interface Alignment for Video+X LLMs

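Jungin Park, Jiyoung Lee, Kwanghoon Sohn

AI summary: Identifies a continuous token-interface manifold in Video LLMs and proposes V-LynX, a framework that aligns attention responses and statistical distributions through a lightweight auxiliary pathway, integrating new modalities without paired supervision and achieving state-of-the-art performance and efficiency on tasks such as audio-visual QA and 3D reasoning.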

Comments: ICML 2026 Camera-ready

Abstract

This study reports an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, the token interface, that allows visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal datasets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V-LynX achieves state-of-the-art performance and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://github.com/park-jungin/lynx.

2606.00507 2026-06-02 cs.CL

LaSR: Context-Aware Speech Recognition via Latent Reasoning

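Heyang Liu, Ziyang Cheng, Jiayi Huang, Wenyang Xiao, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

Affiliations: Shanghai Jiao Tong University; Ant Group

AI summary: Proposes LaSR, a training paradigm that uses latent reasoning trajectories to strengthen the contextual awareness of speech LLMs, substantially improving recognition of academic terminology without adding latency.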

Abstract

Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, and they struggle to perform speech recognition that effectively reflects the speaker's intent and topical context. In this paper, we propose LaSR (Latent Speech Reasoning), a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition. Furthermore, to effectively benchmark contextual recognition on specialized vocabulary, we propose Spoken Darwin-Science, a large-scale corpus focusing on academic terminology. Preliminary experiments on Fun-Audio-Chat demonstrate that LaSR significantly improves terminology recognition without introducing additional latency and consistently outperforms standard supervised fine-tuning baselines. Our findings highlight the potential of latent reasoning in building efficient, context-aware speech assistants.

2606.00506 2026-06-02 cs.AI cs.LG

EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction

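Dahai Yu, Rongchao Xu, Lin Jiang, Guang Wang

Affiliation: Florida State University

AI summary: Proposes EnergyMamba, a framework that combines a graph-enhanced selective state space model (GE-Mamba) with an adaptive sequential conformalized quantile regression (AS-CQR) module for joint spatiotemporal modeling and uncertainty quantification, improving energy consumption prediction accuracy by about 5% and uncertainty quantification by about 6%.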

DOI: 10.1145/3770855.3818841
Comments: Accepted by KDD 2026 AI4S

Abstract

Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: (1) they usually formulate this task as a purely time-series prediction problem without explicitly modeling the spatial dependencies among different regions, and (2) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events. To advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: (i) a novel Graph-Enhanced Selective State Space Model (GE-Mamba) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and (ii) an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts. We evaluate EnergyMamba on four large-scale real-world datasets from Florida, New York, and California. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state-of-the-art baselines.
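
The AS-CQR module in this abstract builds on conformalized quantile regression. As background, the sketch below shows only the standard split-conformal calibration step; it omits the locally adaptive normalization and online feedback that the abstract describes, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def cqr_offset(
    lo: List[float], hi: List[float], y: List[float], alpha: float
) -> float:
    """Compute the interval correction from a held-out calibration set.

    lo and hi are the predicted lower and upper quantiles, y the true values,
    and alpha (strictly between 0 and 1) the target miscoverage rate.
    """
    # Conformity score: how far each true value falls outside its interval
    # (negative when it lies inside).
    scores = sorted(max(l - t, t - h) for l, h, t in zip(lo, hi, y))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the guarantee needs an infinite interval.
    return scores[rank - 1] if rank <= n else math.inf


def calibrate(lo: float, hi: float, offset: float) -> Tuple[float, float]:
    """Widen (or shrink, if offset is negative) a new predicted interval."""
    return lo - offset, hi + offset
```

A fixed offset like this is valid only when calibration and test data are exchangeable, which is the limitation that adaptive, sequential variants aim to relax under distribution shift.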

2606.00503 2026-06-02 cs.LG cs.AI

TabChange: Precise Attribute Changes in Tabular Data

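Arjun Dahal, Yu Lei, Raghu N. Kacker, Richard Kuhn

Affiliations: The University of Texas at Arlington; National Institute of Standards and Technology; Information Technology Laboratory

AI summary: To address the loss of naturalness when an attribute in tabular data is modified, proposes TabChange, which analyzes the relationships among attributes and uses an adversarial framework to remove attribute information from the latent space, enabling precise and natural attribute modification.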

Abstract

Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modified instance must be both natural and minimally changed from the original instance. This paper addresses the challenge of generating such a modified instance. We identify key limitations in existing approaches: generative models either don't support instance-level attribute editing or, in the case of methods like CVAE, retain attribute information in the latent space, leading to unnecessary modifications. To solve this, we propose TabChange, an approach that analyzes the relationship between the attribute of interest and other attributes in the dataset. If the relationship is weak, it simply flips the attribute; if it is strong, it uses an adversarial framework that removes information about the attribute in the latent space representation. This removal enables precise modifications, making only the necessary adjustments to maintain naturalness. Our experiments across seven datasets show that TabChange generates counterfactuals in attributes that are comparable in naturalness and are more proximal to their original instances. This leads to a higher number of valid counterfactuals and a lower number of invalid counterfactuals compared to the baselines.

2606.00499 2026-06-02 cs.CV

OptiWorld: Optimal Control for Video World Generation under Physical Constraints

OptiWorld: 物理约束下的视频世界生成最优控制

Yu Yuan, Jianhao Yuan, Xijun Wang, Daiqing Li, Liu He, Lu Ling, Stanley H. Chan

发表机构 * Purdue University（普渡大学）； University of Oxford（牛津大学）； SixteenMiles Labs（SixteenMiles 实验室）

AI总结提出OptiWorld框架，在推理时结合经典最优控制与视频生成，通过提取紧凑世界状态、规划最优轨迹并生成条件视频，实现符合物理约束的动态优化。

Comments: Project Page: https://yuyuanspace.com/OptiWorld/

AI中文摘要

视频生成模型正成为一种可扩展的世界模型形式，但它们主要生成合理的运动，而非主动控制或优化底层动态。因此，生成视频中的物体可能遵循不安全、不光滑、低效或物理不一致的轨迹。在这项工作中，我们提出了OptiWorld，一个在推理时将经典最优控制引入视频生成的框架。OptiWorld首先提取紧凑的、与任务相关的世界状态，然后在物理约束下规划最优轨迹，最后基于该轨迹渲染视频。我们将规划表述为连续流形上的几何问题，将3D几何和任务相关的物理约束转化为统一的规划几何。通过添加这一最优控制层，OptiWorld生成具有更优动态的视频，在多个任务中展现出强大潜力，包括目标条件的图像到视频生成、视频动态编辑和反事实生成。

英文摘要

Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively control or optimize the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose OptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.

2606.00496 2026-06-02 cs.LG

Torus Graphs for Large Scale Neural Phase Analysis

大规模神经相位分析的环面图模型

Jack Goffinet, Casey Hanks, David E. Carlson

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出一种随机得分匹配方法，将环面图模型的计算复杂度从O(d^6)降至O(d^2)，使其能处理数千变量，并扩展至隐马尔可夫模型和自回归模型，用于分析脑状态依赖的相位耦合和方向性交互。

Comments: 23 pages, 15 figures; to be published in ICML 2026

AI中文摘要

振荡神经信号（如脑电图和局部场电位）表现出协调跨脑区通信的相位关系。现代记录在多个频率区间捕获数百个通道，但标准相位分析仅限于少数变量。环面图模型是一种相位上的指数族分布，其单变量和成对势函数推广了冯·米塞斯分布，推断振荡之间的结构化关系，但仅建模静态无向依赖，且由于得分匹配推断复杂度为O(d^6)，仅限于约100个变量。我们引入一种随机得分匹配过程，将每次迭代成本降至O(d^2)，使得能够对数千变量的数据集进行推断。这一可扩展基础支持对来自多电极LFPs的1,860个频率-相位特征进行分析，并实现了之前环面图或经典圆形统计无法实现的两种扩展：(i) 捕获状态依赖的相位耦合变化（例如睡眠期间纺锤波相关状态）的环面图隐马尔可夫模型，以及(ii) 通过传递熵估计推断方向性交互的自回归环面图。应用于LFP记录，这些模型揭示了清醒和NREM睡眠之间状态依赖的相位交互模式。它们共同实现了对大脑和认知状态中动态和方向性相位关系的系统性大规模映射。

英文摘要

Oscillatory neural signals such as electroencephalography (EEG) and local field potentials (LFPs) show phase relationships that coordinate communication across brain regions. Modern recordings capture hundreds of channels across many frequency bins, yet standard phase analyses are restricted to only a few variables. The Torus Graph (TG) model, an exponential-family distribution over phases whose univariate and pairwise potentials generalize von Mises distributions, infers principled structure among oscillations but models only static, undirected dependencies and is limited to $\sim \! 100$ variables because its score matching inference scales as $\mathcal{O}(d^{6})$. We introduce a stochastic score matching procedure that reduces the per-iteration cost to $\mathcal{O}(d^{2})$, enabling inference on datasets with thousands of variables. This scalable foundation supports analyses of 1,860 frequency-phase features from multi-electrode LFPs and enables two extensions previously inaccessible to TGs or classical circular statistics: (i) a TG Hidden Markov Model capturing state-dependent phase-coupling changes (e.g., spindle-related states during sleep) and (ii) an autoregressive TG inferring directional interactions via transfer-entropy estimation. Applied to LFP recordings, these models reveal state-dependent phase-interaction patterns between wakefulness and NREM sleep. Together, they enable systematic, large-scale mapping of dynamic and directional phase relationships across brain and cognitive states.

2606.00491 2026-06-02 cs.CV cs.AI

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

CT分割系统的部署前鲁棒性压力测试：使用临床驱动的多损坏增强

CholMin Kang, Jonghyun Chung, Amanpreet Kaur, Nagesh Gulkotwar, Arthi Sivasankaran

发表机构 * Seoul National University（首尔国立大学）； Google Inc.（谷歌公司）

AI总结提出RAMP框架，通过多损坏增强提升CT分割模型在临床异质成像条件下的鲁棒性，显著缩小干净与损坏图像性能差距。

AI中文摘要

基于深度学习的CT分割系统在干净基准图像上通常能达到高精度，但在噪声、分辨率损失、对比度变化、强度偏移和伪影等异质临床成像条件下，其性能可能会下降。这种不稳定性可能限制其在真实医疗成像工作流程中的可靠部署。我们提出鲁棒性增强多损坏流水线（RAMP），这是一个面向鲁棒性的CT分割增强框架。RAMP结合了解剖约束的空间扰动、CT强度变换和随机多损坏组合，使模型在训练过程中暴露于临床可行的图像退化。在两个CT分割评估设置中，RAMP实现了最强的损坏图像性能和最小的干净到损坏鲁棒性差距。在五器官噪声评估基准中，与nnU-Net基线相比，RAMP将平均损坏Dice从0.610提高到0.753，并将鲁棒性差距从0.264降低到0.064。在Abdomen1K中，RAMP将平均损坏Dice从0.633提高到0.789，并将鲁棒性差距从0.290降低到0.070。尽管RAMP未达到最高的干净图像Dice，但它显著减轻了严重图像退化下的最坏情况分割崩溃。这些结果表明，多损坏增强可以作为提高CT分割系统在异质临床环境中可靠性的实用部署前策略。

英文摘要

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.
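
The reported numbers can be cross-checked with a small sketch. It assumes that the "robustness gap" is the mean clean-image Dice minus the mean corrupted-image Dice; the abstract does not spell out the definition, so this is an assumption. The helper names `dice` and `robustness_gap` are illustrative and not taken from the paper.

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice overlap between two binary masks (1.0 means perfect overlap)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def robustness_gap(clean_dice, corrupted_dice):
    """ASSUMED definition: mean clean Dice minus mean corrupted Dice."""
    return float(np.mean(clean_dice) - np.mean(corrupted_dice))

# Under the assumed definition, clean Dice = corrupted Dice + gap.
# Five-organ noisy benchmark, values taken from the abstract:
baseline_clean = 0.610 + 0.264  # implied nnU-Net clean Dice
ramp_clean = 0.753 + 0.064      # implied RAMP clean Dice
```

If the assumption holds, the implied clean-image Dice is about 0.874 for the baseline and about 0.817 for RAMP, which agrees with the abstract's remark that RAMP did not achieve the highest clean-image Dice.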

2606.00487 2026-06-02 cs.AI

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

TAPS: 面向扩散草稿推测解码的目标感知前缀树选择

Zhuoyu Wang, Junnan Huang, Xinyu Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出TAPS方法，通过目标感知的前缀树选择优化扩散模型草稿的验证效率，实现最高7.9倍无损加速。

AI中文摘要

使用扩散模型进行并行草稿是推测解码的一种有前景的方法。通过在单次前向传播中预测多个未来位置的token，扩散草稿器显著降低了草稿延迟。然而，这会将瓶颈转移到验证上：验证单个序列限制了接受长度，而验证大型草稿树会导致过度的目标模型延迟。我们发现了现有草稿树方法中的一个关键不匹配：现有的扩散树方法按边际概率对节点排序，忽略了验证是前缀条件化的。因此，它们可能会验证被拒绝前缀的不可达后代，从而增加延迟而接受增益有限。为了解决这个问题，我们提出了TAPS，一种目标感知的前缀选择方法，将扩散边际转化为路径条件化的接受估计。然后，TAPS在固定的验证预算下选择一个紧凑的前缀封闭子树，改善接受-成本权衡，而不是简单地扩展草稿树。跨不同数据集和模型族的实验表明，TAPS在无损端到端速度上比普通自回归解码最高提升7.9倍，分别比最先进的DFlash和DDTree提升1.36倍和1.74倍。我们的工作可在https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD获取。

英文摘要

Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD
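
To see why ranking by marginal probability can waste verification budget, here is a minimal sketch of budgeted, prefix-closed selection. It is a generic best-first procedure and not the authors' algorithm: it scores each node by the product of drafter probabilities along its path, whereas the abstract does not say how TAPS turns marginals into target-aware acceptance estimates. The tree encoding (`children`, `probs`) and the function name are hypothetical.

```python
import heapq

def select_prefix_closed(children, probs, root, budget):
    """Greedily pick up to `budget` draft nodes by path-conditioned score.

    children: dict mapping a node to its list of child nodes (hypothetical encoding)
    probs:    dict mapping a node to its per-node probability from the drafter
    A node's score is the product of probabilities along its path, so a child
    never outranks its parent and the selection stays prefix-closed.
    """
    selected = []
    # Max-heap via negated scores; the root stands for the already accepted context.
    heap = [(-probs[c], c) for c in children.get(root, [])]
    heapq.heapify(heap)
    while heap and len(selected) < budget:
        neg_score, node = heapq.heappop(heap)
        selected.append(node)
        for child in children.get(node, []):
            # Child path score = parent path score * child probability.
            heapq.heappush(heap, (neg_score * probs[child], child))
    return selected

children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
probs = {"a": 0.6, "b": 0.4, "a1": 0.5, "a2": 0.3, "b1": 0.9}
picked = select_prefix_closed(children, probs, "root", budget=3)
```

In this toy tree, the three largest marginals are `b1`, `a`, and `a1`, which would include `b1` without its parent `b`. The path-conditioned selection instead returns `a`, `b`, `b1`, so every verified node is reachable through verified ancestors.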

2606.00477 2026-06-02 cs.CL cs.CV

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

文本编辑能否泛化到视觉生成？统一多模态模型中的跨模态知识编辑基准

Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； University of Toronto（多伦多大学）； University of Washington（华盛顿大学）

AI总结提出跨模态知识编辑基准UniKE，发现文本编辑在图像生成中效果显著下降（VQA准确率仅18.5%），并提出推理增强参数编辑方法提升跨模态迁移效果。

Comments: Published at ICML 2026; Code and data available at https://github.com/gxx27/UniKE

AI中文摘要

统一多模态模型（UMMs）已成为通用多模态智能的有前途的范式。随着它们在现实世界应用中的部署，有效更新内部知识变得至关重要。虽然知识编辑在纯文本模型中已经成熟，但成功修改文本输出的编辑是否也能迁移到UMMs中的图像生成仍不清楚。为了研究这个问题，我们引入了UniKE，这是第一个用于UMMs中跨模态知识编辑的基准，包含2,971个编辑主题，涵盖属性和关系编辑。使用基于VQA的视觉验证，我们揭示了一个显著的模态差距：文本侧的有效性可以达到约92%，而直接图像生成下的最佳整体VQA准确率仅为18.5%。我们进一步提出了推理增强参数编辑，它在生成前显式激活编辑后的知识，并提高了所有评估模型-编辑器对的整体VQA准确率，提升高达18.6个百分点。机制分析表明，这种差距与编辑后的文本表示与视觉生成的条件路径之间的部分对齐有关，其中足以用于文本输出的编辑可能仍然太弱或未对齐，无法引导图像合成。这些发现表明，文本知识编辑不能保证可靠的跨模态迁移，并激励了模态感知的编辑方法。我们的代码和数据可在https://github.com/gxx27/UniKE获取。

英文摘要

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.

2606.00476 2026-06-02 cs.AI

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

做他们所说的，而不是他们所推理的：定位LLM智能体中的忠实性差距

Yufeng Wang

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结通过将忠实性差距分解为推理-结论和结论-行动两个步骤，在可控的德州扑克模拟器中研究LLM智能体是否按照其陈述的推理行动。

2606.00472 2026-06-02 cs.CV cs.AI cs.HC cs.LG

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

CodeCytos: 通过代码增强的智能体动作空间实现AI辅助空间分子成像分析

Hung Q. Vo, Huy Q. Vo, Son T. Ly, Zhihao Wan, Anh-Vu Nguyen, Hong Zhao, Jianting Sheng, Stephen T. C. Wong, Hien V. Nguyen

发表机构 * University of Houston, Department of Electrical and Computer Engineering（休斯顿大学电气与计算机工程系）； Houston Methodist Hospital, Department of Systems Medicine and Biomedical Engineering（休斯顿卫理公会医院系统医学与生物医学工程系）

AI总结提出CodeCytos框架，通过代码驱动的推理智能体实现空间分子成像数据的动态可编程分析，提升自动化与定制化能力，并在多种组织类型数据集上验证其优于基线方法。

AI中文摘要

传统的组织图像分析软件为细胞分析提供了基础功能，包括分割、基本形态特征提取和空间组织分析。然而，这些工具通常需要手动干预，且与代码驱动的自动化集成不佳，限制了复杂空间组织研究的效率和可扩展性。此外，它们对自定义分析的灵活性有限，通常只支持一组固定的预实现空间细胞特征。为了解决这些限制，我们提出了CodeCytos，一个基于编码的推理智能体框架，能够实现与空间分子成像数据的动态、可编程交互，以提高自动化和定制化。CodeCytos旨在简化自定义空间细胞特征的探索，并适应多样化的研究需求。我们通过四个来自不同组织类型（额叶皮层、非小细胞肺癌、胰腺和扁桃体）的专家精选数据集案例研究展示了其实用性。我们在现实的最小提示设置下评估CodeCytos，其中生物科学家提出简单问题，没有任务特定指令或关于空间细胞分析的上下文信息，并基准测试了多个具有强大编码能力的LLM骨干。我们进一步表明，结合定制的、领域无关的少样本上下文编码推理示例（空间分析领域外随机采样的演示）可以显著提高性能，而无需昂贵的、专家制作的领域内演示。总体而言，CodeCytos优于基线方法，突显了代码动作智能体在空间分子成像中辅助自定义特征探索和加速生物标志物发现的潜力。

英文摘要

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.

2606.00471 2026-06-02 cs.CV

MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory Forecasting

MUSCLE-NET：面向行人轨迹预测的预测多尺度感知网络

Yu Liu, Ming Huang, Xiao Ren, Zhijie Liu, Youfu Li, He Kong

发表机构 * Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, School of Automation and Intelligent Manufacturing, Southern University of Science and Technology (SUSTech), Shenzhen（广东省全驱系统控制理论与技术重点实验室，自动化与智能制造学院，南方科技大学（SUSTech），深圳）； Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China（香港城市大学机械工程系，香港特别行政区，中国）

AI总结提出MUSCLE-NET，通过多尺度多模态特征提取和尺度自适应预测机制，解决现有方法对观测信息利用不足及忽视未来运动尺度依赖的问题，在JAAD和PIE数据集上取得竞争性能。

Comments: This manuscript has been accepted to the IEEE Transactions on Intelligent Transportation Systems as a regular paper

AI中文摘要

准确的行人轨迹预测对于自动驾驶和智能交通系统中的安全导航至关重要。尽管近期方法取得了显著进展，但大多数现有方法在充分利用多样化观测方面存在局限，且往往忽视未来运动的尺度依赖性，无论底层运动动态如何，都统一处理多尺度特征。这限制了它们在多样化行人行为中的鲁棒性。为解决这些挑战，我们提出了一种用于行人轨迹预测的预测多尺度感知网络（MUSCLE-NET），该网络将互补的多模态线索与尺度自适应预测机制相结合。所提出的框架基于多尺度多模态特征提取（MMFE）模块，该模块结合了多尺度表示、模态感知重校准和方向性跨模态融合，从边界框、速度和姿态信息中构建语义对齐的表示。基于这些特征，多尺度增强层次预测（MEHP）模块通过概率粗预测器、尺度对齐融合和渐进细化，执行预测感知的未来运动细化，自适应地选择尺度相关线索以减轻空间漂移。在JAAD和PIE基准上的大量实验表明，所提出的MUSCLE-Net与最先进的轨迹预测方法相比，取得了竞争性能并持续改进。

英文摘要

Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for Pedestrian Trajectory Forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-Net achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.

2606.00470 2026-06-02 cs.RO cond-mat.soft

A passive universal grasping mechanism based on an everting shell

基于外翻壳体的被动通用抓取机构

Mythra V. S. Balakuntala, Safvan Palathingal, G. K. Ananthasuresh

发表机构 * Indian Institute of Science（印度科学研究院）

AI总结提出一种基于弹性可变形双稳态壳体外翻的被动单片柔性抓取机构，通过梁段构成的抓取臂与外翻壳体协同工作，实现对任意形状刚性物体的包络抓取。

DOI: 10.1007/978-981-15-4477-4_43

AI中文摘要

概念化了一种基于弹性可变形双稳态壳体外翻的被动单片柔性抓取机构。它由梁段构成的抓取臂与外翻壳体协同工作。该抓取器能够抓取任意形状的刚性物体，最大尺寸和重量受限于机构设计。双稳态壳体在接触物体时外翻，使抓取臂包裹物体形成封闭空间。机构保持该构型直到再次被驱动，使壳体恢复原始构型，从而打开封闭空间释放物体。臂的刚度决定机构的有效载荷，臂的尺寸决定可抓取的最大物体。臂具有分布式柔性，可适应物体形状而不施加过大压力。

英文摘要

A passive monolithic compliant grasping mechanism that works based on the eversion of an elastically deformable bistable shell is conceptualized. It comprises grasping arms made of beam segments that work in conjunction with the everting shell. The grasper is capable of picking up a stiff object of any shape up to a maximum size and weight. The bistable shell everts upon contact with the object, enabling the grasping arms to envelop the object and form an enclosure. The mechanism then stays in that configuration until it is actuated again to turn the shell back to its original configuration, thereby opening the enclosure to release the object. The stiffness of the arms decides the payload of the mechanism. The size of the arms decides the largest object that can be grasped and held. The arms have distributed compliance so that they can conform to the shape of the object without applying undue force on it.

2606.00467 2026-06-02 cs.CL cs.AI cs.LG stat.ML

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

论大语言模型适应性的局限：模型内化先验对标注任务性能的影响

Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

发表机构 * University of Washington（华盛顿大学）

AI总结通过毒性检测实验，研究大语言模型内化先验与指令交互的三个维度，发现近三分之二的零样本错误难以通过提示纠正，并引入定义特定熟悉度（DSF）指标，证明其与性能正相关，而文本记忆指标则无此关联。

Comments: Accepted at ICML 2026 (Oral & Spotlight); PMLR vol. 306. 9 pages, 4 figures

AI中文摘要

大语言模型（LLMs）越来越多地用于零样本标注和LLM-as-a-judge任务，但其可靠性取决于模型内化先验与用户提供指令的交互方式。我们研究了这种交互的三个维度：（1）LLM对数据和任务定义的熟悉程度如何影响性能；（2）提示中的额外信息能在多大程度上纠正零样本错误（“决策粘性”）；（3）模型对错误任务定义的敏感性。通过在多种数据集（涵盖社交媒体、游戏、新闻和论坛）上进行毒性检测实验，使用密集模型和混合专家模型，我们发现近三分之二的零样本错误难以纠正，提示纠正的总体挽救率（初始错误中被纠正的比例）仅为34.8%。高置信度错误尤其难以纠正。当给出错误定义时，LLM会遵循这些定义，同时保持与正确定义条件下相同的置信水平。关键的是，我们引入了定义特定熟悉度（DSF），它衡量模型内部概念与任务定义之间的一致性。在控制数据集层面的混杂因素后，DSF与模型性能呈正相关（偏相关系数r=+0.41），而三种不同的记忆指标（ROUGE-L、BERTScore和嵌入余弦相似度）均未显示正相关。这些发现揭示了基于提示的纠正在标注任务中的局限性，强调了定义对齐比文本级记忆更重要。

英文摘要

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
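
The rescue rate is defined in the abstract as the fraction of initial errors corrected by prompting. A minimal sketch of that computation follows; the function name and the list-of-booleans input format are assumptions made for illustration.

```python
def rescue_rate(zero_shot_correct, prompted_correct):
    """Fraction of zero-shot errors that become correct after extra prompting.

    Both arguments are equal-length sequences of booleans, one per item.
    Returns NaN when there are no zero-shot errors to rescue.
    """
    errors = [i for i, ok in enumerate(zero_shot_correct) if not ok]
    if not errors:
        return float("nan")
    rescued = sum(1 for i in errors if prompted_correct[i])
    return rescued / len(errors)

zero_shot = [True, False, False, False, True]
prompted = [True, True, False, False, False]
rate = rescue_rate(zero_shot, prompted)  # 1 of 3 initial errors rescued
```

Items that were correct zero-shot but wrong after prompting (the last item here) do not affect the rate, since the denominator counts only initial errors. A rescue rate of 34.8% therefore leaves roughly two-thirds of the initial errors uncorrected, as the abstract states.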

2606.00462 2026-06-02 cs.CL cs.AI cs.LG

Short-form Text Rewriting with Phi Silica

短文本改写与 Phi Silica

Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou, Sadid Hasan

发表机构 * IEEE ICAD

AI总结本研究通过数据集整理、提示蒸馏、参数高效微调和评估，将小语言模型 Phi Silica 适配于短文本改写任务，结果表明微调提高了语义保真度、减少了幻觉并提升了与 GPT-5-chat 改写的偏好胜率。

Comments: 6 pages

AI中文摘要

短文本改写是释义的一种受限变体，其中有限的上下文和高语义密度几乎没有留下变化空间。虽然大型语言模型在一般释义任务上表现良好，但小语言模型（SLM）在短文本场景中常常在语义保真度和幻觉鲁棒性方面遇到困难。在这项工作中，我们提出了一项实证研究，通过数据集整理、提示蒸馏、参数高效微调和评估，将小语言模型 Phi Silica 适配于短文本改写。我们从公开的幻灯片中整理了一个简短的演示风格文本数据集，并使用 GPT-5-chat 来生成改写监督以及进行 LLM 作为评判者的评估。我们的结果表明，微调提高了语义保真度，减少了幻觉，并提高了与 GPT-5-chat 改写的偏好胜率。这些发现表明，针对 SLM 的定向适配可以显著缩小与云模型的差距，并为将 SLM 适配于精度关键的改写任务提供实用指导。

英文摘要

Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. While large language models perform well on general paraphrasing, small language models (SLMs) often struggle with semantic fidelity and hallucination robustness in short-form settings. In this work, we present an empirical study of adapting an SLM, Phi Silica, for short-form rewrite through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. We curate a dataset of short presentation-style text from public slide decks and use GPT-5-chat both to generate rewrite supervision and to conduct LLM-as-a-judge evaluation. Our results show that finetuning improves semantic fidelity, reduces hallucinations, and increases preference win rate against GPT-5-chat rewrites. The findings suggest that targeted adaptation for SLMs can substantially narrow the gap to cloud models and provide practical guidance for adapting SLMs to precision-critical rewrite tasks.

2606.00460 2026-06-02 cs.CL eess.AS

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

SALSA: 通过学习的引导激活向量实现语音感知的LLM适配

Yekaterina Yegorova, Argyrios Gerogiannis, Haolong Zheng, Julia Hockenmaier, Chang D. Yoo, Mark A. Hasegawa-Johnson

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Korea Advanced Institute of Science and Technology（韩国科学技术院）

AI总结提出SALSA方法，通过监督学习优化逐层引导向量，在儿童语音、多语种语音和普通话-英语代码切换基准上显著提升零样本推理和语音上下文学习性能，最高相对提升46.8%。

AI中文摘要

语音感知的大语言模型通常在域外场景中泛化能力较差。我们提出SALSA（通过学习的引导激活实现语音感知的LLM适配），一种轻量级适配方法，学习逐层引导向量。与通常依赖对比激活差异的引导方法不同，SALSA直接使用监督目标优化引导向量。在儿童语音、多语种语音和普通话-英语代码切换基准上，SALSA相比零样本推理和语音上下文学习基线显著提升性能，相对于零样本最高实现46.8%的相对改进。进一步分析表明，引导编码器（尤其是后层）比引导LLM主干更有效。这些发现表明，引导通过调整高层声学和语音表示以更好地与预训练语言模型表示空间对齐，而不是通过修改解码器本身，从而提升下游ASR性能。

英文摘要

Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. Unlike commonly used steering approaches that rely on contrastive activation differences, SALSA directly optimizes steering vectors using a supervised objective. Across children's speech, multilingual speech, and Mandarin-English code-switching benchmarks, SALSA substantially improves performance over zero-shot inference and speech in-context learning baselines, achieving up to 46.8% relative improvements over zero-shot. Analysis further demonstrates that steering the encoder, particularly the later layers, is more effective than steering the LLM backbone. These findings suggest that steering improves downstream ASR performance by adapting higher-level acoustic and phonetic representations to better align with the pretrained language model representation space, rather than by modifying the decoder itself.

2606.00459 2026-06-02 cs.RO cs.SY eess.SY

Adaptive PD Gains for Energy-Conscious Control in Physical Human-Robot Interaction

物理人机交互中节能控制的自适应PD增益

Danyal Saqib, Francisco Andrade Chavez, Marie Charbonneau

发表机构 * University of Calgary（卡尔加里大学）； University of Waterloo RoboHub（滑铁卢大学RoboHub）

AI总结提出一种自适应PD控制器，通过限制机器人动能和势能实现安全物理人机交互，并给出稳定性证明与实验验证。

DOI: 10.21428/d82e957c.37d70c9b
Journal ref: Proceedings of the 23rd Conference on Robots and Vision, 2026

AI中文摘要

柔顺力或力矩控制是常被研究以实现安全物理人机交互（pHRI）的方法。然而，这些方法存在局限性。力控制要求机器人配备外部力传感器以跟踪施加力的幅度和方向。力矩控制需要在每个关节进行力矩感知或估计。由于并非所有机器人都具备这些条件，基于能量的方法提供了一种有前景的替代方案。此类方法旨在通过限制机器人的机械能来实现安全的pHRI。当前利用基于能量方法的方案往往实现复杂，且部分可能需要进一步稳定性验证。因此，我们提出一种自适应比例-微分（PD）控制器，能够在任意给定限制下限制机器人的能量，以实现安全的pHRI。所提出的控制器可以同时限制机器人的动能和势能，并且控制器增益的行为可通过多种参数进行塑造，精确界定截止限制和锐度。我们为控制器构建了稳定性证明，并定义了确保控制器稳定性的条件。所提出控制器的行为和柔顺性在PAL Robotics的TALOS机器人上进行了仿真和硬件测试，验证了控制器预期的柔顺和能量限制行为。

英文摘要

Compliant force or torque control are approaches often investigated to achieve safe physical human-robot interaction (pHRI). However, these approaches have limitations. Force control requires a robot to be equipped with external force sensors to track the amplitude and direction of applied forces. Torque control requires torque sensing or estimation in each joint. As this is not available on every robot, energy-based approaches offer a promising alternative. Such approaches aim to achieve safe pHRI by limiting the mechanical energy of the robot. Current schemes leveraging an energy-based approach tend to have a complex implementation, and some may require further stability verification. We hence propose an adaptive proportional-derivative (PD) controller that can limit a robot's energy under any given limit to achieve safe pHRI. The proposed controller can limit both the kinetic and potential energy of a robot, and the behaviour of the controller gains can be shaped using various parameters, defining precisely the cutoff limit and sharpness. We construct a stability proof for the controller and define a condition to ensure the controller's stability. The proposed controller's behaviour and compliance are tested on the TALOS robot from PAL Robotics both in simulation and on hardware, verifying the expected compliant and energy-limiting behaviour of the controller.
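
The idea of gains that fade as energy approaches a limit can be illustrated with a single-joint sketch. The logistic gain schedule below is a hypothetical stand-in chosen only because it exposes a cutoff and a sharpness parameter; the abstract does not give the paper's actual gain law, and this sketch handles kinetic energy only and carries no stability guarantee.

```python
import math

def kinetic_energy(mass, velocity):
    """Kinetic energy of a single point mass (stand-in for one joint)."""
    return 0.5 * mass * velocity ** 2

def gain_scale(energy, cutoff, sharpness):
    """HYPOTHETICAL logistic shaping: near 1 below cutoff, 0.5 at cutoff, near 0 above."""
    x = sharpness * (energy - cutoff)
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(x))

def pd_command(q, dq, q_des, kp, kd, mass, cutoff, sharpness):
    """PD command whose proportional term fades as kinetic energy nears the cutoff."""
    scale = gain_scale(kinetic_energy(mass, dq), cutoff, sharpness)
    return scale * kp * (q_des - q) - kd * dq

# At rest the proportional term is almost fully active.
slow = pd_command(0.0, 0.0, 1.0, kp=100.0, kd=10.0, mass=2.0, cutoff=1.0, sharpness=10.0)
# At high speed (energy 4.0, well above the cutoff) only damping remains.
fast = pd_command(0.0, 2.0, 1.0, kp=100.0, kd=10.0, mass=2.0, cutoff=1.0, sharpness=10.0)
```

A larger `sharpness` makes the transition around `cutoff` more abrupt, while a smaller value spreads it over a wider energy range.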

2606.00452 2026-06-02 cs.CV cs.GR

Beyond Static Gaussians: An Empirical Investigation of Architectural Paradigms for Dynamic 3D Scene Reconstruction

超越静态高斯：动态3D场景重建架构范式的实证研究

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo（滑铁卢大学）

AI总结本文通过实证比较结构引导与高斯中心两种动态3D高斯溅射范式，揭示重建质量/紧凑性与渲染速度之间的根本权衡。

DOI: 10.15353/jcvis.v11i1.10019
Journal ref: Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025, p. 99
Comments: Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)

AI中文摘要

通过3D高斯溅射（3DGS）进行动态场景重建已成为表示演化环境的一种引人注目的方法，但理解不同方法之间的权衡仍然至关重要。本文对动态3DGS方法进行了全面分析，将其分为两种范式：结构引导方法，利用辅助表示（变形场、规范空间、网格）来建模时间变化；以及高斯中心方法，通过连续函数或4D表示将动态直接编码到基元中。我们在D-NeRF基准上评估了两种范式的代表性方法。我们的发现表明，结构引导方法实现了优越的重建保真度和紧凑的模型大小，而高斯中心方法则表现出显著更高的渲染速度，能够实现实时性能，但质量变异性更大且可能产生大量存储开销。该分析突出了重建质量/紧凑性与渲染速度之间的根本权衡，为动态场景重建的未来研究和应用开发提供了见解。

英文摘要

Dynamic scene reconstruction via 3D Gaussian Splatting (3DGS) has emerged as a compelling approach for representing evolving environments, yet understanding trade-offs between methodologies remains crucial. This paper presents a comprehensive analysis of dynamic 3DGS methods, categorizing them into two paradigms: structure-guided methods employing auxiliary representations (deformation fields, canonical spaces, grids) to model temporal changes, and gaussian-centric methods encoding dynamics directly into primitives via continuous functions or 4D representations. We evaluate representative methods from both paradigms on the D-NeRF benchmark. Our findings reveal that structure-guided methods achieve superior reconstruction fidelity and compact model sizes, while gaussian-centric approaches demonstrate significantly higher rendering speeds enabling real-time performance, though with greater quality variability and potentially substantial storage overhead. This analysis highlights a fundamental trade-off between reconstruction quality/compactness versus rendering speed, providing insights to guide future research and application development in dynamic scene reconstruction.

2606.00451 2026-06-02 cs.CL

ProtStructQA: A Denotation Threshold in Protein Structural Reasoning

ProtStructQA: 蛋白质结构推理中的指称阈值

Aravind Mandiga, Guoming Li, Jin Lu, Ismailcem Budak Arpinar, Khaled Rasheed, Samuel E. Aggrey

发表机构 * University of Georgia（佐治亚大学）

AI总结提出可执行基准ProtStructQA，通过将自然语言问题编译为DSL程序并在AlphaFold结构上执行来评估蛋白质语言模型，发现模型在1.7B到4B参数之间存在指称阈值，低于该阈值时工具辅助推理占优，高于该阈值时思维链成为最强策略。

AI中文摘要

蛋白质语言系统通常通过是否生成合理的生物学文本来评估，但结构问题具有更清晰的语义：它表示3D坐标系中的测量值。我们引入ProtStructQA，一个可执行的蛋白质结构问答基准，其中每个自然语言问题由隐藏的类型化领域特定语言（DSL）程序生成，答案通过在AlphaFold预测的结构上执行该程序获得。ProtStructQA发布了382.2K个问题，涵盖置信度、距离、预测对齐误差（PAE）、溶剂暴露、二级结构、拓扑和接触，以及保留的组合：一个包含来自四个物种的10K个蛋白质的330K活跃基准，加上一个52.2K的硬负例鲁棒性池。无需微调，我们在直接提示、思维链、语法约束可执行投票、带思维链的可执行投票以及多轮ReAct风格工具使用下评估了Qwen3模型（0.6B至8B），并在Gemma-3-1B和Gemma-3-12B上复现了主要发现。我们发现Qwen3-1.7B和Qwen3-4B之间存在一个能力依赖的指称阈值：低于该阈值时，工具中介的ReAct占主导，因为模型常常无法生成可执行的指称；高于该阈值时，思维链从大多有害转变为强烈有益，并成为大多数分割上的最强策略。解析失败和家族级分析表明，该阈值是从不可解析语言到可执行结构指称的转变，而语法和执行对PAE和二级结构查询仍然具有选择性价值。ProtStructQA将科学问答重新定义为从语言到测量的编译，并为语言模型何时能将单词映射到可执行的3D结构测量提供了诊断测试平台。

英文摘要

Protein-language systems are often evaluated by whether they generate plausible biological text, but a structural question has a sharper semantics: it denotes a measurement in a 3D coordinate system. We introduce ProtStructQA, an executable benchmark for protein structural question answering in which each natural-language question is generated from a hidden typed domain-specific language (DSL) program and the answer is obtained by executing that program on an AlphaFold-predicted structure. ProtStructQA releases 382.2K questions covering confidence, distances, predicted aligned error (PAE), solvent exposure, secondary structure, topology and contacts, and held-out compositions: a 330K active benchmark over 10K proteins from four species, plus a 52.2K hard-negative robustness pool. Without fine-tuning, we evaluate Qwen3 models from 0.6B to 8B under direct prompting, chain-of-thought, grammar-constrained executable voting, executable voting with chain-of-thought, and multi-turn ReAct-style tool use, and replicate the headline finding on Gemma-3-1B and Gemma-3-12B. We find a capability-dependent denotation threshold between Qwen3-1.7B and Qwen3-4B: below it, tool-mediated ReAct dominates because models often fail to produce executable denotations; above it, chain-of-thought flips from mostly harmful to strongly beneficial and becomes the strongest strategy on most splits. Parse-failure and family-level analyses show that the threshold is a transition from unparseable language to executable structural denotation, while grammar and execution remain selectively valuable for PAE and secondary-structure queries. ProtStructQA reframes scientific QA as compilation from language to measurement and provides a diagnostic testbed for when language models can map words to executable 3D structural measurements.
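
The notion that a structural question denotes a measurement can be made concrete with a toy typed DSL. The node types `Distance` and `LessThan`, the `execute` function, and the coordinate dictionary below are all hypothetical; the benchmark's real DSL is hidden and is not described in the abstract, so this is only a sketch of the general idea.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Distance:
    """Hypothetical DSL node: C-alpha distance between two residue indices."""
    i: int
    j: int

@dataclass(frozen=True)
class LessThan:
    """Hypothetical DSL node: compare a distance sub-program to a threshold."""
    expr: Distance
    threshold: float

def execute(program, ca_coords):
    """Evaluate a program against a mapping of residue index to (x, y, z)."""
    if isinstance(program, Distance):
        return math.dist(ca_coords[program.i], ca_coords[program.j])
    if isinstance(program, LessThan):
        return execute(program.expr, ca_coords) < program.threshold
    raise TypeError(f"unsupported program node: {program!r}")

# Toy coordinates standing in for a predicted structure.
coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0)}
# "Are residues 1 and 2 within 8 angstroms of each other?"
answer = execute(LessThan(Distance(1, 2), 8.0), coords)
```

Because the answer comes from executing the program, a model's output can be scored against the measurement itself rather than against the plausibility of its wording.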
