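arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.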

2605.25888 2026-05-26 cs.LG math.OC

Optimal and Order-optimal Gated Priority-based Greedy Policies for Two-layer Multi-item Order Fulfillment

Xi Chen, Yuze Chen, Ziyi Chen, Yuan Zhou

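AI summary: Addresses real-time fulfillment decisions for e-commerce firms in a two-layer distribution network, proposes Gated Priority-based Greedy policies, proves that their competitive ratios are optimal or order-optimal, and validates their performance in numerical experiments.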

Abstract

We study how an e-commerce firm should make real-time fulfillment decisions in a two-layer distribution network when multi-item customer orders arrive sequentially and future demand is unknown. The central managerial tension is whether to use scarce front distribution center (FDC) inventory to save current fulfillment cost or preserve that inventory for future orders that may be more valuable to serve locally. We formulate an adversarial online model with multiple FDCs, one regional distribution center (RDC), multi-unit multi-item orders, and item-specific and time-varying variable costs. Our theoretical objective is to characterize when simple, interpretable, and implementable fulfillment rules can perform nearly as well as an optimal clairvoyant planner. We develop a family of Gated Priority-based Greedy policies, derive competitive-ratio guarantees under both time-varying and time-invariant cost structures, and establish matching or near-matching lower bounds for any online algorithm. Numerical experiments show that the proposed policies perform strongly relative to generalized myopic and forecast-based benchmarks. The analysis yields managerial guidance on when local inventory should be protected, when splitting orders is worth the fixed-cost burden, and how the relative magnitudes of fixed and variable costs determine the value of more sophisticated optimization.

2605.25882 2026-05-26 cs.LG

Conformalised imprecise inference for robust extrapolation under limited data

Yu Chen, Scott Ferson

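AI summary: Proposes a model-agnostic conformalised imprecise inference framework that adds imprecision and distance awareness, maintaining coverage under distributional shift while adaptively expanding uncertainty, for robust extrapolation under limited data.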

Comments: 10 pages, 5 figures

Abstract

Recent advances in uncertainty quantification increasingly emphasise the distinction between aleatory and epistemic uncertainty in machine learning, motivating the need for more unified frameworks. However, despite much progress in producing reliable predictions, existing methods often lack rigorous guarantees when generalising beyond the training domain. We propose a conformalised imprecise inference framework for robust extrapolation, which is model-agnostic and augments predictive models with imprecision and distance awareness. The proposed approach yields imprecise predictions (probability boxes) that remain valid under distributional shift, maintaining coverage while adaptively expanding uncertainty in extrapolation regimes. Experiments on synthetic and benchmark datasets demonstrate improved robustness and reliable coverage compared to standard probabilistic approaches, particularly under limited data.
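
The abstract does not spell out the construction, so the following is only a minimal sketch of the standard split-conformal building block that such a framework could start from. The distance-based widening in `interval` (the `dist` and `scale` parameters) is a hypothetical illustration of "distance awareness", not the authors' method, and it produces a plain interval rather than a probability box.

```python
import math
import numpy as np

def conformal_quantile(residuals, alpha):
    """Finite-sample (1 - alpha) quantile of absolute calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(residuals)[rank - 1])

def interval(pred, q, dist=0.0, scale=1.0):
    """Prediction interval around pred with half-width q.

    Hypothetical widening: the half-width grows linearly with `dist`,
    an assumed distance of the query from the training data.
    """
    half = q * (1.0 + scale * dist)
    return pred - half, pred + half
```

With `dist=0` this reduces to the usual split-conformal interval; widening with distance can only enlarge the interval, so it never shrinks the in-distribution coverage of the base procedure.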

2605.25880 2026-05-26 cs.LG

The Quantization Benefits of Residual-Free Transformers

Yiping Ji, Mahalakshmi Sabanayagam, Peyman Moghadam, Hemanth Saratchandran, Simon Lucey

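AI summary: By comparing residual and residual-free transformers, the paper finds that residual connections increase the non-Gaussianity of activations and hence quantization error, whereas residual-free transformers, made trainable through techniques such as orthogonal initialization, keep activations near-Gaussian and are markedly more robust to low-bit quantization, revealing an accuracy-compressibility trade-off.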

Comments: Under review

Abstract

Large-scale transformer training and deployment are increasingly constrained by the transfer of activations, gradients, and optimizer states across accelerators. Low-bit quantization offers a natural remedy, but transformer activations are often heavy-tailed and outlier-dominated, making simple quantization highly lossy. We show that this difficulty is not only a property of the quantizer, but also of the architecture. Specifically, residual connections can drive transformer activations away from Gaussianity during training. Using controlled comparisons between residual and residual-free transformers, we demonstrate that this effect leads to substantially higher quantization error and accuracy degradation at low precision in residual models. We explain the phenomenon through an excess kurtosis analysis, showing that residual mixing can amplify non-Gaussianity, whereas dense mixing in residual-free models contracts non-Gaussianity. We then show that residual-free transformers can be made trainable using orthogonal initialization, spectral or second-order optimization, and depth-aware scaling of attention temperature. In language tasks, while there is a small drop in full precision performance, these models retain near-Gaussian activations and exhibit significantly improved robustness to low-bit quantization. Our results identify an accuracy-compressibility trade-off in transformer design and motivate architecture-level approaches to quantization-friendly foundation models.
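
To make the link between excess kurtosis and quantization error concrete, here is a small sketch on synthetic data. It assumes a symmetric uniform quantizer with absmax scaling and uses a Student-t sample as a stand-in for heavy-tailed activations; neither choice is taken from the paper.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

def quantization_mse(x, bits=4):
    """Relative MSE of symmetric uniform quantization with absmax scaling."""
    levels = 2 ** (bits - 1) - 1           # e.g. 7 positive levels at 4 bits
    scale = np.abs(x).max() / levels       # a single outlier stretches the grid
    x_hat = np.round(x / scale).clip(-levels, levels) * scale
    return float(np.mean((x - x_hat) ** 2) / np.mean(x ** 2))

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(100_000)
# Assumed stand-in for outlier-dominated activations (not data from the paper).
heavy = rng.standard_t(df=3, size=100_000)

for name, x in [("gaussian", gaussian), ("heavy-tailed", heavy)]:
    print(name, excess_kurtosis(x), quantization_mse(x))
```

Because absmax scaling sets the step size from the largest value, a few outliers force most of the remaining values onto a handful of grid points, which is why the heavy-tailed sample incurs the larger relative error.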

2605.25878 2026-05-26 eess.IV cs.CV

A Clinically Validated Foundation Model for Comprehensive Lung Pathology Interpretation

Zhengrui Guo, Zhengyu Zhang, Jiabo Ma, Yihui Wang, Fengtao Zhou, Yingxue Xu, Ling Liang, Chenglong Zhao, Qi Xie, Jinbang Li, Shujing Guo, Fangyi Han, Zhijian Cen, Ziyi Liu, Cheng Jin, Junlin Hou, Zhixuan Chen, Yu Cai, Lijuan Qu, Shifu Chen, Yueping Liu, Zhe Wang, Xiuming Zhang, Muyan Cai, Li Liang, Hao Chen

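AI summary: Introduces PulmoFoundation, a lung pathology foundation model built on Virchow2 through subspecialty-specific pretraining on about 40,000 H&E-stained whole-slide images; validated on 32 clinical tasks, in a prospective study, and in a crossover randomized controlled trial, it yields significant gains in diagnostic accuracy, efficiency, and inter-rater agreement.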

Abstract

Pathological assessment guides lung cancer diagnosis, treatment selection, and prognostic evaluation, yet current CPath approaches rely on task-specific models for isolated objectives. Although pan-cancer foundation models offer versatility, they lack subspecialty-level depth and have not been evaluated across clinical workflows or prospectively validated in real-world settings. We introduce PulmoFoundation, a multi-center, prospectively validated, randomized controlled trial (RCT)-evaluated foundation model for comprehensive lung pathology assessment across pre-operative, intra-operative, and post-operative care. Built upon Virchow2 via subspecialty-specific pretraining using ~40,000 diagnostic H&E-stained whole-slide images (WSIs), PulmoFoundation was systematically evaluated on ~26,000 WSIs across 32 clinically relevant tasks. In addition to accurately predicting molecular markers and patient survival, our model achieves clinical-grade performance in core diagnostic tasks across biopsy, frozen section, and surgical resection slides. In a registered prospective study of 1,357 patients across 11 diagnostic tasks, our model achieved an average AUC of 92.3%. Using pre-specified triage thresholds, PulmoFoundation could reduce additional second-review burden for 68.8% of biopsies and 83.0% of frozen sections, and defer 44.5% of IHC stain orders, with PPVs of 1.0, 0.991, and 0.966. Beyond prospective validation, we conducted a crossover RCT with eight pathologists, in which AI assistance improved diagnostic accuracy across 4,928 case-reader pairs (91.7% w/ AI vs. 83.8% w/o AI). AI assistance also reduced median diagnostic time by 19.6%, increased diagnostic confidence by 8.7%, and improved inter-rater agreement from moderate (kappa = 0.56) to substantial (kappa = 0.76). Together, these evaluations support PulmoFoundation as a clinically validated decision-support system for lung pathology.
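
The kappa values quoted above measure agreement beyond chance. As a reading aid, here is a minimal sketch of Cohen's kappa for two raters; the abstract does not say which kappa variant was used for the eight pathologists, so the two-rater form is an assumption.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)     # agreement expected by chance
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

Under the commonly used Landis and Koch bands, values of 0.41 to 0.60 are read as moderate and 0.61 to 0.80 as substantial agreement, which matches the wording used for 0.56 and 0.76.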

2605.25876 2026-05-26 cs.CV

DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation

Jiaying Qian, Ziheng Jia, Qian Zhang, Zicheng Zhang, Jiayi Guo, Junqi Zhang, Guangtao Zhai, Xiongkuo Min

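AI summary: To meet users' need for dynamic, fine-grained evaluation criteria in text-to-image generation, proposes DyCoRM, a dynamic criterion-aware reward model, and builds the DyCoDataset-20K dataset and the DyCoBench-1K benchmark; together with criterion-aware preference comparison and the DyCoPick selection method, this forms the first reward modeling framework for dynamic, fine-grained evaluation.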

Abstract

With the continued advancement of text-to-image (T2I) generation, producing high-quality images is becoming increasingly attainable; consequently, user demands are shifting toward images that better satisfy their specific requirements. As reward models play an increasingly important role in assessing whether generated images align with user preference, this trend introduces an important challenge for reward modeling: rather than relying solely on static and general evaluation dimensions, reward models should account for the task-relevant and fine-grained criteria through which users assess whether generated images meet their specific requirements. To address this challenge, we propose DyCoRM, a dynamic, criterion-aware reward model that grounds task-relevant criteria and performs criterion-aware preference comparison. To support this setting, we construct DyCoDataset-20K, which provides dynamic criteria together with criterion-level annotations, and further derive DyCoBench-1K, a benchmark for systematically evaluating reward models under dynamic criteria. We further introduce DyCoPick, which applies criterion-aware reward modeling to selecting T2I images. Our contributions establish the first reward modeling framework for dynamic and fine-grained evaluation and practical application in T2I generation.

2605.25874 2026-05-26 cs.CV

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding

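AI summary: Introduces WBench, a comprehensive multi-turn benchmark with five dimensions, 289 test cases, and 1,058 interaction turns for systematically evaluating interactive world models, and finds that existing models perform unevenly across dimensions.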

Comments: Technical report of WBench. Homepage: https://meituan-longcat.github.io/WBench/

Abstract

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.

2605.25869 2026-05-26 cs.CL

Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation

Zhengda Jin, Bingbing Wang, Jing Li, Ruifeng Xu, Min Zhang

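AI summary: Proposes MemIR, a typed memory intermediate representation that enforces source monitoring as a structural constraint, addressing the provenance-role collapse caused by unstructured storage in long-term agents; it outperforms existing baselines on LoCoMo and BEAM-100K.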

Abstract

Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. To resolve this cognitive vulnerability at the architectural level, we propose MemIR, a typed Memory Intermediate Representation that operationalizes source monitoring as a structural constraint. MemIR writes long-term memory into grounded atoms that separate raw evidence, retrieval cues, and truth-bearing claims, with factual authorization restricted to supported claim atoms. It then applies multi-route atomic projection and provenance-scoped utilization to transform heterogeneous retrieval hits into claim-centered candidate bundles and a normalized fact interface for answer generation. Experiments on LoCoMo and BEAM-100K demonstrate that MemIR consistently outperforms existing memory baselines, especially on tasks requiring source tracking, temporal grounding, and aggregation of fragmented evidence.

2605.25868 2026-05-26 cs.HC cs.LG

The Timing Dependencies of Trust: Speed, Accuracy, and cBCI Neuro-Decoupling in Human-AI Teams

Christopher Baker, Stephen Hinton, Akashdeep Nijjar, Riccardo Poli, Caterina Cinel, Tom Reed, Stephen Fairclough

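AI summary: Comparing a fast, less accurate AI assistant (FLA-AI) with a slow, accurate one (SA-AI) using a collaborative brain-computer interface (cBCI) and an adaptive Riemannian Oracle, the study finds that AI response timing determines the team failure mechanism: fast AI induces blind compliance, slow AI induces delayed cognitive conflict, and a Hybrid Fusion approach improves team performance.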

Abstract

The speed and accuracy of an artificial teammate fundamentally alter the failure states of Human-AI integration. While high-speed AI interventions risk inducing reflexive blind compliance, delayed interventions can induce ambiguous cognitive conflict. This study investigates how the fundamental characteristics of an in-task AI assistant, Fast/Less-Accurate (FLA-AI) versus Slow/Accurate (SA-AI), impact the synergy of Collaborative Brain-Computer Interface (cBCI) teams in a Virtual Reality drone task. Seventeen operators completed continuous search tasks under high cognitive workload while their spatial covariance was mapped using a 2D Adaptive Riemannian Oracle. The results mathematically demonstrate that AI timing dictates the mechanism of team failure. Fast AI induced instant, blind compliance; human accuracy under deception collapsed to 50.2%, and pure behavioural teams (N=8) failed to scale beyond 74.1%. In contrast, Slow AI induced delayed cognitive conflict; humans hesitated (61.1% accuracy), but N=8 behavioural teams eventually recovered to 100.0%. Crucially, the Riemannian Oracle mathematically adapted to these states: it heavily restricted temporal windows (< 0.8s) to intercept fast reflexive compliance, while widening windows (> 1.2s) to capture delayed cognitive conflict. Integrating these isolated veridical signals via Hybrid Fusion successfully rescued the Fast AI team (+7.6% at N=8) and significantly accelerated the recovery of smaller Slow AI teams (+6.9% at N=4). These findings prove that cBCI synergy is heavily contingent on the temporal dynamics of trust, providing a critical framework for designing dynamically gated Human-AI systems.

2605.25866 2026-05-26 cs.LG cond-mat.mtrl-sci physics.class-ph

UNATE: UNsupervised ATomic Embedding for crystal structures property prediction

Laura Solà-Garcia, Àlex Solé, Javier Ruiz-Hidalgo

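AI summary: Proposes the UNATE framework, which learns robust atomic representations from unlabeled crystal structures via an unsupervised denoising autoencoder and self-supervised contrastive learning for downstream property prediction, with improvements of up to 10% when labeled data are limited.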

2605.25864 2026-05-26 cs.LG cs.CL

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin

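AI summary: Proposes the RLAVR framework, which actively acquires a small number of ground-truth labels and combines them with pseudo-labels, using the CAG metric and the CARE policy to stabilize training and improve performance under limited annotation budgets.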

Abstract

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.

2605.25860 2026-05-26 cs.CV

SAM3-Assisted Training of Lightweight YOLO Models for Precision Pig Farming

Marcos Vinicius Mendes Faria, Thiago Borges Pereira, Isabella C. F. S. Condotta, Thiago Meireles Paixão, Francisco de Assis Boldt

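AI summary: Uses SAM 3 to automatically generate pseudo-labels for training YOLOv8 detectors without manual annotation, reaching 79.4% mAP on the PigLife dataset with inference about 200 times faster than the teacher model.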

Comments: Accepted for publication at the IEEE Sensors Applications Symposium (SAS 2026)

Abstract

Deep learning-based object detection has revolutionized Precision Livestock Farming (PLF), yet a critical barrier remains: high-performance Foundation Models (such as SAM 3) are too computationally intensive for edge deployment, while lightweight models (like YOLO) require prohibitive manual annotation efforts. This work proposes a fully automated knowledge distillation pipeline that leverages the Segment Anything Model 3 (SAM 3) to generate zero-shot pseudo-labels for training efficient YOLOv8 detectors. By treating SAM 3 as an offline auto-annotator, we eliminate the manual labeling bottleneck, producing models capable of real-time inference on resource-constrained hardware. We systematically evaluate this approach on the PigLife dataset, comparing SAM 3-supervised models against human-annotated baselines. Results demonstrate that a SAM 3-trained YOLOv8m achieves a mean Average Precision (mAP) of 79.4% without human intervention, while reducing inference latency by approximately 200$\times$ compared to the teacher model. Furthermore, stratified analysis reveals that in low-occlusion scenarios, the automated pipeline achieves detection rates comparable to human benchmarks ($AP_{50} > 99\%$). These findings indicate that foundation models can serve as effective, zero-annotation-cost supervisors, enabling scalable edge computing solutions for smart agriculture.
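
One step any such pipeline needs is turning a teacher's segmentation mask into a detector training label. The sketch below converts a binary mask into a line in the YOLO text label format (class index followed by box center, width, and height, all normalized to the image size). The helper name `mask_to_yolo_line` and the single-class default are assumptions for illustration; the abstract does not describe how the authors perform this conversion.

```python
import numpy as np

def mask_to_yolo_line(mask, class_id=0):
    """Convert a binary mask of shape (H, W) to 'class xc yc w h' (normalized)."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # empty mask: no pseudo-label for this instance
    h, w = mask.shape
    x0, x1 = xs.min(), xs.max() + 1   # tight box, exclusive upper edge
    y0, y1 = ys.min(), ys.max() + 1
    xc, yc = (x0 + x1) / 2 / w, (y0 + y1) / 2 / h
    bw, bh = (x1 - x0) / w, (y1 - y0) / h
    return f"{class_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}"

# Example: a 10 x 20 image with one rectangular instance.
mask = np.zeros((10, 20), dtype=np.uint8)
mask[2:6, 5:15] = 1
print(mask_to_yolo_line(mask))
```

Writing one such line per detected instance into a text file per image is enough to train a detector on pseudo-labels, and keeping the teacher offline means only the lightweight student runs at the edge.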

2605.25859 2026-05-26 math.ST cs.LG stat.TH

Minimax Limits of k-Fold Cross-Validation via Majority

Ido Nachum, Rüdiger Urbanke, Thomas Weinberger

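AI summary: By analyzing the cross-validation mean-squared error of the majority algorithm in binary classification, the paper characterizes the minimax limits of k-fold cross-validation and proves that, when the number of folds k grows with the sample size n, the mean-squared error of any empirical risk minimization algorithm is bounded below by $\Omega(\sqrt{k}/n)$.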

Abstract

We study the mean-squared error of $k$-fold cross-validation as a risk estimator, with particular emphasis on how its accuracy depends on the number of folds $k$. Despite the widespread use of cross-validation, principled guidance for choosing $k$ is largely absent, mainly due to the complex dependence between fold-wise error estimates. To obtain sharp and interpretable results, we focus on the majority algorithm in binary classification, a minimal yet nontrivial empirical risk minimization procedure. We provide a fine-grained analysis of its cross-validation behavior, showing that even this simple algorithm exhibits subtle and delicate phenomena for which existing theory provides loose and even vacuous bounds. Leveraging this analysis, we introduce a minimax framework for cross-validation risk estimation and prove that no empirical risk minimization algorithm can achieve an $O(1/n)$ minimax mean-squared error when the number of folds grows with the number of samples $n$; instead, a lower bound of order $\Omega(\sqrt{k}/n)$ is unavoidable. Our results reveal fundamental limitations of cross-validation as a data-reuse strategy, clarify gaps and inaccuracies in prior theoretical work, and position the majority algorithm as a natural benchmark that any tight analysis of cross-validation should be able to explain.
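
The setting can be explored numerically with a small Monte Carlo sketch. It assumes i.i.d. Bernoulli(p) labels with no features, takes the target to be the true risk of the majority rule fitted on the full sample, and breaks ties toward label 1. These are illustrative choices; the paper's exact setup may differ.

```python
import numpy as np

def majority(labels):
    """Majority algorithm: predict the most frequent label (ties go to 1)."""
    return int(2 * labels.sum() >= labels.size)

def cv_estimate(y, k):
    """k-fold cross-validation estimate of the majority rule's error rate."""
    folds = np.array_split(np.arange(y.size), k)
    errs = []
    for idx in folds:
        train = np.delete(y, idx)                     # all folds except the held-out one
        errs.append(np.mean(y[idx] != majority(train)))
    return float(np.mean(errs))

def cv_mse(n, k, p=0.5, trials=2000, seed=0):
    """Monte Carlo MSE of the CV estimate against the fitted rule's true risk."""
    rng = np.random.default_rng(seed)
    sq = []
    for _ in range(trials):
        y = (rng.random(n) < p).astype(int)
        true_risk = 1 - p if majority(y) == 1 else p  # risk of a constant predictor
        sq.append((cv_estimate(y, k) - true_risk) ** 2)
    return float(np.mean(sq))
```

Calling `cv_mse(n, k)` for several values of `k` at fixed `n` gives an empirical view of how the error of the estimate depends on the number of folds, which is the dependence the abstract analyzes.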

2605.25856 2026-05-26 cs.HC cs.AI

Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

Daniela Fernandes, Daniel Buschek, Lev Tankelevitch, Thomas Kosch, Robin Welsch

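AI summary: A user study of how showing large language model reasoning traces (full or summarized) affects task performance, trust, hedonic appeal, and the calibration of self-evaluation; traces improve subjective experience but yield no performance gain and are accompanied by overconfidence.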

Comments: 27 pages, 5 figures, 9 tables

Abstract

Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which participants solved ten LSAT-style reasoning problems under one of three conditions: an Answer-only baseline, a Full-trace revealed before the answer, and a Summary-trace presented alongside the answer. Summaries preserved task performance at the no-trace baseline while significantly elevating trust and hedonic appeal, establishing that trace exposure shifts subjective appraisal of the interaction without bringing performance benefits. Under an open-weight reasoning model exposing verbose intermediate output, full traces additionally impaired performance relative to the answer-only baseline. Across all conditions, participants substantially overestimated their performance, and no trace format supported calibrated self-evaluation. Further analysis indicates that hedonic appeal, not trust, carries the indirect path to overestimation, consistent with a processing-fluency account. Reasoning traces are best understood as user-facing interface artifacts rather than transparent windows into model cognition, and calibration is unlikely to emerge from the traces themselves and may best be scaffolded by interactions that elicit users' own reasoning first.

2605.25854 2026-05-26 cs.AI

From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

Haiyang You, Chengwei Lou, Jin Zhao, Yue Zhou, Lu Zhang, Jin Yang

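AI summary: Proposes a differentiable optimization framework that internalizes virtual water impacts into power system dispatch, learning coordination policies end to end with deep learning, and reduces freshwater withdrawals by about 3-5% on the IEEE 30-bus and 118-bus systems.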

Abstract

The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites. These withdrawals occur at generation sites and are virtually allocated to demand based on network power flows. Consequently, the actual water footprint of a specific load varies dynamically with generation dispatch and network conditions. Existing approaches typically rely on static statistical accounting to quantify these water footprints. However, such static methods fail to capture how dispatch optimization and workload relocation dynamically affect water withdrawals. As a result, static statistical accounting approaches remain decoupled from the optimization process, rendering them incapable of guiding workload relocation or power dispatch to mitigate water stress. To address this limitation, this paper develops an operational electricity-computation-water (ECW) nexus framework that internalizes virtual water impacts directly into power system dispatch. The framework represents dispatch optimization as a differentiable optimization layer embedded within a deep learning architecture, enabling efficient end-to-end learning of coordination policies while preserving operational feasibility. Combined with fixed-point coordination, the framework enforces consistency between virtual water attribution and physical generation-side withdrawals. Case studies on the IEEE 30-bus and 118-bus test systems demonstrate reliable convergence, exact power-water consistency, and reductions of approximately 3-5% in generation-related freshwater withdrawals under water-constrained conditions.

2605.25851 2026-05-26 cs.RO

RePlan-Bot: Multi-Level Replanning for Embodied Instruction Following

Xicheng Gong, Guozheng Sun, Peiran Xu, Yadong Mu

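AI summary: Proposes RePlan-Bot, which uses multi-level continuous replanning (a high-level LLM-based auditor, commonsense-guided search, and a lightweight ViT-based corrector) to handle long-horizon planning and irreversible state changes in embodied instruction following, achieving the best performance on the ALFRED benchmark.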

Comments: 10 pages

Abstract

Embodied instruction following (EIF) requires agents to understand and execute complex natural language commands within interactive 3D environments. Despite recent advances, existing methods often fail in long-horizon planning and handling irreversible state changes, resulting in low task success rates. To address these challenges, we introduce RePlan-Bot, a novel EIF agent that performs multi-level, continuous replanning throughout task execution. RePlan-Bot integrates a high-level LLM-based auditor for dynamic sub-goal adjustments guided by environmental feedback, a commonsense-guided search mechanism based on a multi-layered instance map for precise and structured object localization, and a lightweight ViT-based corrector to preemptively fix risky low-level actions. Evaluated on the ALFRED benchmark, RePlan-Bot achieves state-of-the-art performance in both seen and unseen environments, demonstrating superior adaptability and reliability.

2605.25850 2026-05-26 cs.CL cs.AI cs.LG

TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

Muyu Pan, Shu Zhao, Nan Zhang, Philip Shin, Varun Parekh, Vijaykrishnan Narayanan, Rui Zhang

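AI summary: Proposes TIAR, which uses the multiple trajectories in GRPO as a natural abstention signal to dynamically reweight the abstention reward, achieving the best abstention F1 scores in five of six evaluation categories while preserving baseline accuracy.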

Comments: 10 pages, 1 figure, 4 tables

Abstract

This paper investigates abstention learning for large language models (LLMs), building on the use of a ternary reward to incentivize truthfulness. It extends that idea by moving from a static ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically reweights the abstention reward during Group Relative Policy Optimization (GRPO) training. The work focuses on abstention learning rather than on improving truthfulness, serving as an exploration into hallucination reduction. Its novelty lies in methodological innovation, advantage reweighting, and benchmark selection. Leveraging GRPO's multiple trajectories as a natural abstention signal, the method uses a reward signal to explore knowledge boundaries and encourage consistency. After demonstrating that trajectories can serve as an indicator of the policy's confidence on a query, the method uses them to dynamically calculate the abstention advantage. AbstentionBench is used as the evaluation benchmark, as this work aims to contribute to the field of abstention learning. The method and various baselines were tested on all datasets in the benchmark. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores across five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.
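
The mechanics can be illustrated with a small sketch of group-relative advantages. The normalization (subtract the group mean, divide by the group standard deviation) is the usual GRPO form. The specific reweighting rule, which sets the abstention reward to `0.5 - confidence` with confidence taken as the share of correct answers among answered rollouts, is a hypothetical stand-in: the abstract does not give TIAR's actual formula.

```python
import numpy as np

CORRECT, ABSTAIN, WRONG = "correct", "abstain", "wrong"

def group_advantages(outcomes, eps=1e-8):
    """Group-relative advantages for one query's sampled trajectories."""
    n_correct = sum(o == CORRECT for o in outcomes)
    n_answered = sum(o != ABSTAIN for o in outcomes)
    # Hypothetical confidence proxy: accuracy among trajectories that answered.
    confidence = n_correct / n_answered if n_answered else 0.0
    # Hypothetical rule: abstaining pays more when the group is often wrong.
    # The value stays in [-0.5, 0.5], between wrong (-1) and correct (+1).
    abstain_reward = 0.5 - confidence
    rewards = np.array([
        1.0 if o == CORRECT else -1.0 if o == WRONG else abstain_reward
        for o in outcomes
    ])
    # GRPO-style normalization within the group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_advantages([CORRECT, WRONG, ABSTAIN, WRONG]))
```

A static ternary reward would fix the abstention value at one constant for every query; letting it depend on the group's outcomes is what makes the signal trajectory-informed in this sketch.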

2605.25848 2026-05-26 cs.LG cs.AI

Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams

James Henry

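AI summary: Proposes Geometric Evolution Maps (GEMs), which track a concept's directional trajectory through the residual stream and identify the handoff layer where rotation ceases in order to extract stable concept probes; GEM probes strictly outperform peak-layer probes in 66.2% of 391 concept x model pairs.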

Comments: 24 pages, 3 figures. Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433)

Abstract

Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common practice of probing at a fixed late layer or at the peak of a separation score function ignores a fundamental structural feature: concept representations undergo substantial directional rotation during their assembly phase, and do not settle into a stable direction until a characteristic handoff layer after the primary Concept Allocation Zone (CAZ). We introduce Geometric Evolution Maps (GEMs), which track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases, and extract the settled probe direction from that layer. Across 23 architectures spanning 70M to 14B parameters and 17 concept types, the entry-to-exit cosine similarity within CAZs has a mean of 0.233, showing that probe direction at CAZ entry does not reliably predict probe direction at exit. Ablation experiments across 391 concept x model pairs (23 models x 17 concepts) show that GEM-extracted probes are at least as precise as peak-layer probes in 268/391 trials (68.5%), and strictly outperform in 259/391 (66.2%). The architecture split is pronounced: MHA models favour the handoff in 173/221 trials (78.3%); GQA models favour the handoff in only 56/119 trials (47.1%). Model-level Wilcoxon: W=214, N=23, p=0.010 (one-sided). An adaptive ablation width rule targets the 79/391 near-final-layer cases: it improves probe quality in 60/79 triggered cases (75.9%), mean gain +7.44pp. A direction-specificity control confirms the ablation effect is concept-direction specific: median 377x suppression rate versus random-direction ablation (99.1% of concept directions beat all 10 random seeds). Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
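
To make the idea of a "handoff layer" concrete, the sketch below takes one probe direction per layer, measures the cosine similarity between consecutive layers, and reports the earliest layer from which the direction no longer rotates appreciably. The settling criterion, the 0.95 threshold, and the function names are assumptions for illustration only; this is not the rosetta_tools implementation, and the abstract does not specify how the paper detects that rotation has ceased.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def handoff_layer(directions, threshold=0.95):
    """Earliest layer index from which every later layer-to-layer cosine is >= threshold.

    Returns None if the direction never settles under this (assumed) criterion.
    """
    sims = [cosine(directions[i], directions[i + 1]) for i in range(len(directions) - 1)]
    for start in range(len(sims)):
        if all(s >= threshold for s in sims[start:]):
            return start
    return None


# Toy 2-D directions: the probe rotates over the early layers, then settles.
angles_deg = [0, 40, 75, 88, 90, 90]
directions = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]

layer = handoff_layer(directions)
entry_exit = cosine(directions[0], directions[-1])
```

In this toy case the first and last directions are nearly orthogonal, which mirrors the abstract's point that a low entry-to-exit cosine means an early-layer direction does not predict the settled one.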

2605.25846 2026-05-26 cs.CL

On the Limits of Model Merging for Multilinguality in Pre-Training

Seth Aycock, Fedor Vitiugin, Aleksandr Umnov, Christof Monz, Khalil Sima'an

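AI summary: Controlled experiments comparing mixed pre-training, model merging, and monolingual pre-training find that merging monolingual models causes performance collapse, indicating that representational similarity is a prerequisite for model merging.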

2605.25836 2026-05-26 cs.CR cs.AI cs.CL

TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification

Yutong Cheng, Changze Li, Raihan Sultan Pasha Basuki, Qian Cui, Wei Ding, Peng Gao

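AI summary: Proposes TTPrint, a diverge-then-converge design that extracts broadly and then verifies rigorously, combining deterministic evidence localization with verification against authoritative definitions to substantially improve macro-F1 on document-level TTP extraction.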

Comments: Preprint

Abstract

Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high recall (not missing techniques) and high precision (not hallucinating unsupported ones). Existing methods--rule-based, supervised, and LLM-based--struggle to achieve both: rule-based and supervised approaches lack generalizability across diverse attack descriptions, while LLM-based approaches that couple candidate generation and validation within a single inference step suffer from limited recall and precision simultaneously. We propose TTPrint, which addresses this challenge through a diverge-then-converge design inspired by how human analysts work: first extracting broadly, then verifying rigorously. In the divergent phase, reports are decomposed into atomic behaviors and candidate techniques are proposed broadly. A deterministic span localization stage then anchors each candidate to a specific evidence window in the source text. A convergent verification stage retains only candidates supported by both the localized evidence and the authoritative MITRE definition. We contribute two evaluation resources--a cleaned TRAM benchmark (TRAM-Clean) and a new annotated dataset (TTPrint-Bench)--to address known annotation noise in existing benchmarks and elevate the task to document-level TTP extraction. On TRAM-Clean and TTPrint-Bench, TTPrint achieves 76.48% and 87.39% macro-F1 respectively, outperforming the leading baseline by 63.5% and 29.4%. A multi-backbone analysis across six LLMs and a threshold sensitivity study further demonstrate generalizability across model choices and provide practical guidance for parameter selection.

2605.25835 2026-05-26 cs.LG cs.AI

Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation

Andrey Kozachok, Anatoliy Bakaev, Aleksandr Kozachok, Shamil Magomedov, Artem Noev

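AI summary: Proposes context-instrumental data distillation, which builds a corpus through synthetic generation and reverse instruction generation filtered by external validators, and fine-tunes a 1.5B-parameter small language model to generate Kubernetes manifests under resource constraints; experiments indicate that strict output formatting matters more than adding training examples.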

Comments: 15 pages, 4 figures, 2 tables

Abstract

This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-specific languages (DSL). Kubernetes manifests are chosen as the target domain. We propose the context-instrumental data distillation method: the source corpus is formed through synthetic generation and, in an extended scheme, through reverse instruction generation from real Kubernetes YAML files, with pairs included in training only upon passing external validators and matching the domain context model. Unlike classical KL-divergence knowledge distillation, the baseline implementation reduces to supervised fine-tuning on instrumentally verified examples. The experimental section presents a pilot implementation under resource-constrained conditions: the DeepSeek-V4 Flash API serves as the teacher for synthetic generation, while Qwen2.5-Coder-1.5B-Instruct is fine-tuned via LoRA on CPU. On the K8s-Distill-Pilot corpus (train_1200, validation_100, test_200), we achieved full-pass@1 = 91.5% (183/200) with a stricter prompt formulation and max_new_tokens=768. The key empirical finding is that for Kubernetes YAML, result quality in the pilot depended more on strict output format requirements than on simply increasing the number of training examples.

2605.25832 2026-05-26 cs.RO cs.AI cs.CL cs.CV

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang

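AI summary: Proposes Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into a natural-language skill library, yielding transferable robot design knowledge that improves cold-start search on EvoGym tasks and transfers skills across design spaces.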

Comments: 20 pages, 8 figures

Abstract

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

2605.25831 2026-05-26 cs.CL cs.AI cs.LG

Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation

Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández

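AI summary: Proposes Belief-Augmented Generation (BAG), which injects an LLM's own belief state into the prompt so that it reasons over multiple sampled responses and decides on a conversational strategy (answer, clarify, or abstain), improving accuracy on multi-turn ambiguous QA and the faithfulness of strategy decisions.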

Abstract

Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state - responses a model deems plausible. Existing work exploits this representation for narrow tasks such as decoding or selective prediction, and often requires manual interventions without controlling generation directly. We propose Belief-Augmented Generation (BAG): grounding LLMs in their own belief state via the prompt and letting them reason over these K samples to decide on a conversational strategy: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, we find that LLMs by default rarely clarify or abstain, ignoring uncertainty about the input or facts. BAG improves QA accuracy across six models and yields strategy decisions more faithful to the belief state than prompt-only baselines. Disentangling when to clarify from when to abstain, however, remains challenging.
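
A minimal sketch of the prompting idea described in this abstract: K sampled responses are summarised as an empirical distribution and placed in the prompt, so the model can reason over its own belief state before choosing a strategy. The normalisation step, the prompt wording, and the function names are hypothetical; the abstract does not give the paper's actual prompt, and no model call is made here.

```python
from collections import Counter


def belief_state(samples):
    """Empirical distribution over K sampled responses, most frequent first."""
    counts = Counter(s.strip().lower() for s in samples)
    k = len(samples)
    return [(resp, n / k) for resp, n in counts.most_common()]


def build_prompt(question, samples):
    """Inject the belief state into the prompt and ask for a strategy (wording is hypothetical)."""
    lines = [f"- {resp} ({p:.0%} of samples)" for resp, p in belief_state(samples)]
    return (
        f"Question: {question}\n"
        "Your own sampled responses to this question:\n"
        + "\n".join(lines)
        + "\nConsidering how much these responses agree, choose one strategy: "
        "ANSWER, CLARIFY, or ABSTAIN, then respond accordingly."
    )


# K = 4 toy samples for an ambiguous question (which cup?).
samples = ["Spain", "Spain", "Chelsea", "spain "]
state = belief_state(samples)
prompt = build_prompt("Who won the cup in 2010?", samples)
```

The resulting `prompt` string would then be sent to the same model that produced the samples; disagreement among the samples is the cue for clarifying or abstaining.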

2605.25829 2026-05-26 cs.RO cs.AI

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

Xinzhe Chen, Sihua Ren, Liqi Huang, Haowen Sun, Mingyang Li, Xingyu Chen, Zeyang Liu, Xuguang Lan

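AI summary: Proposes OASIS, a visuomotor policy that aligns the intermediate representation with the action space via SE(3) end-effector trajectory prediction, outperforming VLA and WAM baselines in simulation and real-world experiments.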

Abstract

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.

2605.25826 2026-05-26 math.NA cs.CE cs.LG cs.NA

Branched Signature Kernel Solvers for ODEs with rough Single-Trajectory signals

Munawar Ali, Qi Feng, Charlie Pyle, George Xu

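AI summary: For ODEs driven by a single rough signal, proposes a branched signature kernel solver based on count-sampling and kernel collocation that delivers accurate, stable predictions.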

Comments: 39 pages, 12 figures

Abstract

We develop a branched signature kernel solver for linear and nonlinear ordinary differential equations driven by a \emph{single observed trajectory} of a possibly rough forcing signal -- a setting that arises naturally in earthquake engineering, finance, biology, and structural health monitoring, where the forcing is observed exactly once and the solver must respect the underlying physical law without recourse to an ensemble of realizations. Two ingredients are new. First, a \emph{count-sampling} construction turns the single observation into a hierarchical family of $N+1$ nested training paths on which the branched signature kernel can be evaluated; this allows the signature kernel machinery, originally designed for multi-realization regression problems, to operate on a single-trajectory observation. Second, a kernel-collocation framework places the ansatz either on the highest-order derivative of the solution (with lower derivatives recovered by integrating the kernel) or on the solution itself (after $m$-fold integration of the ODE). We prove a universal approximation theorem for the branched signature kernel, leveraging the Hairer--Kelly morphism to express branched signature evaluations through geometric signatures of time-extended paths. The offline solver is extended to a streaming Test/Train/Retrain protocol with closed-form online updates in the linear case and scalar Newton steps in the nonlinear case. Numerical experiments on six benchmarks (El-Centro earthquake displacement, the Solow capital-stock model, an fBM-driven second-order ODE, a forced Duffing oscillator, a path-dependent Arias-intensity-degraded oscillator with variable coefficients, and a noisy Kuramoto phase-oscillator system) show that the branched signature-kernel solver delivers accurate, stable predictions across all regimes.

2605.25821 2026-05-26 cs.CV

[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

Akang Wang, Xili Deng, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

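AI summary: To address the inadequacy of the global [CLS] representation of vision-language models such as CLIP in multi-label recognition, proposes the PIAA framework, which performs training-free multi-label recognition through patch-level inference and adaptive aggregation, with an mAP gain of more than 6% on NUS-WIDE.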

Abstract

Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at https://github.com/akang-wang/PIAA.

2605.25819 2026-05-26 cs.LG cs.CR

On Reliability of Efficient Membership Inference Vulnerability Evaluation

Joonas Jälkö, Gauri Pradhan, Ossi Räisä, Antti Honkela

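AI summary: Identifies two key weaknesses in efficient membership inference attack evaluation: miscalibrated per-sample FPRs, which make differential privacy auditing unreliable, and a finite population bias that overestimates per-sample vulnerability; proposes a post-processing calibration method.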

Comments: 14 pages, 10 figures

Abstract

Membership inference attacks (MIAs) are popular methods for empirically assessing the leakage of sensitive information in the training data through models or statistics learned from the data. MIA vulnerability is often evaluated through the false positive rate (FPR) and true positive rate (TPR) of a binary classifier that tries to predict whether a particular sample was in the training data. However, reliably estimating the TPR, especially at low FPR values, requires many observations, which for MIAs translates to many target models and hence a large computational cost. To avoid excessive compute requirements, the MIA scores are often averaged over multiple individuals and multiple targeted models. We demonstrate two key weaknesses in this efficient MIA evaluation pipeline. First, we show that evaluating the TPR based on MIA scores concatenated across multiple individuals, commonly used to study vulnerabilities in the very low FPR regime, is not calibrated across the per-sample FPRs. This makes it unreliable as a tool for auditing differential privacy. To solve this, we propose a post-processing method to effectively calibrate the FPR across different samples. Second, we identify a finite population bias in the commonly used efficient likelihood-ratio attack (LiRA) implementation proposed by Carlini et al. (2022), leading to a positive bias in the per-sample vulnerability.
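
The evaluation metric this abstract refers to, TPR at a fixed low FPR, can be sketched as follows for a simple threshold attack on pooled scores. The function name, the tie-handling convention, and the toy scores are assumptions for illustration; this is the generic pooled computation, not the calibration method the paper proposes.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR of the attack 'score > threshold', with the threshold chosen so that
    the empirical FPR on non-members does not exceed target_fpr."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # number of false positives tolerated
    # Threshold at the (allowed + 1)-th largest non-member score; only strictly
    # larger scores are flagged, so at most `allowed` non-members are flagged.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    tpr = sum(s > threshold for s in member_scores) / len(member_scores)
    fpr = sum(s > threshold for s in nonmember_scores) / len(nonmember_scores)
    return tpr, fpr, threshold


# Toy attack scores (higher means "more likely a training member").
members = [0.3, 0.5, 0.7, 0.8, 0.95]
nonmembers = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.55, 0.6, 0.9]

tpr, fpr, threshold = tpr_at_fpr(members, nonmembers, target_fpr=0.1)
```

Because one global threshold is applied to scores pooled across samples, nothing in this computation forces each individual sample to have the same FPR, which is the calibration gap the abstract points to.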

2605.25816 2026-05-26 cs.CL cs.AI

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

Pritesh Jha

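AI summary: Fine-tunes DeBERTa for broad-coverage PII detection on a multi-source PIIBench dataset covering 82 entity types; direct fine-tuning substantially outperforms the architecturally more complex hierarchical model and curriculum extension in F1.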

DOI: 10.5281/zenodo.20379635

Abstract

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine-tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.
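
For readers unfamiliar with the "support-weighted entity F1" used to compare coarse groups, the sketch below shows one common way to compute such an aggregate: each entity type's F1 is weighted by its number of gold mentions. The function name and the toy numbers are purely illustrative and are not taken from the paper, which may define the aggregate differently.

```python
def support_weighted_f1(per_type):
    """per_type maps entity type -> (f1, support); returns the support-weighted mean F1."""
    total = sum(support for _, support in per_type.values())
    return sum(f1 * support for f1, support in per_type.values()) / total


# Made-up per-type scores for one coarse group (not results from the paper).
toy_group = {"EMAIL": (0.9, 300), "NAME": (0.6, 600), "SSN": (0.3, 100)}
score = support_weighted_f1(toy_group)
```

Frequent types dominate this aggregate, so a model can win a coarse group while still losing on some rare fine-grained types.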

2605.25814 2026-05-26 cs.CL cs.AI

Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke

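AI summary: Proposes Alper, a framework that unifies matching and clustering through iterative probabilistic label propagation, adaptively combining weak signals from graph propagation with strong LLM queries and maximizing marginal gain under a budget constraint for cost-effective entity resolution.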

Abstract

Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
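
The budgeted selection problem mentioned in this abstract can be pictured with a generic greedy heuristic: rank candidate LLM queries by estimated gain per unit cost and take them while the budget allows. This is a textbook-style sketch under assumed inputs (fixed gain estimates, known costs, hypothetical names); it is not Alper's algorithm and carries none of its theoretical guarantees.

```python
def greedy_select(candidates, budget):
    """Pick pairwise queries by descending gain-per-cost until the budget is spent.

    candidates: list of (pair, estimated_gain, cost) tuples with cost > 0.
    Returns the chosen pairs in selection order and the total cost spent.
    """
    chosen, spent = [], 0.0
    for pair, gain, cost in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(pair)
            spent += cost
    return chosen, spent


# Toy candidate record pairs with assumed gains and query costs.
candidates = [
    (("r1", "r2"), 0.9, 1.0),
    (("r3", "r4"), 0.2, 1.0),
    (("r1", "r5"), 0.6, 1.0),
    (("r2", "r6"), 0.5, 2.0),
]
chosen, spent = greedy_select(candidates, budget=3.0)
```

In an iterative setting such as the one the abstract describes, the gain estimates would presumably be refreshed after each round of propagation instead of being fixed up front.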

2605.25813 2026-05-26 cs.RO

Extending Embodied Question Answering from Perception to Decision

Xicheng Gong, Qiwei Li, Peiran Xu, Yadong Mu

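AI summary: Proposes EQA-Decision, a large-scale embodied question answering dataset, and RoboDecision, a baseline model, systematically covering static scene construction, spatial understanding, task dynamics reasoning, and instant decision to evaluate perception, reasoning, and action-level decision-making in embodied environments within a unified framework.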

Comments: 11 pages, 4 figures

Abstract

Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.

2605.25811 2026-05-26 stat.ME cs.LG stat.ML

Geometry Adaptive Counterfactual Distribution Learning with Diffusion-Guided Smoothing

Kwangho Kim

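AI summary: For high-dimensional counterfactual distribution learning, proposes two diffusion-guided, geometry-adaptive smoothing estimators whose error is governed by an effective dimension, validated in experiments on CelebA.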

Abstract

We study counterfactual distribution learning for high-dimensional outcomes whose counterfactual law may concentrate near lower-dimensional structure. Standard isotropic smoothing treats all ambient directions equally, leading to unfavorable scaling and unstable local inference. We propose two diffusion-guided estimators based on semiparametric debiasing: diffusion-informed smoothing for counterfactual densities and diffusion-informed score smoothing for counterfactual scores. The estimators combine causal nuisance adjustment with geometry-adaptive localization driven by diffusion score information, removing first-order nuisance bias while aligning smoothing with local outcome geometry. We establish asymptotic expansions, risk bounds, and inference procedures for smoothed density and score-based targets, with ambient density inference obtained under additional approximation conditions. Under structural geometry conditions, the leading stochastic error is governed by an effective dimension induced by the diffusion-guided kernel, rather than by the ambient dimension. Semi-synthetic experiments based on CelebA show steeper error decay for geometry-adaptive methods, supporting the proposed effective-dimension theory.
