arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10305 2026-06-10 cs.RO New submission

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

Qianzhong Chen, Hau Zheng, Justin Yu, Suning Huang, Jiankai Sun, Ken Goldberg, Chuan Wen, Pieter Abbeel, Yide Shentu, Philipp Wu, Mac Schwager

AI summary: Proposes a multi-task stage-aware reward model (RM) that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts value head to provide dense per-step rewards for robotic manipulation tasks, and builds the SPIRAL framework on top of RM to improve VLA policies from cheap autonomous rollouts, substantially raising success rates on a 10-task benchmark.

Abstract

Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense, accurate, and general. Existing methods fall short: task-specific stage-aware models are accurate but require per-task annotations, while general vision-language-model (VLM) reward models are broadly applicable but too coarse for fine-grained long-horizon progress. We introduce RM, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to produce dense per-step rewards across manipulation tasks. Building on RM, we further propose SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that improves VLA policies from cheap autonomous rollouts. On a 10-task benchmark, RM reduces value-estimation MSE by 80% over the strongest baselines; when used in SPIRAL, it improves task success from around 50% to near-perfect performance on Folding Shorts (58% to 100%) and Cleaning Whiteboard (50% to 90%), showing that high-quality dense rewards are key to a stable robot data flywheel. Project website: https://qianzhong-chen.github.io/sarm2.github.io/.

2606.10304 2026-06-10 cs.CL New submission

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

Pratibha Revankar, Kargi Chauhan, Jihye Kim, Sadiba Nusrat Nur, Vincent Siu, Chenguang Wang

AI summary: Finds that when LLM agents covertly encode sensitive data, the residual stream contains a shared low-dimensional encoding subspace that a logistic-regression probe detects with high accuracy, and builds the real-time monitor MIRAGE, which reaches AUC 0.918 on 126 scenarios, far above output-only detection.

Abstract

When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.

2606.10302 2026-06-10 cs.CL New submission

Where You Inject Diversity Matters: A Unified Framework for Diverse Generation

Cheng Zhang, Rui Xin, Chudi Zhong

AI summary: Proposes a unified framework that characterizes test-time diverse generation methods by their diversity source and a transmission score, and on that basis proposes fully automated specification-level methods that improve output diversity while maintaining quality on five open-ended tasks.

Abstract

Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models often produce similar generations. Existing test-time diversity methods operate at different stages of generation with varying effectiveness, but it remains unclear what design choices lead to meaningful diversity in the output. We introduce a framework that characterizes test-time diverse generation methods by the diversity source introduced during generation and provide a transmission score for measuring how effectively variation in the source reaches the final output. Guided by this framework, we propose fully automated specification-level generation methods that first generate diverse intermediate specifications and then condition on them to produce final responses. Across five open-ended tasks and four backbone models, specification-level injection improves output diversity over test-time baselines while maintaining comparable quality. Our analysis shows that successful diversity injection depends on both the diversity of the sources and their transmission to the output, highlighting source design and source-to-output realization as two key levers for building more diverse generation systems.

2606.10299 2026-06-10 cs.AI cs.CV cs.MA New submission

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Doeon Kwon, Junho Bang

AI summary: Shows experimentally that in spatial query settings geometry must lead memory recall while visibility judgments must be kept separate from recall, and proposes computing the visibility predicate with a ray-versus-voxel DDA.

Comments: 23 pages, 6 figures

Abstract

Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
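The occlusion test described in this abstract can be illustrated with a grid-traversal DDA. The following is a minimal sketch, assuming a 2D boolean occupancy grid rather than the paper's voxel world; the function name `is_visible` and the grid layout are hypothetical and not taken from the paper.

```python
import math

def is_visible(grid, start, target):
    """Walk the cells crossed by the segment start -> target (DDA traversal).

    grid[y][x] is True for a wall cell. start and target are (x, y) floats in
    cell units. A wall in the target cell itself does not block the ray.
    """
    x0, y0 = start
    x1, y1 = target
    cx, cy = math.floor(x0), math.floor(y0)
    tx, ty = math.floor(x1), math.floor(y1)
    dx, dy = x1 - x0, y1 - y0
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Parametric distance along the segment to the next cell boundary per axis.
    t_max_x = ((cx + (step_x > 0)) - x0) / dx if dx != 0 else math.inf
    t_max_y = ((cy + (step_y > 0)) - y0) / dy if dy != 0 else math.inf
    t_delta_x = abs(1 / dx) if dx != 0 else math.inf
    t_delta_y = abs(1 / dy) if dy != 0 else math.inf
    while (cx, cy) != (tx, ty):
        # Step into whichever neighboring cell the ray reaches first.
        if t_max_x < t_max_y:
            cx += step_x
            t_max_x += t_delta_x
        else:
            cy += step_y
            t_max_y += t_delta_y
        if not (0 <= cy < len(grid) and 0 <= cx < len(grid[0])):
            return False  # left the grid without reaching the target cell
        if (cx, cy) != (tx, ty) and grid[cy][cx]:
            return False  # a wall cell lies between viewer and target
    return True

grid = [[False] * 5 for _ in range(5)]
grid[2][2] = True  # a single wall cell
agent = (0.5, 2.5)
print(is_visible(grid, agent, (1.5, 2.5)))  # target in front of the wall: True
print(is_visible(grid, agent, (4.5, 2.5)))  # target behind the wall: False
```

A field-of-view cone check alone would accept both targets here, since both lie straight ahead of the agent; only the cell-by-cell walk notices the wall in between.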

2606.10298 2026-06-10 cs.AI cs.CL New submission

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Runze Jiang, Taiqiang Wu, Yan Wang, Bingyu Zhu, Longtao Huang

AI summary: Addresses knowledge conflicts between external context and parametric priors during LLM generation by proposing a conflict-aware paradigm that dynamically weights prior and context, with an adaptive mechanism that resolves the asymmetry across conflict regimes.

Comments: 27 pages, 9 figures

Abstract

When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a context-aware paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the conflict-aware paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a power family with an inherent regime asymmetry: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16-33 without sacrificing correction or agreement. Our code is available at https://github.com/keith-Jiang/conflict-aware-decoding.
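The link between an affine combination of logits and a power family of distributions can be checked numerically. The sketch below assumes one common parameterization, (1 + alpha) * context - alpha * prior, which may differ from the paper's own; the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_logits(prior_logits, context_logits, alpha):
    """Affine combination of logits (assumed parameterization).

    alpha = 0 uses the context alone, alpha = -1 uses the prior alone,
    -1 < alpha < 0 interpolates, and alpha > 0 extrapolates past the context.
    """
    return [(1 + alpha) * c - alpha * p
            for p, c in zip(prior_logits, context_logits)]

def power_family(prior_probs, context_probs, alpha):
    """Same distribution written as p_context^(1 + alpha) * p_prior^(-alpha)."""
    unnorm = [c ** (1 + alpha) * p ** (-alpha)
              for p, c in zip(prior_probs, context_probs)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

prior = [2.0, 0.5, 0.0]    # toy logits without the context
context = [0.0, 1.5, 0.5]  # toy logits with the context
for alpha in (-0.5, 0.0, 1.0):
    via_logits = softmax(combine_logits(prior, context, alpha))
    via_powers = power_family(softmax(prior), softmax(context), alpha)
    print(alpha, [round(v, 4) for v in via_logits], [round(v, 4) for v in via_powers])
```

Both columns agree for every alpha because the softmax normalizers only shift the combined logits by a constant. A fixed positive alpha keeps amplifying the context even when the prior was right, which is the asymmetry the abstract describes.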

2606.10296 2026-06-10 cs.CL cs.AI New submission

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

AI summary: Studies the relationship among token-level log-probabilities, LLM-as-judge scores, and task accuracy in multi-agent debate, finding that confidence aligns more strongly with reasoning quality for the Constructor and that confidence can detect critical reasoning failures.

Comments: 15 pages, 7 figures, 1 table, ACL proceedings

Abstract

Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture (a Constructor and an Auditor) with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
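Confidence-based failure detection of the kind reported above is typically scored with AUROC. The sketch below assumes that confidence is the mean token log-probability of a reasoning trace and that low confidence should flag a critical failure; the traces and labels are made-up toy values, not data from the paper.

```python
def mean_logprob(token_logprobs):
    """Average token log-probability of one reasoning trace."""
    return sum(token_logprobs) / len(token_logprobs)

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. labels use 1 for positive, 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical traces: per-token log-probabilities and a critical-failure flag.
traces = [
    ([-0.1, -0.2], 0),
    ([-1.5, -2.0], 1),
    ([-0.3, -0.5], 0),
    ([-0.2, -0.4], 1),
]
# Lower confidence should mean higher failure risk, so negate the mean.
failure_scores = [-mean_logprob(lp) for lp, _ in traces]
labels = [flag for _, flag in traces]
print(auroc(failure_scores, labels))  # 0.75
```

Running this separately on each agent's traces would give the per-role comparison the abstract reports, with 0.5 meaning the confidence signal carries no information about failures.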

2606.10288 2026-06-10 cs.RO New submission

MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds

Codrin Crismariu, Ryan K. Cosner

AI summary: Proposes a model-assisted reinforcement learning framework that combines safe reference trajectories from simplified models, teacher-policy training guided by a control Lyapunov function reward, and distillation into a vision-based student policy, achieving robust perceptive humanoid walking over sparse footholds.

Abstract

Perceptive bipedal locomotion over sparse terrain remains a difficult challenge: model-based methods are precise but brittle to uncertainty, while model-free methods are robust but struggle to discover the precise, constrained motions required for safety-critical locomotion where small errors can cause catastrophic failures. We propose a model-assisted reinforcement learning (RL) framework that combines both perspectives in three steps: (1) generate a safe reference trajectory using simplified models; (2) train a privileged teacher policy guided by a control Lyapunov function (CLF) reward built around the safe reference trajectory; and (3) distill the teacher into a vision-based student policy. We show that this model-assistance procedure produces physically grounded locomotion, improving sample efficiency, reducing the need for a complex learning curriculum, and achieving smoother locomotion behavior alongside stepping stone performance comparable to model-free baselines. We validate our approach in simulation and demonstrate successful deployment on a Unitree G1 humanoid robot navigating sparse footholds with lateral constraints.

2606.10287 2026-06-10 cs.LG cs.CL New submission

When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking

Haji Gul, Ajaz Ahmad Bhat

AI summary: Addresses conflicting metrics in KGC model evaluation with a multi-criteria decision-making framework; a meta-analysis finds Z-score to be the most balanced aggregator and identifies the top-ranked model for each prediction task.

Abstract

Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits@k, and Mean Rank, which often produce conflicting model orderings across datasets. A model that leads on MRR may trail on Hits@1, and strong performance on one dataset may not generalize to another. This fragmentation hinders comparison, enables selective reporting, and obscures real progress. We reframe KGC evaluation as a Multi-Criteria Decision-Making (MCDM) problem and present a meta-analysis of seven aggregators across five tests: consistency, cross-dataset stability, metric independence, robustness under noise, and generalizability. Each test is averaged over leave-one-model-out (LOMO) and leave-one-group-out (LOGO) removals so that reliability reflects aggregator behavior across diverse model subsets. Across tail (h, r, ?) and relation (h, ?, t) prediction, Pareto-optimal analysis identifies Z-score as the most balanced aggregator, which ranks DualE highest for tail prediction and FMS (Flow-Modulated Scoring) highest for relation prediction. A test-sensitivity analysis using the same removals shows that consistency and stability are largely removal-invariant, while generalizability and independence are the most sensitive. The framework resolves evaluation inconsistencies and offers evidence-based guidance for aggregator selection and model benchmarking in KGC.
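The metric conflict and a Z-score style aggregation can be sketched in a few lines. This is an illustrative sketch only: the rank lists are invented, and the aggregator shown (mean of per-metric z-scores, sign-flipped for lower-is-better metrics) is an assumed reading of "Z-score", not necessarily the paper's exact definition.

```python
import statistics

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    return sum(ranks) / len(ranks)

def z_score_aggregate(scores, lower_is_better=()):
    """scores maps model -> {metric: value}; returns model -> mean z-score."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for metric in metrics:
        values = [scores[m][metric] for m in models]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        for m in models:
            z = 0.0 if sigma == 0 else (scores[m][metric] - mu) / sigma
            # Flip the sign so that a larger z is always better.
            totals[m] += -z if metric in lower_is_better else z
    return {m: totals[m] / len(metrics) for m in models}

# Hypothetical ranks of the correct entity for four test triples per model.
ranks = {"model_a": [1, 2, 2, 2], "model_b": [1, 1, 10, 10]}
scores = {
    name: {"MRR": mrr(r), "Hits@1": hits_at_k(r, 1), "MR": mean_rank(r)}
    for name, r in ranks.items()
}
print(scores)  # model_a leads on MRR but trails on Hits@1
print(z_score_aggregate(scores, lower_is_better=("MR",)))
```

Standardizing each metric before averaging keeps Mean Rank, which lives on a much larger scale than MRR or Hits@k, from dominating the combined score.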

2606.10286 2026-06-10 cs.AI New submission

Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

Mustavi Ibne Masum, Thiago Eustaquio Alves de Oliveira, Mahzabeen Emu

AI summary: Proposes a simulator-guided LLM framework that encodes geotechnical constraints into action generation and produces interpretable schedules zero-shot, recovering 94% to 99% of the MILP-optimal NPV while computation time scales linearly.

Abstract

Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed-Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational complexity and inability to adapt in real time limit its practical deployment in dynamic industrial environments. This work introduces a simulator-driven Large Language Model (LLM) scheduling framework in which the LLM acts as an autonomous decision-making agent, guided at each step by a custom simulator that encodes geotechnical precedence, extraction-processing coupling, and dynamic capacity constraints directly into the action generation mechanism. Operating entirely zero-shot within a closed, data-secure environment, the framework produces complete, interpretable extraction and processing schedules without cloud-based inference, domain-specific fine-tuning, or retraining. To provide a trustworthy performance benchmark, a novel MILP formulation is developed that incorporates realistic operational and geotechnical constraints. Evaluated across mining instances of varying scale and time periods, the LLM-based framework recovers between 94% and 99% of the MILP optimal NPV while scaling linearly in computation time. These results position simulator-constrained LLM agents as a practical and scalable alternative to classical optimization for long-horizon industrial scheduling under complex operational constraints.

2606.10285 2026-06-10 cs.CL New submission

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

Jinghua Wang, Lily Jiaxin Wan, Sanjana Pingali, Scott Smith, Manvi Jha, Shalini Sivakumar, Xing Zhao, Kaiwen Cao, Deming Chen

AI summary: Presents OpenRTLSet, the largest fully open-source hardware design dataset, with over 130,000 diverse Verilog code samples that combine GitHub code with VHDL and C/C++ translations; natural-language descriptions generated with DeepSeek-R1 support fine-tuning of several language model families, showing that open-source approaches can achieve superior performance in hardware design.

Journal ref: 2025 IEEE International Conference on LLM-Aided Design (ICLAD), Stanford, CA, USA, 2025, pp. 212-218
Comments: Accepted by ICLAD'25

Abstract

OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.

2606.10284 2026-06-10 cs.LG New submission

Revisiting Positive Samples in Graph Contrastive Learning: From the Perspective of Message Passing

Lianze Shan, Ningchong Wang, Jitao Zhao, Di Jin, Dongxiao He

AI summary: Shows theoretically, from the perspective of Dirichlet energy, that message passing trivializes the maximization of positive-sample similarity so that graph contrastive learning struggles to learn from positive samples, and proposes SPGCL, which propagates only high-energy features and builds a probability matrix from low-energy features to restore the learning efficacy of positive samples.

Comments: 24 pages, 6 figures

Abstract

Graph Contrastive Learning (GCL), which trains graph encoders by maximizing similarity between positive samples and minimizing it between negative ones, has emerged as a mainstream graph pre-training paradigm. It is widely recognized that positive samples are essential in GCLs. Ideally, maximizing the similarity of positive samples enables graph encoders to capture intrinsic semantics and patterns of graph data. However, we discover an interesting phenomenon: GCLs can achieve competitive performance even without positive samples. This motivates us to revisit the fundamental mechanism of positive samples in GCLs. From the perspective of Dirichlet energy, we theoretically find that message passing, a key mechanism in graph encoders, trivializes the maximization of positive samples, preventing GCLs from effectively learning from positive samples. To address this, we propose SPGCL to mitigate the trivialization caused by message passing and restore the learning efficacy of positive samples. Specifically, we find that high Dirichlet energy features help positive samples provide effective learning signals, while low Dirichlet energy features contribute little to the positive learning signal but are useful for positive sampling. Based on this, SPGCL propagates only high Dirichlet energy features and uses low energy features to construct a probability matrix for reliable positive sampling. Extensive experiments demonstrate the effectiveness of SPGCL.
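The smoothing effect of message passing on Dirichlet energy can be seen on a toy graph. The sketch below assumes the unnormalized energy (sum of squared feature differences over edges) and a simple mean aggregation with self-loops; it illustrates the general phenomenon only and is not the SPGCL method.

```python
def dirichlet_energy(features, edges):
    """Sum over undirected edges of the squared feature difference."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
        for i, j in edges
    )

def mean_message_passing(features, edges):
    """One round of mean aggregation over each node and its neighbors."""
    n = len(features)
    neighbors = {i: [i] for i in range(n)}  # self-loop included
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dim = len(features[0])
    return [
        [sum(features[k][d] for k in neighbors[i]) / len(neighbors[i])
         for d in range(dim)]
        for i in range(n)
    ]

# Toy path graph 0-1-2-3 with an alternating (high-energy) feature.
edges = [(0, 1), (1, 2), (2, 3)]
features = [[0.0], [1.0], [0.0], [1.0]]
before = dirichlet_energy(features, edges)
after = dirichlet_energy(mean_message_passing(features, edges), edges)
print(before, after)
```

In this toy case one round of aggregation removes most of the energy, so neighboring nodes already look alike before any contrastive objective is applied, which is consistent with the trivialization argument in the abstract.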

2606.10281 2026-06-10 cs.CR cs.CL New submission

Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

Aniket Anand, Yiwei Hou, Daniel Fields, Alex Kantchelian, David Tao, Kurt Thomas, Grant Ho

AI summary: Presents the AuditBench benchmark dataset for evaluating LLMs on security audit log analysis across four common investigation tasks, revealing how performance and error types vary with design choices.

Abstract

This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

2606.10279 2026-06-10 cs.AI cs.CL cs.LG New submission

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

AI summary: Finds that supervised fine-tuning with synthetic rationale data substantially degrades model performance on clinical prediction tasks, with the root cause being a structural conflict between narrative plausibility and discriminative optimization.

Abstract

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

2606.10278 2026-06-10 cs.SD cs.AI New submission

Towards Robust Arabic Speech Emotion Recognition with Deep Learning

Youcef Soufiane Gheffari, Samiya Silarbi

AI summary: Addresses dialectal diversity and data scarcity in Arabic speech emotion recognition with a hybrid CNN-Transformer architecture that reaches 98.1% accuracy on the EYASE and BAVED datasets.

Comments: 21 pages, 16 figures, 11 tables. Submitted manuscript

Abstract

Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies. To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models leverage MFCC and spectrogram-based representations, while wav2vec 2.0 operates directly on raw audio through self-supervised representations. Experiments conducted on the EYASE and BAVED datasets demonstrate that the proposed CNN-Transformer architecture significantly outperforms the other models, achieving an accuracy of 98.1 percent. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling. The main contribution of this work lies in providing a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and in demonstrating that CNN-Transformer architectures offer a robust solution for capturing both spectral and long-range dependencies in low-resource and dialectally diverse settings.

2606.10277 2026-06-10 cs.LG New submission

A Unified Adaptive Feature Composition Framework for Multi-Task Generalization in Wireless Foundation Models

Yuxuan Shi, Tingting Yang, Kangning Ma, Liwen Jing, Yuwei Wang, Mengfan Zheng, Li Sun

AI summary: Proposes the RAFC routing adapter, which uses a lightweight task-driven network to dynamically compose hidden features from the Transformer's layers, enabling multi-task generalization for wireless foundation models with fewer than 50K additional parameters.

Abstract

Though wireless foundation models (WFMs) have shown strong potential in learning universal channel representations, their adaptation to various downstream tasks remains constrained by existing paradigms. Fine-tuning strategies introduce substantial computational and storage overhead, while frozen feature extraction leads to sub-optimal performance across diverse downstream tasks. To address this issue, we propose a unified adaptive feature composition framework for multitask generalization in WFMs, where the key component is the Routing Adapter for Feature Composition (RAFC). Instead of extracting only the final-layer output, this router treats the hidden states from different Transformer depths as a reusable pool of multi-level hidden features, employs a lightweight task-driven feature composition network to generate layer-wise aggregation weights, and then adaptively combines hierarchical representations through weighted summation. This design enables each downstream task to access a suitable mixture of low-, mid-, and high-level wireless features without modifying the pretrained backbone. Extensive experiments on four representative wireless tasks demonstrate that RAFC consistently outperforms conventional adaptation baselines while introducing fewer than 50K additional parameters. Moreover, the learned routing weights provide interpretable evidence of task-specific layer preferences, making the proposed framework a low-complexity, scalable, and explainable interface for adapting WFMs to diverse downstream scenarios.
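The layer-wise weighted summation described here can be sketched as follows. This is a minimal illustration, assuming the routing network outputs one raw score per layer and that the scores are normalized with a softmax; both the normalization and the function names are assumptions, since the abstract does not specify them.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compose_layers(hidden_states, layer_scores):
    """Weighted sum of per-layer hidden vectors from a frozen backbone.

    hidden_states: one feature vector per Transformer depth.
    layer_scores: one raw routing score per depth (assumed to come from a
    small task-specific network, which is not modeled here).
    """
    weights = softmax(layer_scores)
    dim = len(hidden_states[0])
    composed = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return composed, weights

# Toy hidden states from three depths of a frozen backbone.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
composed, weights = compose_layers(hidden_states, [0.0, 0.0, 0.0])
print(composed, weights)  # equal scores give a plain average over layers
```

Because only the routing scores are learned, the backbone stays untouched, and the resulting weights can be read directly as each task's preference over layers.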

2606.10276 2026-06-10 cs.RO cs.AI 新提交

Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

基于语言和自我中心人类信号的分层策略用于自然人机交互

Dongjun Lee, Juheon Choi, Dong Kyu Shin, Sinjae Kang, Kimin Lee

AI总结 提出EDITH框架,通过智能眼镜捕捉人类第一人称视角、注视和语言信号,设计分层策略将非语言信号与语言指令结合,实现更自然的人机交互,减少用户表达意图的负担。

详情
Comments
We provide video demos and code in: https://project-edith.github.io
AI中文摘要

为了实现自然的人机交互,机器人必须理解人类不仅通过语言,还通过手势和注视等非语言信号表达的意图。然而,当前的机器人策略仅依赖语言指令作为传达意图的唯一接口,忽略了非语言信号,将全部沟通负担放在语言上。在这项工作中,我们提出了EDITH,一个机器人框架,通过智能眼镜的连续第一人称视角和注视流捕捉人类的非语言信号,并将其与语言指令一起作为机器人策略的输入。我们的硬件系统实时将人类的第一人称视角、注视和语音传输给机器人,并将语音转录为语言指令。为了处理这些丰富但嘈杂的信号,我们设计了一个分层策略,其中高层策略推断人类的意图并生成一系列子任务,每个子任务表示为一个细粒度指令,配有一个关键帧,将意图锚定在场景中(例如,人类指向目标物体的帧)。然后低层策略执行这些子任务。在我们的人机交互任务实验中,即使意图仅被短暂表达,EDITH也能使机器人根据人类的非语言信号行动,并且与仅使用语言指令相比,显著减少了用户传达意图的努力。请访问我们的项目页面获取源代码和真实机器人演示视频。

英文摘要

For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication on language. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.

2606.10275 2026-06-10 cs.CV 新提交

FoA-SR: Faithful or Aesthetic? Profile-Aware Preference Optimization for Real-World Image Super-Resolution

FoA-SR: 忠实还是美观?面向真实世界图像超分辨率的轮廓感知偏好优化

Amjad Mahdi Alqarni, Peizhong Ju

AI总结 提出FoA-SR,基于偏好优化实现真实世界图像超分辨率,通过忠实和美观两种轮廓分别优化适配器,在RealSR和DIV2K上验证了可分离的恢复策略。

详情
Comments
17 pages, 6 figures, 9 tables. Preprint
AI中文摘要

真实世界图像超分辨率(SR)通常设计为单一恢复目标,尽管当前生成模型能够为同一输入产生多个高质量重建。本文认为,最佳恢复策略取决于特定的恢复轮廓:忠实恢复优先考虑参考一致性、结构保持和幻觉抑制,而美观恢复优先考虑视觉愉悦和自然细节。我们提出FoA-SR,一种基于轮廓的新型真实世界SR偏好优化方法。为实现此目标,FoA-SR从我们的监督式FLUX.2-based SR适配器(Flux2SR)开始,该适配器通过LR潜在条件、流匹配和图像空间重建损失进行配对LR到HR图像超分辨率训练。在开发共享监督式超分辨率适配器后,FoA-SR为每个输入图像生成共享随机候选池,并使用轮廓特定的忠实和美观奖励对相同候选进行排序,以挖掘胜者-败者对。这些对用于微调单独的LoRA适配器,同时保持基础模型冻结。在RealSR和DIV2K上的实验表明,FoA-SR可以将同一SR适配器导向不同的恢复目标:忠实适配器改善参考一致性指标,而美观适配器提升无参考感知质量指标。我们的候选池分析显示,忠实和美观奖励经常选择不同的胜者,而Hybrid-LoRA消融表明,将两个轮廓合并为一个奖励会产生隐式折衷,而非显式轮廓控制。

英文摘要

Real-world image super-resolution (SR) is often designed with a single restoration objective, despite the current capacity of generative models to produce multiple high-quality reconstructions for the same input. In this paper, we argue that the best restoration strategy is subject to the specific restoration profile: a Faithful restoration prioritizes reference consistency, structure preservation, and hallucination suppression, whereas an Aesthetic restoration prioritizes visually pleasing and natural-looking details. We propose FoA-SR, a novel preference optimization approach to real-world SR based on profiles. To achieve this goal, FoA-SR starts with our supervised FLUX.2-based SR adapter (Flux2SR) trained with LR latent conditioning, flow matching, and image-space reconstruction losses for paired LR-to-HR image super-resolution. Following the development of the shared supervised super-resolution adapter, FoA-SR generates a shared stochastic candidate pool for each input image and ranks the same candidates using profile-specific Faithful and Aesthetic rewards to mine winner-loser pairs. These pairs are used to fine-tune separate LoRA adapters while keeping the base model frozen. Experiments on RealSR and DIV2K show that FoA-SR can steer the same SR adapter towards distinct restoration objectives: a Faithful adapter improves reference-consistent metrics while an Aesthetic adapter boosts metrics that measure perceptual quality without reference. Our candidate-pool analysis shows that Faithful and Aesthetic rewards frequently select different winners, and a Hybrid-LoRA ablation shows that collapsing both profiles into one reward yields an implicit compromise rather than explicit profile control.

2606.10273 2026-06-10 cs.RO 新提交

Locomotion analysis of a quadruped interacting with the lunar granular surface

四足机器人月球颗粒表面交互的运动分析

Yash J Vyas

AI总结 通过强化学习训练四足机器人在模拟月球颗粒表面运动,对比刚性与软接触环境下的步态和能耗,发现软接触增加训练难度、改变步态并提高能量消耗。

详情
AI中文摘要

在星外环境中部署腿式机器人面临许多挑战,包括复杂的地形交互、能量和热约束。为了有效设计月球探测四足机器人的机械结构,需要仔细考虑电机扭矩、能量消耗和运输成本。月球表面由颗粒状风化层组成,这会影响腿式机器人的运动及其性能。基于刚性接触假设训练的运动算法在应用于软接触环境(如颗粒表面)时也无效,可能导致不稳定和跟踪不良。在本报告中,将月球颗粒表面-机器人足部接触的物理建模应用于使用强化学习训练运动的仿真环境。对在刚性接触和软接触环境下训练的策略进行比较,分析步态和运动性能指标。分析表明,模拟风化层表面的软接触给基于强化学习的训练带来了额外挑战,导致步态定性差异,并增加了总体能量消耗。

英文摘要

Deploying legged robots in extraterrestrial environments poses many challenges, including complex terrain interactions and energy and thermal constraints. For effective mechanical design of a lunar exploration quadrupedal robot, careful consideration of motor torques, energy expenditure, and cost of transport is required. The lunar surface is composed of granular regolith, which impacts the locomotion of legged robots and their performance. Locomotion algorithms trained with rigid contact assumptions are also ineffective when applied to environments with soft contacts, such as granular surfaces, which can result in instability and poor tracking. In this report, a physical model of the contact between the robot foot and the granular lunar surface is applied to a simulation environment in which locomotion is trained using Reinforcement Learning. A comparison is conducted between policies trained in rigid-contact and soft-contact environments, analysing the gait and locomotion performance metrics. The analysis demonstrates that soft contacts simulating regolith surfaces pose additional challenges for Reinforcement Learning based training, result in a qualitatively different gait, and increase the overall energy expenditure.

2606.10267 2026-06-10 cs.RO cs.AI cs.LG 新提交

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

机器人策略编排的关键因素:分层VLA智能体的系统研究

Jiaheng Hu, Mohit Shridhar, Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang, Jie Tan, Annie Xie

AI总结 系统研究分层视觉-语言-动作(Hi-VLA)系统的设计原则,通过统一框架分析规划器、控制器及接口机制对短时、长时及推理密集型任务性能的影响,提出构建更强健分层VLA智能体的实用原则。

详情
AI中文摘要

分层视觉-语言-动作(Hi-VLA)系统已成为复杂机器人操作的一种有前景的范式,它通过使用高层VLM规划器将任务分解为语言子目标,由低层VLA控制器执行。尽管近期取得了实证进展,但这些系统缺乏统一的设计原则:现有的Hi-VLA系统在选择和连接规划器、控制器、两者之间的切换机制以及规划器中观测和记忆的表示方式上存在差异。在本文中,我们对机器人操作的Hi-VLA设计进行了系统研究。我们将代表性的Hi-VLA智能体统一在一个选项式控制框架下,并在短时、长时和推理密集型任务上基准测试核心设计选择。我们的分析提炼出构建Hi-VLA系统的实用原则,展示了模型选择和接口机制如何共同塑造性能。应用这些原则,在仿真和真实ALOHA机器人上的实验中,我们得到了一个比平面VLA控制或朴素设计的分层系统都显著更强的系统。总体而言,我们的结果为构建更强大、更鲁棒且更有原则的分层VLA智能体奠定了基础。更多信息和视频请访问 jiahenghu.github.io/hi-vla。

英文摘要

Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at jiahenghu.github.io/hi-vla.

2606.10254 2026-06-10 cs.AI cs.CL 新提交

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

RealMath-Eval:为何SOTA裁判难以应对真实人类推理

Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

AI总结 提出RealMath-Eval基准,评估LLM裁判对真实学生数学解答的评分能力,发现与人类评分存在高均方误差,而合成数据上表现更好,揭示评估差距源于人类错误空间的多样性和高信息熵。

详情
Comments
Code available at https://github.com/RicharMd/RealMath-Eval , Data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval
AI中文摘要

尽管大型语言模型(LLM)在解答高中数学方面已接近完美,但它们评估真实学生多样化推理过程的能力仍未得到充分检验。为弥补这一差距,我们引入了RealMath-Eval,一个严格标注的基准,包含224份来自高中的真实考试答卷。我们的初步评估显示,即使是最先进的LLM裁判在此任务上也表现不佳,与人类专家评分相比呈现出高均方误差(~2.96)。为探究可能的原因,我们将此表现与同一裁判评估合成LLM生成解答的控制设置进行对比。我们识别出一个明显的“评估差距”:裁判在合成文本上准确性和一致性显著更高(MSE ~1.17),但难以泛化到真实学生推理。通过语义嵌入分析,我们发现合成错误会“结构坍缩”为可预测的低维线性子空间,而人类错误则形成更多样的错误空间。此外,生成概率探测表明,人类推理涉及显著更高的信息论惊喜度,表明学生推理转换对当前模型而言更加分布外。最后,我们发现表面层面的风格迁移无法弥合这一差距。我们的发现表明,当前严重依赖合成数据的LLM评估流程可能无法充分捕捉真实学生数学推理的多样性。

英文摘要

While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error (~2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark "Evaluation Gap": judges are considerably more accurate and consistent on synthetic text (MSE ~1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a "structural collapse" into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.

2606.10250 2026-06-10 cs.LG cs.AI 新提交

Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning

联邦学习中不平衡的多层次分析以解决非独立同分布问题

Haengbok Chung, Jae Sung Lee

AI总结 提出FedBB算法,通过PNB损失函数和CBR重加权分别解决本地训练中的类内/类间不平衡和客户端间不平衡,在X射线和自然图像数据集上优于现有方法。

详情
Journal ref
Neurocomputing, Volume 626, 2025, Article 129528
Comments
27 pages, 5 figures, 13 tables. Accepted for publication in Neurocomputing (2025). Author Accepted Manuscript
AI中文摘要

类别不平衡是深度学习中常见的问题,会严重降低性能。在联邦学习(FL)中,它是导致非独立同分布数据(non-IID)的关键因素。基于先前的一些尝试,我们在三个层次上定义并分析了FL中的不平衡问题:案例间、类别间和客户端间。案例间不平衡处理每个单一类别内的不平衡;类别间不平衡比较不同类别之间的数据数量。客户端间不平衡表示不同客户端之间本地数据的偏斜程度。基于这些概念,我们提出了FedBB,它由两个主要部分组成:(1)正负平衡(PNB)损失函数解决了本地训练中的案例间和类别间不平衡,增强了高度偏斜的本地客户端数据集上的泛化能力。它通过为少数案例或类别分配更高的权重来优化多标签和多类分类。(2)客户端平衡重加权(CBR)在模型聚合期间根据客户端间不平衡重新加权客户端,为在偏斜较小的数据集上训练的模型赋予更大的权重。在X射线和自然图像数据集上的各种实验表明,FedBB在性能和效率上均优于其他算法。此外,它只需要有限的统计信息,这有利于隐私保护。通过消融研究,我们证明了PNB损失和CBR独立地贡献于性能。由于FedBB旨在构建一个能准确分类所有类别的全局模型,它可以作为通用和个性化FL的基线。

英文摘要

Class imbalance is a common problem in deep learning that severely degrades performance. In federated learning (FL), it is a critical factor contributing to non-independent and identically distributed (non-IID) data. Building on several previous attempts, we define and analyze imbalance issues in FL at three levels: inter-case, inter-class, and inter-client. Inter-case imbalance concerns the imbalance within each single class; inter-class imbalance compares the amount of data across different classes. Inter-client imbalance represents the differing skewness of local data across clients. Based on these concepts, we propose FedBB, which consists of two main components: (1) The Positive Negative Balanced (PNB) loss function addresses the inter-case and inter-class imbalances in local training, enhancing generalization on highly skewed local client datasets. It optimizes both multi-label and multi-class classification by assigning higher weights to minority cases or classes. (2) Client Balanced Reweighting (CBR) reweights clients based on inter-client imbalance during model aggregation, giving greater weight to models trained on less skewed datasets. Various experiments on X-ray and natural image datasets demonstrate that FedBB outperforms other algorithms in both performance and efficiency. Additionally, it requires only limited statistical information, which is beneficial for privacy protection. Through ablation studies, we show that the PNB loss and CBR contribute to performance independently. As FedBB aims to build a global model that accurately classifies all classes, it can serve as a baseline for both generic and personalized FL.

2606.10249 2026-06-10 cs.LG cs.SI 新提交

When Design Rules Break: Benchmark Composition Determines Whether Label Informativeness Predicts GNN Aggregator Choice

当设计规则失效:基准组成决定标签信息性是否预测GNN聚合器选择

Neha Sharma, Ritesh Sharma

AI总结 研究图神经网络聚合器选择(sum/mean/max)在24个节点分类数据集上的泛化性,发现标签信息性仅在传统基准上有效,在Facebook-100密集图中失效,且谱间隙能区分这些图。

详情
AI中文摘要

我们通过研究在24个节点分类数据集(涵盖引文、异嗜、LINKX Facebook-100、共同购买和共同作者图)上的聚合器选择(sum、mean、max),检验图神经网络(GNN)设计规则是否跨基准族泛化。边同嗜性仅能微弱预测GIN-Sum与GIN-Mean的性能差距。标签信息性在传统基准上能很好地预测这一差距,但当包含Facebook-100图时,预测能力大幅下降。在这些密集的朋友关系网络中,接近零的标签信息性与对sum聚合的强烈偏好共存,在扩展训练下产生7-10%的提升,最高达13%。随机块模型消融实验(包括匹配Facebook-100度规模的度修正变体)未能重现这一行为,表明平均度本身不能解释该效应。在若干与标签无关的图统计量中,谱间隙唯一地将这些图与其他低信息性数据集区分开来,该效应局限于单跳邻域并在不同架构中复现。我们进一步识别了与聚合器选择交互的训练机制,并表明PNA在标准引文基准上可能不如最佳单聚合器GIN。我们的结果表明,决定设计规则是否看似泛化的是基准组成而非数值不足,并且Facebook-100基准为未来的自适应聚合方法提供了具体目标。

英文摘要

We examine whether graph neural network (GNN) design rules generalize across benchmark families by studying aggregator selection (sum, mean, max) on 24 node-classification datasets spanning citation, heterophilic, LINKX Facebook-100, co-purchase, and co-authorship graphs. Edge homophily is only weakly predictive of the GIN-Sum versus GIN-Mean performance gap. Label informativeness predicts this gap well on legacy benchmarks but degrades substantially when Facebook-100 graphs are included. In these dense friendship networks, near-zero label informativeness coexists with a strong preference for sum aggregation, producing gains of 7-10% and up to 13% under extended training. Stochastic block model ablations, including degree-corrected variants matching Facebook-100 degree scales, fail to reproduce this behavior, indicating that mean degree alone does not explain the effect. Among several label-independent graph statistics, the spectral gap uniquely distinguishes these graphs from other low-informativeness datasets, with the effect localized to one-hop neighborhoods and replicated across architectures. We further identify training regimes that interact with aggregator choice and show that PNA can underperform the best single-aggregator GIN on standard citation benchmarks. Our results suggest that benchmark composition, rather than numerical insufficiency, determines whether design rules appear to generalize, and that the Facebook-100 regime provides a concrete target for future adaptive aggregation methods.
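
Two of the quantities this abstract refers to, edge homophily and the sum/mean/max aggregators, can be illustrated with a small sketch. The code below uses the common definition of edge homophily (the fraction of edges whose endpoints share a label) and scalar neighbor features; the toy graph and helper names are hypothetical and are not taken from the paper.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("edge_homophily needs at least one edge")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def aggregate(values, mode):
    """Aggregate scalar neighbor features with sum, mean, or max."""
    if not values:
        raise ValueError("aggregate needs at least one neighbor")
    if mode == "sum":
        return sum(values)
    if mode == "mean":
        return sum(values) / len(values)
    if mode == "max":
        return max(values)
    raise ValueError(f"unknown mode: {mode}")

# Toy 4-cycle with two classes: two of the four edges join same-label nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
h = edge_homophily(edges, labels)

# A node with one neighbor versus a node with three identical neighbors.
few = [1.0]
many = [1.0, 1.0, 1.0]
```

In this toy case mean and max return the same value for `few` and `many`, while sum separates them, which is one reason sum aggregation can matter on dense graphs where neighborhood size carries signal.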

2606.10246 2026-06-10 cs.SD cs.AI cs.LG 新提交

Linguistically Augmented Audio Speech Data (LinguAS)

语言增强音频语音数据 (LinguAS)

Ashley R. Keaton, Zahra Khanjani, Christine Mallinson, Vandana P. Janeja

AI总结 提出LinguAS数据集,通过专家定义的语言特征(EDLFs)增强音频数据,显著提升深度伪造语音检测模型性能。

详情
AI中文摘要

恶意创建的伪造语音,包括深度伪造和欺骗音频,正以惊人速度扩散,检测模型竞相保持领先。然而,大多数检测模型仅基于帧级音频特征进行推理,未利用更大时间尺度上的有价值语言线索。为弥补这一空白,我们提出语言增强音频语音数据(LinguAS),这是一个包含真实和深度伪造音频样本的数据集,标注了五种策略性选择的、专家定义的语言特征(EDLFs),这些特征在英语口语中频繁出现且是自然人类语音的特征。LinguAS包含超过800个音频样本,每个样本都标注了EDLFs。数据集包含四种欺骗音频攻击类型的平衡数量以及相应数量的真实语音样本。我们还包含说话者性别和每个欺骗音频样本的生成器/来源元数据,为模型训练提供更细粒度信息。我们发现,使用EDLFs增强数据训练的模型性能显著超过ASVspoof 2021深度学习基线和HuBert、XLSR等SSL模型。LinguAS增强的语言、性别和生成器元数据为音频深度伪造研究者提供了一个强调真实人类语言特征的数据集,以改进伪造语音的模型推理。数据和代码已公开。

英文摘要

Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet most detection models are trained to make inferences from frame-level audio features alone, without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically-chosen, Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which is annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs performed significantly better than the ASVspoof 2021 deep learning baselines and SSL models such as HuBert and XLSR. LinguAS's augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.

2606.10244 2026-06-10 cs.RO cs.AI 新提交

YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

YUBI:面向大规模双手灵巧操作的通用双指接口

Takehiko Ohkawa, Jumpei Arima, Yuki Noguchi, Masatoshi Tateno, Makoto Sugiura, Takuya Okubo, Kengo Ikeuchi, Yuma Shin, Hiroki Nishizawa, Naoaki Kanazawa, Yuki Wakayama, Daiki Fukunaga, Koshi Makihara, Tomohiro Motoda, Floris Erich, Yukiyasu Domae, Tatsuya Matsushima, Yohishiro Okumatsu, Kei Ota

AI总结 提出YUBI手指对齐夹爪,通过屈服式手指驱动映射实现直观、符合人体工学的双手灵巧操作数据采集,构建8434小时/120万集/119任务数据集,单策略跨多机器人迁移。

详情
Comments
Project page: https://yubi.airoa.io/
AI中文摘要

我们引入了Yielding Universal Bidigital Interface (YUBI),一种手指对齐的夹爪,旨在实现双手灵巧操作的直观、符合人体工学且可扩展的数据采集。虽然手持数据采集系统(如Universal Manipulation Interface (UMI))实现了低成本数据采集,但其笨重的手枪式握把设计可能给精细灵巧操作任务带来人体工学和使用性挑战。为此,YUBI提出了一种独特的设计原则:屈服式手指驱动,将人类手指运动直接映射到夹爪钳口运动。使用YUBI设备,我们建立了一个集成基于VR的6自由度夹爪跟踪的数据采集系统,确保高保真轨迹数据获取。我们整理了一个前所未有的基于UMI的数据集:8434小时,涵盖120万集和119个任务。实验表明,YUBI在复杂双手任务的通用性、灵巧性和操作效率方面优于UMI夹爪。通过在多个平台上安装夹爪,在YUBI数据集上训练的单一策略可迁移到多个双手机器人(UR、Franka和ELEY),证实采集的数据可直接作为策略监督执行。我们发布了夹爪硬件、数据采集软件和数据集作为集成堆栈,为开放社区提供可复现的大规模数据采集路径,以推动机器人基础模型的发展。

英文摘要

We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.

2606.10243 2026-06-10 cs.LG 新提交

DUET -- Dual User Embedding Transformers for Offsite Conversion Prediction

DUET -- 用于站外转化预测的双用户嵌入变换器

Reazul Hasan Russel, Mingwei Tang, Rostam Shirani, Xinlong Liu, Navid Madani, Leo Ding, Yawen He, Xiangyu Wang, Mustafa Acar, Ashish Katiyar, Yuhai Li, Alan Yang, Metarya Ruparel, Derek Qiang Xu, Rupert Wu, Rui Yang, Liang Tao, Xinyi Zhao, Larry Zhang, Sri Reddy, Rob Malkin

AI总结 针对点击信号丰富但转化信号稀疏、延迟的问题,提出DUET框架,通过为点击和转化流分别预训练专用变换器编码器,生成互补嵌入,在服务延迟约束下提升站外转化率预测精度。

详情
AI中文摘要

站外转化率(OCVR)预测是计算推荐系统中一个重要的排序问题。该任务面临建模挑战:点击信号丰富且时间跨度短,而转化信号本质稀疏、延迟长且常无法归因。尽管存在这些统计差异,两种信号都必须为在严格服务延迟约束下运行的模型提供信息。先前的预训练方法使用单一、无差别的编码器统一应用于两个数据流。我们提出DUET(双用户嵌入变换器),该框架明确将用户行为数据划分为两个领域一致的流——点击和转化——并为每个流预训练专用变换器编码器,其架构针对各流的统计特征定制:密集点击流使用多层自注意力,稀疏转化流使用交错交叉和自注意力。生成的互补嵌入由下游排序器联合使用,而不超出服务延迟预算。评估显示,相对于最强基线,归一化熵(NE)降低高达0.38%,A/B测试显示OCVR预测精度持续提升。

英文摘要

Offsite conversion rate (OCVR) prediction is an important ranking problem in computational recommendation systems. This task presents a modeling challenge: click signals are abundant and exhibit short temporal horizons, whereas conversion signals are inherently sparse, long-delayed, and frequently unattributed. Despite these statistical disparities, both signal types must inform models that operate within strict serving-latency constraints. Prior pre-training approaches address this heterogeneity with a single, undifferentiated encoder applied uniformly across both data streams. We propose DUET (Dual User Embedding Transformers), a framework that explicitly partitions user behavioral data into two domain-coherent streams -- clicks and conversions -- and pre-trains dedicated transformer encoders with architectures tailored to each stream's statistical characteristics: multi-layer self-attention for the dense click stream and interleaved cross- and self-attention for the sparse conversion stream. The resulting complementary embeddings are jointly consumed by a downstream ranker without exceeding serving-latency budgets. Evaluation demonstrates up to 0.38% normalized entropy (NE) reduction relative to the strongest baseline, and an A/B test shows consistent improvements in OCVR prediction accuracy.
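
The abstract does not define normalized entropy (NE). The sketch below uses the definition commonly found in ads-prediction work, the average log loss divided by the entropy of the empirical base rate, and assumes that this is the metric meant. The toy labels and predictions are hypothetical.

```python
import math

def normalized_entropy(labels, preds):
    """Average log loss divided by the entropy of the empirical base rate.

    labels: 0/1 outcomes; preds: predicted probabilities strictly in (0, 1).
    A model that always predicts the base rate scores 1.0; lower is better.
    """
    n = len(labels)
    base = sum(labels) / n
    if base <= 0.0 or base >= 1.0:
        raise ValueError("labels must contain both classes")
    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, preds)
    ) / n
    base_entropy = -(base * math.log(base) + (1 - base) * math.log(1 - base))
    return log_loss / base_entropy

labels = [1, 0, 0, 0]
ne_base_rate = normalized_entropy(labels, [0.25, 0.25, 0.25, 0.25])
ne_informed = normalized_entropy(labels, [0.7, 0.1, 0.1, 0.1])
```

Under this definition, a relative NE reduction compares two such values, so a small percentage change can still reflect a meaningful gain when the base rate is very low.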

2606.10241 2026-06-10 cs.AI 新提交

Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph

Regimes: 一种可审计的、保留验证集的改进循环——在ActiveGraph上的LongMemEval演示

Yohei Nakajima

AI总结 提出Regimes,一种基于事件溯源的可审计自主改进循环,通过ActiveGraph运行时实现故障记录、重放和保留集验证,在LongMemEval-S上提升0.05-0.10的准确率。

详情
Comments
30 pages, 5 figures. Code and committed runs: https://github.com/yoheinakajima/regimes
AI中文摘要

自主改进循环难以信任,因为改进过程通常是附加在智能体上的外部脚手架:故障未被记录,诊断无法重放,提升或丢弃决策落入侧数据库而非智能体自身历史。我们证明,事件溯源智能体运行时消除了这种摩擦,将受控改进转化为一等工作流。当智能体状态是仅追加事件日志的确定性投影时,故障被记录,运行从日志精确重放,候选补丁限定于类型化管道接缝,门控可审计,每次提升或丢弃本身也是一个事件。我们通过Regimes演示了这一点,这是ActiveGraph运行时上的一个循环,诊断失败的评估,在管道点提出修复,并仅在静态检查、沙盒执行、样本内评估和保留验证后提升。该循环与目标无关:相同的控制流通过通用接口针对不同任务运行。在LongMemEval-S上,主要失败不是检索而是调和:证据已在汇编上下文中,但阅读器回答错误。在五个种子保留集划分中,Regimes发现阅读器提示修复,在四个划分中将最终保留准确率提升+0.05至+0.10,在一个过度提升划分中提升+0.01;两个划分单独显著(种子5未针对其顺序提升结构调整),汇总计数仅为描述性,因为划分共享一个500问题池。持久贡献包括:ActiveGraph作为使受控改进循环可行的可审计基础,其支持的保留集门控循环,将每个故障路由到管道位置的失败机制分类(其相对于无路由基线的边际价值是主要开放问题),以及提示即发现探针的假设。

英文摘要

Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent's own history. We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent's state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event. We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface. On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant (seed 5 unadjusted for its sequential promotion structure), and the pooled count is descriptive only, since the splits share one 500-question pool. The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location (whose marginal value over an unrouted baseline is the primary open question), and the prompt-as-discovery-probe hypothesis.
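
The idea that agent state is a deterministic projection of an append-only event log can be sketched in a few lines. This is an illustrative toy, not the ActiveGraph API: the `Event` type, the event kinds, and the payload strings are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical kinds: "failure", "promote", "discard"
    payload: str

def project(log):
    """Fold an event log into a state dict; same log in, same state out."""
    state = {"failures": [], "promoted": [], "discarded": []}
    for event in log:
        if event.kind == "failure":
            state["failures"].append(event.payload)
        elif event.kind == "promote":
            state["promoted"].append(event.payload)
        elif event.kind == "discard":
            state["discarded"].append(event.payload)
    return state

# The log is only ever appended to; decisions are events like any other.
log = []
log.append(Event("failure", "example-question-1"))
log.append(Event("discard", "candidate-patch-1"))
log.append(Event("promote", "candidate-patch-2"))

state = project(log)
replayed = project(list(log))  # replaying a copy of the log gives the same state
```

Because `project` reads nothing except the log, an audit amounts to re-running it over the stored events and comparing the result with the recorded state.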

2606.10237 2026-06-10 cs.AI cs.LG 新提交

Minimalist Genetic Programming

极简遗传编程

Leonardo Trujillo

AI总结 提出极简遗传编程(MGP),借鉴语言学中的极简主义程序,用MERGE操作替代进化搜索,在符号回归任务中有效避免膨胀,稳定找到精确解。

详情
AI中文摘要

遗传编程(GP)基于两个重要见解。首先,任何学习任务从根本上都可以视为程序归纳问题,目标是构建表示为语法树的符号层次模型。其次,将此任务视为搜索问题,并使用进化来定位所需模型。自提出以来,GP在广泛的任务和问题领域中取得了显著成果。本文通过修改GP的第二个核心见解,将问题视为句法推导任务,提出了一种替代观点。具体来说,本文提出了极简遗传编程(MGP),该算法与GP一样受生物启发,但并非源自进化,而是从人类语言的极简主义程序中汲取灵感,其中句法被理解为连接其他两个心智系统的最优解决方案。在极简主义中,核心计算过程是一个称为MERGE的二元集合形成算子,它可以通过简单的马尔可夫过程逐步构建复杂的句法结构。MGP能够发现符号表达式的核心构建块,并使用MERGE逐步组合它们。所提出的系统在已知因膨胀倾向而难以用标准GP系统解决的符号回归任务上进行了基准测试。结果表明,当选择适当的原子句法对象词典时,MGP能够在一组标准GP难以做到同样任务的符号回归中一致地产生精确的真实模型。极简主义提供的见解被证明与程序归纳问题相关,并且基于MGP在这项工作中展示的潜力,应进一步探索。

英文摘要

Genetic programming (GP) is based on two important insights. The first is that any learning task can fundamentally be posed as a program induction problem, where the goal is to construct a symbolic hierarchical model that is expressed as a syntax tree. The second is to pose this task as a search problem and use evolution to locate the desired model. Since it was proposed, GP has produced notable results in a wide range of tasks and problem domains. This work presents an alternative view by modifying the second core insight of GP, posing the problem as a syntactic derivation task instead. In particular, this paper presents Minimalist Genetic Programming (MGP), an algorithm that, like GP, is biologically inspired, but instead of evolution it takes inspiration from the Minimalist Program for human language, in which syntax is understood as an optimal solution to the problem of linking two other mental systems. In minimalism, the core computational process is a binary set formation operator called $MERGE$, that can be used to incrementally construct complex syntactic structures using a simple Markovian process. MGP is able to discover the core building blocks of the symbolic expressions, and to incrementally combine them using $MERGE$. The proposed system is benchmarked on symbolic regression tasks that are known to be difficult to solve with standard GP systems because of the propensity for bloat. Results show that when a proper lexicon of atomic syntactic objects is chosen, MGP is able to consistently produce the exact ground truth model on a set of symbolic regression problems where standard GP struggles to do the same. The insights provided by minimalism are shown to be relevant to the problem of program induction, and should be explored further based on the potential exhibited by MGP in this work.
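
As a rough illustration of binary set formation, MERGE can be modeled as an operator that takes two syntactic objects and returns the unordered set containing them. The sketch below shows only that operator and an incremental build from a small lexicon. How MGP selects objects to merge, or maps the resulting sets to evaluable expressions, is not described in the abstract, so the lexicon and the `depth` helper are purely hypothetical.

```python
def merge(a, b):
    """Binary set formation: MERGE(a, b) = {a, b}, an unordered set."""
    return frozenset((a, b))

def depth(obj):
    """Nesting depth of a syntactic object; atoms have depth 0."""
    if isinstance(obj, frozenset):
        return 1 + max(depth(x) for x in obj)
    return 0

# Hypothetical lexicon of atomic syntactic objects.
lexicon = ("x", "1", "+")

# Build structure incrementally: first {x, 1}, then {+, {x, 1}}.
inner = merge("x", "1")
outer = merge("+", inner)
```

Each application of `merge` adds exactly one level of structure, so the size of the result is bounded by the number of merges performed, which is the kind of property that would help limit bloat.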

2606.10229 2026-06-10 cs.RO cs.LG 新提交

What Demonstration Curation Metrics Do to Your Policy

演示筛选指标对策略的影响

Aarav Bedi

AI总结 研究演示筛选指标在检测缺陷演示后,是否提升基于行为克隆的策略性能。发现指标检测缺陷的能力与策略性能严重脱钩,并揭示演示时长作为混淆变量的影响。

详情
Comments
6 pages, 1 figure, 2 tables
AI中文摘要

我们研究了检测缺陷训练演示的筛选指标是否也能改善基于筛选数据训练的行为克隆策略。在一个接触密集的LIBERO抓取放置基准任务中,通过引入受控结构缺陷(搬运阶段早期释放夹爪),我们发现这两个量是严重解耦的。具有最高缺陷检测AUROC(0.804)的指标产生了最差的筛选策略(任务成功率13.3%),而AUROC显著较低(0.638)的指标产生的策略几乎与在真实干净数据上训练的Oracle策略相匹配(90.0% vs. 93.3%)。我们进一步表明,我们评估的七个指标中有五个利用演示时长作为缺陷标签的琐碎代理,这种混淆因素将报告的AUROC膨胀到接近完美的值,并且在控制演示时长后消失。在所有条件下,受污染的基线仅在3.3%的测试中成功,而两种最佳的筛选方法将差距缩小到Oracle上限93.3%的3个百分点以内。我们的结果认为,筛选方法应根据其产生的策略来评估,而不是根据其标记的缺陷,并且任何筛选基准在报告检测准确性之前必须控制演示时长。我们发布了测试平台、所有指标实现和评估流程。

英文摘要

We study whether demonstration-curation metrics that detect defective training episodes also improve the downstream behavior-cloning policy that trains on the curated data. On a contact-rich LIBERO pick-and-place benchmark with a controlled structural defect (early gripper release during the carry phase), we find that the two quantities are sharply decoupled. The metric with the highest defect-detection AUROC (0.804) produces the worst curated policy (13.3% task success), while a metric with a substantially lower AUROC (0.638) produces a policy that nearly matches the oracle trained on ground-truth clean data (90.0% vs. 93.3%). We further show that five of the seven metrics we evaluate exploit episode length as a trivial proxy for the defect label, a confound that inflates reported AUROCs to near-perfect values and disappears once episode length is controlled. Across all conditions, the contaminated baseline succeeds on only 3.3% of rollouts, and the two best curation methods close this to within 3 percentage points of the 93.3% oracle ceiling. Our results argue that curation methods should be evaluated by the policy they produce, not the defects they flag, and that any curation benchmark must control for episode length before reporting detection accuracy. We release the testbed, all metric implementations, and the evaluation pipeline.

2606.10228 2026-06-10 cs.LG cs.AI cs.RO 新提交

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

SHAPO: 面向安全探索的锐度感知策略优化

Kaustubh Mani, Yann Pequignot, Vincent Mai, Liam Paull

AI总结 提出SHAPO算法,通过锐度感知策略更新隐式重加权梯度,放大罕见不安全动作的影响,抑制安全动作的贡献,从而在欠探索区域实现保守行为,提升安全性与任务性能。

详情
Comments
ICLR 2026
AI中文摘要

安全探索是在安全关键领域部署强化学习(RL)智能体的先决条件。在本文中,我们通过认知不确定性的视角来探讨安全探索,其中智能体对参数扰动的敏感性作为高不确定性区域的实际代理。我们提出了锐度感知策略优化(SHAPO),一种锐度感知的策略更新规则,该规则在扰动参数处评估梯度,使得策略更新相对于智能体的认知不确定性变得悲观。分析表明,这种调整隐式地重新加权了策略梯度,放大了罕见不安全动作的影响,同时抑制了已安全动作的贡献,从而在欠探索区域将学习偏向于保守行为。在多个连续控制任务中,我们的方法在安全性和任务性能上均持续优于现有基线,显著扩展了它们的帕累托前沿。

英文摘要

Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically, we show that this adjustment implicitly reweights policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
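
The phrase "evaluates gradients at perturbed parameters" can be illustrated with a generic sharpness-aware step on a toy loss. This is not SHAPO itself: it is the standard two-pass sharpness-aware update applied to a simple quadratic loss to be minimized, and the step size `lr`, radius `rho`, and toy loss are assumptions chosen for illustration.

```python
import math

def loss(w):
    """Toy loss f(w) = sum of squares."""
    return sum(x * x for x in w)

def grad(w):
    """Gradient of the toy loss."""
    return [2.0 * x for x in w]

def sharpness_aware_step(w, lr=0.1, rho=0.05):
    """One update that uses the gradient taken at perturbed parameters."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # First pass: move to the nearby point that increases the loss the most
    # (to first order) within a ball of radius rho.
    perturbed = [x + rho * gx / norm for x, gx in zip(w, g)]
    # Second pass: take the gradient there and apply it to the original w.
    g_perturbed = grad(perturbed)
    return [x - lr * gx for x, gx in zip(w, g_perturbed)]

w = [1.0, -2.0]
w_next = sharpness_aware_step(w)
```

In a policy-gradient setting the same two-pass pattern would be applied to the actor's objective rather than to a fixed quadratic; the reweighting effect described in the abstract comes from the paper's analysis and is not reproduced here.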

2606.10227 2026-06-10 cs.LG New submission

Spatiotemporal Graph Transformer for 3D Neighborhood Interaction and Quality Prediction in Metal Additive Manufacturing

Joyce Karen Pelaez, Siqi Zhang, Hoo Sang Ko

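AI summary: Proposes a spatiotemporal graph transformer that models 3D neighborhood interactions through a weighted network representation and a dual-attention mechanism, significantly improving quality prediction in metal additive manufacturing.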

Details
Comments
Submitted to Journal of Intelligent Manufacturing, 23 pages, 10 figures, 2 tables
English abstract

Metal additive manufacturing enables the fabrication of complex parts, but achieving consistent build quality remains challenging due to interactions induced by repeated layer-wise melting, solidification, and reheating across the 3D build. Advanced sensing provides a great opportunity to collect rich observations of the actual manufacturing process for real-time quality monitoring and control. Yet, existing methods often have limited ability to represent multi-layer interactions and quantify their contributions to quality. In this paper, we develop a novel spatiotemporal graph transformer for modeling 3D neighborhood interactions and learning their effects on build quality in metal additive manufacturing. Specifically, we first introduce a weighted network representation of the manufacturing process, where fusing locations are modeled as nodes, and their spatial- and process-dependent relationships are encoded as edge weights. This representation also enables the integration of multimodal data (e.g., geometric design, process settings, and in-situ sensing data) into a unified structure for downstream learning tasks. Building on this network, we further design a dual-attention graph transformer that captures both within-node feature dependencies and cross-node neighborhood interactions for quality representation learning. Experimental results show that the proposed framework significantly outperforms image-based, sequence-based, and graph-based models in characterizing process-quality relationships. More importantly, the incorporation of cross-layer interactions is critical for improving quality prediction performance. This framework is broadly applicable to other tasks involving network modeling and graph-based representation learning.
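The weighted network representation mentioned in this abstract might look roughly like the sketch below, in which fusing locations are nodes and nearby locations, including those in adjacent layers, are joined by weighted edges. The abstract does not give the weighting scheme, so the Gaussian distance kernel, the `radius` cutoff, and the function name `build_neighborhood_graph` are assumptions. Only the spatial part is sketched; process-dependent terms are omitted.

```python
import math


def build_neighborhood_graph(nodes, radius=1.5, length_scale=1.0):
    """nodes: list of (x, y, layer) fusing locations.
    Returns {(i, j): weight} for every pair i < j within `radius`.
    Weights decay with 3D distance (an assumed Gaussian kernel)."""
    edges = {}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            d = math.dist(nodes[i], nodes[j])
            if d <= radius:
                edges[(i, j)] = math.exp(-((d / length_scale) ** 2))
    return edges


# Toy build: two locations in layer 0, one directly above in layer 1,
# and one location too far away to interact.
locations = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (3, 3, 0)]
for (i, j), w in build_neighborhood_graph(locations).items():
    print(i, j, round(w, 3))
```

In this toy build the edge between nodes 0 and 2 is a cross-layer interaction, the kind of dependency the abstract reports as critical for quality prediction.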