arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.05316 2026-06-05 cs.AI

I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition

Shanhong Liu, Rui Cao, Pai Chet Ng, De Wen Soh

Affiliations: Singapore University of Technology and Design; Singapore Institute of Technology

AI summary: Proposes Query Retrieve Conclude, a zero-shot framework that identifies missing knowledge, retrieves open-web evidence, and synthesizes background knowledge to understand emerging memes and improve detection performance.

Abstract

Multimodal memes are dynamic and often require up-to-date background knowledge for interpretation. Existing methods often overlook such knowledge or rely on the fixed parametric knowledge of pretrained models, which may be incomplete, outdated, or unavailable for emerging memes. We introduce Query Retrieve Conclude, a zero-shot framework that identifies missing knowledge, retrieves open-web evidence, and synthesizes evidence-grounded background knowledge for meme understanding and detection. We also introduce a curated meme understanding benchmark of recent memes from 2024 to 2026 with external background knowledge annotations. Experiments on three meme understanding datasets and five meme detection tasks show that our framework improves knowledge recovery, meme understanding, and downstream detection over zero-shot baselines.
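
The three stages named in the abstract (query, retrieve, conclude) can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: the function names, the callable interfaces, and the toy stand-ins are hypothetical and are not taken from the paper, which would use a multimodal model and a real web search in their place.

```python
from typing import Callable, List


def query_retrieve_conclude(
    meme_text: str,
    propose_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    conclude: Callable[[str, List[str]], str],
) -> str:
    """Run the three stages in order and return synthesized background knowledge."""
    queries = propose_queries(meme_text)      # Query: what knowledge is missing?
    evidence: List[str] = []
    for q in queries:                         # Retrieve: gather evidence per query
        evidence.extend(search(q))
    return conclude(meme_text, evidence)      # Conclude: synthesize from evidence


# Hypothetical toy stand-ins so the sketch runs without a model or network.
def toy_queries(text: str) -> List[str]:
    return [f"origin of '{text}'"]


def toy_search(query: str) -> List[str]:
    return [f"snippet for {query}"]


def toy_conclude(text: str, evidence: List[str]) -> str:
    return f"{text}: background from {len(evidence)} snippet(s)"


background = query_retrieve_conclude(
    "example meme caption", toy_queries, toy_search, toy_conclude
)
```

Passing the stages in as callables keeps the control flow separate from whichever model or search backend fills each role.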

2606.05315 2026-06-05 cs.CL cs.AI

LoRi: Low-Rank Distillation for Implicit Reasoning

Ryan Solgi, Jiayi Tian, Zheng Zhang

Affiliations: University of California-Santa Barbara

AI summary: Proposes a low-rank distillation framework that improves implicit chain-of-thought reasoning in large language models by aligning teacher and student hidden-state reasoning trajectories in a shared low-rank tensor subspace.

Abstract

Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.

2606.05308 2026-06-05 cs.LG cs.AI cs.CL cs.IR stat.AP

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Abhishek Divekar

Affiliations: Amazon

AI summary: Proposes the PRECISE framework, which extends Prediction-Powered Inference to ranking evaluation metrics, obtains unbiased estimates by combining a small set of human annotations with a large set of LLM judgments, and is validated on the ESCI benchmark and in a production system.

Comments: Accepted at ACL 2026 - GEM Workshop

Abstract

With PRECISE, we extend Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.
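
The bias correction behind Prediction-Powered Inference is easiest to see for a plain mean. The sketch below is a minimal illustration, not the PRECISE estimator for Precision@K: it averages the judge's scores on the large unlabeled set and adds a rectifier, the average gap between human labels and judge scores on the small labeled set. The function name and the toy data are assumptions made for this example.

```python
from statistics import mean


def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Prediction-powered estimate of a mean.

    human_labels[i] and judge_on_labeled[i] must refer to the same item.
    """
    if len(human_labels) != len(judge_on_labeled):
        raise ValueError("labeled sets must be paired item by item")
    # Rectifier: how far the judge is off, on average, where truth is known.
    rectifier = mean(h - j for h, j in zip(human_labels, judge_on_labeled))
    # Judge average on the large set, shifted by the measured bias.
    return mean(judge_on_unlabeled) + rectifier


# Toy data: the judge over-reports relevance on one of four labeled items.
human = [1, 0, 1, 0]
judge_labeled = [1, 1, 1, 0]
judge_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1]
estimate = ppi_mean(human, judge_labeled, judge_unlabeled)
```

In this toy case the judge's raw average of 0.75 is pulled down by the measured bias of -0.25, giving 0.5.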

2606.05304 2026-06-05 cs.AI

What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

Chen Huang, Yuhao Wu, Wenxuan Zhang

Affiliations: Singapore University of Technology and Design

AI summary: To address the token inflation and performance degradation caused by free-form communication in multi-agent systems, proposes the PACT protocol, which treats communication as a public state-update problem and compresses messages into compact action-state records, improving the performance-cost trade-off across multiple topologies.

Comments: 13 pages, 5 figures

Abstract

Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter-agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter-agent messages consistently preserve action-centered information needed by downstream agents. Building on this, we propose PACT (Protocolized Action-state Communication and Transmission), which treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands' resolve rate while using 10% fewer tokens per resolved task, and is resolve-neutral on SWE-agent while halving input tokens. Our code is publicly available at https://github.com/iNLP-Lab/PACT.
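
One way to picture "projecting a raw agent output into a compact action-state record" is the sketch below. It is a hypothetical illustration only: the record fields, the `ACTION:`/`STATE:`/`TODO:` line prefixes, and the rule-based `project` function are assumptions for this example and are not the schema or projection used in the PACT repository.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionStateRecord:
    agent: str
    action: str                      # what the agent did
    state_delta: str                 # what changed in the shared state
    open_items: List[str] = field(default_factory=list)


def project(agent: str, raw_output: str) -> ActionStateRecord:
    """Keep only lines carrying the (hypothetical) structured prefixes."""
    action, state, todos = "", "", []
    for line in raw_output.splitlines():
        line = line.strip()
        if line.startswith("ACTION:"):
            action = line[len("ACTION:"):].strip()
        elif line.startswith("STATE:"):
            state = line[len("STATE:"):].strip()
        elif line.startswith("TODO:"):
            todos.append(line[len("TODO:"):].strip())
    return ActionStateRecord(agent, action, state, todos)


raw = """I looked at the failing test for a while and considered several options.
ACTION: patched parse_date in utils.py
STATE: test_parse_date now passes
TODO: run full test suite
"""
# Only the compact record, not the free-form text, enters shared history.
shared_history = [project("coder", raw)]
```

The free-form deliberation in the first line of `raw` is dropped, so downstream agents read only the action taken, the resulting state, and any open items.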

2606.05296 2026-06-05 cs.LG cs.AI

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross

Affiliations: University of Cambridge

AI summary: Proposes Agentic Monte Carlo (AMC), which uses Sequential Monte Carlo to sample from the optimal-policy posterior, enabling RL-style optimization of black-box LLM agents without parameter-level optimization; on the AgentGym benchmark it surpasses prompting baselines and, as test-time compute scales, outperforms GRPO.

Comments: Accepted by ICML 2026

Abstract

LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black-box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black-box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black-box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test-time compute of our method. AMC demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents. Code is available at https://github.com/layer6ai-labs/Agentic-Monte-Carlo
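
The Sequential Monte Carlo idea in the abstract can be sketched on a toy problem: particles are extended by sampling from a fixed prior, reweighted by the change in an estimated value, and resampled. Everything below is an assumption for illustration (the integer random-walk "agent", the hand-written `value` function, the particle count); in AMC the prior is the black-box LLM agent and the value function is learned.

```python
import math
import random


def smc_step(particles, propose, value, rng):
    """One propagate, weight, resample step."""
    # Propagate: extend each particle by sampling from the fixed prior.
    proposed = [propose(p, rng) for p in particles]
    # Weight: incremental weight is the exponentiated change in value.
    weights = [
        math.exp(value(new) - value(old))
        for old, new in zip(particles, proposed)
    ]
    # Resample: multinomial resampling in proportion to the weights.
    return rng.choices(proposed, weights=weights, k=len(proposed))


def propose(position, rng):
    """Toy prior: an unbiased random walk on the integers."""
    return position + rng.choice([-1, 1])


def value(position):
    """Toy value estimate: larger positions are better."""
    return float(position)


rng = random.Random(0)
particles = [0] * 200
for _ in range(10):
    particles = smc_step(particles, propose, value, rng)
mean_position = sum(particles) / len(particles)
```

The prior itself has no drift, yet the resampled population moves toward high-value states, which is the sense in which the agent is steered while the underlying model stays unchanged.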

2606.05290 2026-06-05 cs.CV cs.AI cs.MM

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

Tobia Poppi, Silvia Cappelletti, Sara Sarto, Florian Schiffers, Garin Kessler, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Affiliations: University of Modena and Reggio Emilia; University of Pisa; Amazon Prime Video

AI summary: Proposes the first cross-model safety steering framework, which estimates a safety direction in a source language model and transfers it to a target generator, achieving safety control without target-side unsafe data and without sacrificing generation quality.

Comments: Project page: https://aimagelab.github.io/cross-model-safety-representations/

Abstract

Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach in text-to-image and text-to-video generation across diverse source-target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model using unsafe data, while requiring no target-side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target-side unsafe data.
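
The estimate, transport, apply sequence described in the abstract can be sketched with plain vectors. This is a simplified illustration under assumptions: the difference-of-means direction is one common way to estimate a steering direction and may differ from the paper's estimator, and the alignment matrix `W` is taken as given here, whereas the paper fits its alignment on benign data.

```python
def mean_vec(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def safety_direction(safe, unsafe):
    """Difference of means between safe and unsafe source embeddings."""
    ms, mu = mean_vec(safe), mean_vec(unsafe)
    return [a - b for a, b in zip(ms, mu)]


def transport(W, direction):
    """Map a source-space direction into the target space with a linear map."""
    return [sum(w * x for w, x in zip(row, direction)) for row in W]


def steer(hidden, direction, alpha):
    """Shift a target hidden state along the transported direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-D source space and 3-D target space.
safe = [[1.0, 0.0], [3.0, 0.0]]
unsafe = [[0.0, 1.0], [0.0, 3.0]]
W = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]   # assumed already fitted

d_source = safety_direction(safe, unsafe)
d_target = transport(W, d_source)
steered = steer([0.0, 0.0, 1.0], d_target, alpha=2.0)
```

Note that no unsafe target-side data appears anywhere in the sketch: the unsafe examples are used only in the source space.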

2606.05275 2026-06-05 cs.CV cs.AI

Personal AI Agent for Camera Roll VQA

Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

Affiliations: University of Wisconsin-Madison; Korea University; Adobe Research

AI summary: Proposes the camroll dataset and camroll-agent, an agent that uses hierarchical memory and a tool set to address long-horizon, highly personalized visual question answering over personal camera rolls.

Comments: Project page, code, and demo: https://thaoshibe.github.io/camroll

Abstract

We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and methods for long-context understanding in AI agent systems. Together, the camroll dataset and camroll-agent highlight the gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.

2606.05274 2026-06-05 cs.LG

Anomaly Detection for Electro-Hydrostatic Actuators using LSTM Autoencoder

Nehal Afifi, Abdelmonem Elhendawi, Felix Leitenberger, Nadine Piat, Sven Matthiesen

Affiliations: IPEK - Institute of Product Engineering, Karlsruhe Institute of Technology (KIT), Germany; SUPMICROTECH-ENSMM, France

AI summary: Proposes a reconstruction-based anomaly detection framework using an LSTM autoencoder for electro-hydrostatic actuator sensor signals, reaching 99% average accuracy and a very low false-alarm rate across multiple fault-injection scenarios.

Comments: 8 pages, 6 figures, 3 tables; ESREL 2026 (European Safety and Reliability Conference), accepted paper to be published

Abstract

Electro-Hydrostatic Actuators (EHAs) are widely used in aerospace and industrial systems, where timely detection of sensor anomalies is essential to ensure safe and reliable operation. However, the large volume and high sampling frequency of EHA sensor data pose challenges for accurate and efficient anomaly detection. Conventional statistical and classical machine-learning methods such as Z-score, Interquartile Range (IQR), Median Absolute Deviation (MAD), Isolation Forest, Gaussian Mixture, and k-means often fail to capture the temporal dependencies inherent in EHA signals, resulting in limited detection accuracy and elevated false-alarm rates. Furthermore, systematic evaluations of data-driven anomaly detection approaches for EHA systems remain scarce, particularly under varying operational conditions. This study presents an offline anomaly-detection framework for univariate EHA sensor signals, focusing on temperature and pressure data collected from a controlled test bench. The method employs a reconstruction-based Long Short-Term Memory (LSTM) autoencoder, calibrated and evaluated using validation-set reconstruction-error distributions. Performance is assessed across multiple fault-injection scenarios using accuracy, precision, recall, and F1-score, complemented by sensitivity analyses under varying operating conditions. The LSTM autoencoder achieved an average accuracy of 99.0%, precision up to 100%, recall between 90.2% and 99.6%, and F1-scores from 93.1% to 99.8%, demonstrating high detection sensitivity and a very low false-alarm rate across all evaluated sensors. These results highlight the feasibility of data-driven offline anomaly detection for EHAs. Future work will focus on adapting the developed framework for an online (real-time) environment.
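
Calibrating a reconstruction-based detector on validation errors usually comes down to picking a threshold from the error distribution. The sketch below shows that step only, under assumptions: the quantile rule, its default of 0.99, and the toy error values are illustrative choices, and the per-window reconstruction errors are taken as already computed by a trained autoencoder.

```python
def calibrate_threshold(val_errors, quantile=0.99):
    """Return the validation reconstruction error at the given quantile."""
    if not val_errors:
        raise ValueError("need at least one validation error")
    ordered = sorted(val_errors)
    # Clamp the index so quantiles near 1.0 select the largest error.
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]


def flag_anomalies(errors, threshold):
    """Mark windows whose reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]


# Hypothetical errors from fault-free validation windows.
val_errors = [0.10, 0.20, 0.15, 0.12, 0.18]
threshold = calibrate_threshold(val_errors)

# Hypothetical errors from test windows; the second is far out of range.
flags = flag_anomalies([0.11, 0.50], threshold)
```

A higher quantile lowers the false-alarm rate at the cost of recall, which is the trade-off the reported precision and recall figures reflect.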

2606.05272 2026-06-05 cs.LG

Learning Manifold and Itô Dynamics with Branched Neural Rough Differential Equations

Luke Thompson, Dai Shi, Lequan Lin, Junbin Gao, Andi Han

Affiliations: University of California, Berkeley; Stanford University; University of Toronto

AI summary: Proposes Branched Neural Rough Differential Equations (B-NRDEs), a Hopf-algebraic framework that handles Euclidean Itô dynamics, ordered covariant derivatives on manifolds, and the classical Stratonovich case in a unified way, yielding coarse-step dynamics that exactly preserve manifold constraints and enabling Itô-consistent law matching.

Comments: Accepted at ICML 2026

Abstract

Neural rough differential equations (NRDEs) stay accurate under irregular sampling while taking far fewer integration steps than standard neural differential equations, summarising a finely sampled driver by its log-signature and advancing the hidden state over coarse intervals using the log-ODE method. This efficiency rests on the shuffle algebra, the algebraic counterpart of Stratonovich calculus. This reliance means NRDEs cannot expose the quadratic-variation terms Itô dynamics require, nor the ordered covariant derivatives that govern Itô flows on connection-equipped manifolds. To address this, we introduce Branched Neural Rough Differential Equations (B-NRDEs), a Hopf-algebraic framework that recasts the NRDE log-ODE step as geometric numerical integration on the state-space manifold, matching the driving algebra to the governing calculus: Grossman--Larson rooted trees for Euclidean Itô dynamics, Munthe-Kaas--Wright planar rooted trees for ordered covariant derivatives on manifolds, and the shuffle algebra in the classical Stratonovich case. This yields intrinsic coarse-step dynamics that exactly preserve manifold constraints. Finally, we introduce a branched signature-kernel objective to enable Itô-consistent law matching by making quadratic-variation terms visible during training. On rough Bergomi volatility, sim-to-real $\mathrm{SO}(3)$ dynamics forecasting, and SPD covariance dynamics, B-NRDEs offer a unified, effective approach to stochastic and manifold-valued dynamics beyond the Euclidean--Stratonovich setting.

2606.05266 2026-06-05 cs.LG cs.CC cs.DS math.CO math.PR math.ST stat.TH

Sharp Low-Degree Thresholds for Planted-vs-Planted Testing

Anda Skeja, Daniel Gutiérrez Espinoza, Fiona Skerman, Alexander S. Wein

Affiliations: Department of Mathematics, University of California, Davis

AI summary: Establishes the first sharp thresholds for low-degree polynomial tests in planted-vs-planted settings and proves matching upper and lower bounds for counting communities in the planted submatrix and planted dense subgraph models; the testing threshold coincides exactly with the known low-degree recovery threshold.

Abstract

We establish the first sharp thresholds for low-degree polynomial tests in planted-vs-planted settings, where the goal is to determine with vanishing error which of two structured planted mechanisms generated the observed data. We prove matching low-degree upper and lower bounds for counting communities in the planted submatrix and planted dense subgraph models. The resulting testing threshold coincides, down to the sharp constant, with the known low-degree recovery threshold. In contrast, the task of weak testing, where the goal is to outperform random guessing, does not have a sharp threshold but rather a smooth transition, which we identify. To prove our results, we develop a framework for planted-vs-planted testing that builds on a latent-variable expansion originating in low-degree recovery and employs new methods to identify and prune non-signal contributions.

2606.05265 2026-06-05 cs.LG

Data-efficient flood depth prediction through domain-aware coreset selection and tabular foundation models

Lipai Huang, Adithi Srinath, Manas Singh, Junwei Ma, Ali Mostafavi

Affiliations: Urban Resilience.AI Lab; Zachry Department of Civil and Environmental Engineering, Texas A&M University; Department of Computer Science and Engineering, Texas A&M University; Resilitix Intelligence LLC; Institute for a Disaster Resilient Texas, Texas A&M University

AI summary: Proposes a domain-aware coreset construction pipeline that, combined with a tabular foundation model, achieves flood depth prediction accuracy comparable to supervised models using only 0.7% of the training data and supports cross-watershed transfer.

Abstract

Near-real-time flood depth prediction demands surrogate models that are accurate, fast, and transferable across watersheds. Supervised surrogates can match physics-based simulators in accuracy but need millions of training rows per watershed and cannot extrapolate beyond their original mesh. We propose a domain-aware coreset construction pipeline that conditions a tabular foundation model at inference time. The pipeline stratifies storms by return period and most-affected watershed, then samples hexagons with a target-aware spatial selector. With 0.7% of the per-watershed training pool, the model attains a mean $R^2$ of 0.663 across nine Houston-area watersheds, which is 98.5% of the supervised reference ($R^2$ = 0.673). It transfers to held-out watersheds without task-specific retraining, staying ahead of a coreset-trained supervised baseline. On real storms it exceeds the supervised reference on a far out-of-distribution case and trails it on a mostly in-distribution one. Domain-aware coreset construction lets tabular foundation models deliver data-efficient, watershed-transferable flood predictions without per-watershed training.
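
The 98.5% figure is the ratio of the two reported mean $R^2$ values, which can be checked directly (the subscript labels are names chosen here for readability):

```latex
\frac{R^2_{\text{coreset}}}{R^2_{\text{supervised}}}
  = \frac{0.663}{0.673}
  \approx 0.985
```

In other words, the coreset-conditioned model gives up about 0.010 in mean $R^2$ while using 0.7% of the per-watershed training pool.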

2606.05264 2026-06-05 cs.LG

REGEN: Reference-Guided Synthetic Multivariate Time Series Generation for Forecasting

Moulik Gupta, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

Affiliations: Birla AI Labs, Office of Ananya Birla; Birla Institute of Technology and Science, Pilani; Kalinga Institute of Industrial Technology, Bhubaneswar

AI summary: Proposes ReGeN, a reference-guided generation pipeline that decomposes observed sequences into three interpretable components (periodic backbone, stochastic residuals, and cross-variable dependencies) for controllable synthesis; in low-data settings the generated data can substitute for real data and improve forecasting performance.

Abstract

Training robust multivariate time series forecasting models requires large, diverse corpora, yet many real-world domains provide only a handful of observed sequences. Existing generators fail to resolve this mismatch: prior-based approaches (e.g., CauKer, TimePFN) produce domain-agnostic samples, while data-driven methods (e.g., TimeGAN) treat references as black-box supervision, forfeiting explicit control over periodic structure, local variability, and cross-variable dynamics. We propose ReGeN, a reference-guided generative pipeline that treats observed sequences not as examples to imitate, but as structural scaffolds for controllable synthesis. ReGeN decomposes each reference into three interpretable components: a phase-aligned periodic backbone capturing dominant domain morphology; per-variable stochastic residuals modeled with a deep-kernel Gaussian process; and lag-aware cross-variable dependencies injected through a structural causal model with fitted coupling coefficients. Sampling these components at controllable temperature broadens distributional coverage while preserving domain-grounded structure. We show that ReGeN-generated data consistently substitutes for real sibling data with minimal forecasting degradation, and in strongly periodic domains such as traffic, can outperform the real source itself. We further show that a foundation model pretrained on ReGeN corpora outperforms those pretrained on prior-based and data-driven synthetic alternatives. This suggests that in low-data regimes, how reference data is structurally exploited can matter as much as how much data is available.
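
The first two components of the decomposition (a periodic backbone plus residuals, recombined at a chosen temperature) can be sketched for a single variable. This is a deliberately simplified stand-in under assumptions: phase averaging replaces the paper's phase-aligned backbone, resampling observed residuals replaces the deep-kernel Gaussian process, and cross-variable dependencies are omitted entirely.

```python
import random


def decompose(series, period):
    """Split a series into a periodic backbone and a residual."""
    if period <= 0 or len(series) < period:
        raise ValueError("series must cover at least one full period")
    sums, counts = [0.0] * period, [0] * period
    for i, x in enumerate(series):
        sums[i % period] += x          # accumulate per phase (index mod period)
        counts[i % period] += 1
    profile = [s / c for s, c in zip(sums, counts)]
    backbone = [profile[i % period] for i in range(len(series))]
    residual = [x - b for x, b in zip(series, backbone)]
    return backbone, residual


def synthesize(backbone, residual, temperature, rng):
    """Backbone plus temperature-scaled residuals drawn from the observed ones."""
    return [b + temperature * rng.choice(residual) for b in backbone]


series = [1.0, 3.0, 2.0, 4.0, 0.0, 2.0]
backbone, residual = decompose(series, period=2)
sample = synthesize(backbone, residual, temperature=0.5, rng=random.Random(0))
```

At temperature 0 the sample collapses onto the backbone; raising the temperature widens coverage while the periodic shape is kept.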

2606.05263 2026-06-05 cs.LG cs.AI

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

Renwei Meng

Affiliations: Anhui University (stu.ahu.edu.cn)

AI summary: Proposes the CVT-RL algorithm, which uses policy-conditioned counterfactual contribution estimation and verifiable reward constraints to address unsupported evidence chains, belief drift, and shortcut actions in long-horizon language agents during reasoning and tool use, raising success rates and lowering hacking rates across multiple tasks.

Comments: 16 pages, 6 figures

Abstract

Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We propose CVT-RL, a constrained policy-gradient algorithm with dense verifiable rewards, intervention-validity gating, and a policy-conditioned counterfactual contribution (PCCC) estimator. Deletion, semantic substitution, evidence substitution, and tool-output perturbation define separate controlled interventions; continuations are sampled from a frozen reference policy, and a selection-adjusted doubly robust estimator augments the advantage. Belief control uses only prefix-observable labels, while an augmented Lagrangian constrains unsupported claims, skipped verification, tool tampering, and unsafe calls. On long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL improves average task success from 71.8% for compute-matched non-causal RL and 75.4% for an information-matched counterfactual-process baseline to 78.9%, improves evidence F1 from 78.9 to 82.8 over the information-matched baseline, and reduces measured hacking from 7.2% to 3.9%. Independent human audit estimates 4.6% hacking for CVT-RL versus 8.1% for the information-matched baseline, and adaptive detector-evasion attacks raise hacking only to 7.1%. Stratified bootstrap and mixed-effects tests give p<0.01 after Holm correction for all primary metrics. Carefully scoped counterfactual credit, paired with validity gating, diagnostics, and verifiable constraints, provides a reproducible route toward more reliable long-horizon RL for language agents.

2606.05259 2026-06-05 cs.CV

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao

Affiliations: University of California, Berkeley; Stanford University; University of Toronto; University of Washington; University of Michigan

AI summary: Proposes VideoKR, the first large-scale training corpus of its kind, with 315K video reasoning examples built through a human-in-the-loop, skill-oriented generation pipeline to strengthen knowledge- and reasoning-intensive video understanding; its effectiveness is validated on an expert-annotated benchmark.

Comments: ICML 2026 Spotlight

Abstract

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT$\rightarrow$GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

2606.05257 2026-06-05 cs.LG cs.IR

Scaling Laws for Behavioral Foundation Models over User Event Sequences

Rickard Brüel Gabrielsson

Affiliations: Unbox AI

AI summary: Studies scaling laws for behavioral foundation models over user event sequences; roughly 600 runs show that a small embedder is compute-optimal, that compute-optimal training is data-heavy at low compute, and that the evaluation metric affects the scaling law.

Abstract

Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star}\!\approx\!2\%$ of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.
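
To make the $D/N$ ratio concrete, the sketch below splits a compute budget between parameters and training events. Both constants are assumptions brought in from language-model practice and are not stated in the abstract: the approximation that training compute is about $6ND$ FLOPs, and a ratio of roughly 20 data units per parameter as the commonly quoted Chinchilla heuristic.

```python
import math


def allocate(compute_flops, data_per_param):
    """Split a budget C ~ 6 * N * D between parameters N and events D.

    With D = r * N, the budget gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * data_per_param))
    n_events = data_per_param * n_params
    return n_params, n_events


# Largest budget studied in the abstract, at an assumed ratio of 20.
n_params, n_events = allocate(1e19, 20.0)

# A more data-heavy ratio at the same budget shrinks the model.
n_params_heavy, n_events_heavy = allocate(1e19, 200.0)
```

A data-heavy regime, as the abstract reports at low compute, corresponds to a larger ratio and therefore a smaller model for the same budget.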

2606.05254 2026-06-05 cs.LG cs.CV cs.RO

Flash-WAM: Modality-Aware Distillation for World Action Models

Flash-WAM:面向世界动作模型的模态感知蒸馏

Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen, Weiwei Chen, Xuan Zhang, Geng Yuan, Yanzhi Wang

发表机构 * Northeastern University(东北大学) University of Georgia(佐治亚大学) EmbodyX Inc.(EmbodyX公司)

AI总结 针对世界动作模型联合生成视频和机器人动作时因多模态噪声分布不对称导致蒸馏失效的问题,提出模态感知步蒸馏框架Flash-WAM,通过为不同模态选择匹配噪声机制的参数化方法,实现单步推理并大幅加速。

详情
AI中文摘要

世界动作模型(WAMs)通过迭代扩散联合生成未来视频和机器人动作,在操作基准上表现出色,但需要数十个去噪步骤,这一成本阻碍了实时控制。步蒸馏已成为自然的补救措施,但现成的方法在联合视频-动作设置中失效,因为视频和动作流使用不同的信噪比偏移噪声调度,并以显著不同的边际噪声分布到达训练,这种不对称性是单模态蒸馏方法无法处理的。我们提出Flash-WAM,一个受一致性蒸馏启发的模态感知步蒸馏框架,为每个模态选择一致性函数以匹配其噪声机制:针对动作流的低噪声机制采用线性梯度缩放参数化,针对视频流的高噪声机制采用方差保持参数化,该框架基于对一致性函数族的结构分析,该分析刻画了在一致性边界条件下可实现的梯度缩放。在LingBot-VA上实例化,Flash-WAM将每个模态的推理压缩到单步。在RoboTwin 2.0上,这将每个块延迟从8.1秒减少到NVIDIA L40S上的348毫秒,实现了23倍的加速,从而支持实时推理。Flash-WAM在模拟基准上保持了任务成功率(RoboTwin 2.0上85.5%,LIBERO上95.7%),并大幅恢复了真实世界性能(Unitree G1人形机器人上平均60%),而朴素的一致性蒸馏在相同步预算下降至24%。

英文摘要

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on NVIDIA L40S, a $23{\times}$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\%$ RoboTwin 2.0, $95.7\%$ LIBERO) and substantially recovers real-world performance ($60\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\%$ at the same step budget.

2606.05253 2026-06-05 cs.LG

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

Alpha-RTL: 面向RTL硬件优化的测试时训练

Peilong Zhou, Zhirong Chen, Cangyuan Li, Haoyu Gao, Kaiyan Chang, Ziming Qu, Ying Wang

发表机构 * SKLP, Institute of Computing Technology, Chinese Academy of Sciences(SKLP,计算技术研究所,中国科学院) University of the Chinese Academy of Sciences(中国科学院大学) School of Advanced Interdisciplinary Sciences(先进交叉学科学院)

AI总结 提出TTT-RTL框架,通过测试时强化学习结合EDA反馈(语法检查、仿真和PPA奖励)优化LLM生成的RTL设计,在RTLLM v2.0和工业级C910 FPU单元上分别降低PPA乘积65.1%和ADP 59.4%。

详情
Comments
10 pages, 5 figures
AI中文摘要

大型语言模型(LLM)在生成功能正确的寄存器传输级(RTL)硬件设计方面展现出越来越大的潜力。最近的系统通过集成EDA的强化学习(结合语法、仿真和PPA奖励)进一步改进,但在部署前训练通用RTL生成器,而测试时方法使用冻结策略进行搜索。我们则在测试时执行强化学习,使LLM策略能够针对特定RTL问题适应可执行的EDA反馈。我们提出TTT-RTL,据我们所知,这是首个针对每个设计的测试时训练框架,它闭环了LLM策略与用于RTL优化的EDA流水线。TTT-RTL采样候选实现,通过语法检查和仿真验证它们,使用综合得出的PPA乘积对有效设计进行评分,通过PUCT索引的设计状态池重用高奖励变体,并使用熵策略梯度目标更新策略。为了在稀疏或平台期奖励下稳定策略更新,我们引入了一个自适应KL预算控制器,该控制器使用参考KL、有效样本量和奖励饱和信号来调整熵约束。在Nangate 45nm工艺下的RTLLM v2.0上,TTT-RTL将几何平均PPA乘积相对于参考降低了65.1%,优于最强的已发布冻结策略智能体基线(26.1%)。在Sky130工艺下的工业级XuanTie C910 FPU前导零预测单元上,TTT-RTL实现了59.4%的ADP降低,消融实验证实策略适应、状态重用和KL预算控制各自都有贡献。这些结果表明,带有可执行EDA反馈的测试时训练可以将基于LLM的RTL生成从功能正确性推向物理优化的硬件。

英文摘要

Large language models (LLMs) have shown increasing promise in generating functionally correct register-transfer-level (RTL) hardware designs. Recent systems improve further through EDA-integrated reinforcement learning with syntax, simulation, and PPA rewards, but train a general RTL generator before deployment while test-time approaches search with a frozen policy. We instead perform reinforcement learning at test time, allowing the LLM policy to adapt to executable EDA feedback for the specific RTL problem at hand. We propose TTT-RTL, to our knowledge the first per-design test-time training framework that closes the loop between an LLM policy and an EDA pipeline for RTL optimization. TTT-RTL samples candidate implementations, verifies them through syntax checking and simulation, scores valid designs using synthesis-derived PPA product, reuses high-reward variants through a PUCT-indexed design-state pool, and updates the policy with an entropic policy-gradient objective. To stabilize policy updates under sparse or plateaued rewards, we introduce an adaptive KL-budget controller that adjusts the entropy constraint using reference KL, effective sample size, and reward saturation signals. On RTLLM v2.0 under Nangate 45nm, TTT-RTL reduces the geometric-mean PPA product by 65.1% over the reference, outperforming the strongest published frozen-policy agent baseline at 26.1%. On an industrial XuanTie C910 FPU leading-zero-anticipation unit under Sky130, TTT-RTL achieves a 59.4% ADP reduction, and ablations confirm that policy adaptation, state reuse, and KL-budget control each contribute. These results suggest that test-time training with executable EDA feedback can move LLM-based RTL generation beyond functional correctness toward physically optimized hardware.
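
The abstract mentions reusing high-reward variants through a PUCT-indexed design-state pool. A minimal sketch of PUCT selection follows, using the standard AlphaZero-style score Q + c·P·sqrt(N_total)/(1 + n). How TTT-RTL defines priors, rewards, and the pool is not given in the abstract, so the `DesignState` fields, the constant `c_puct`, and the example values are all assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class DesignState:
    name: str                 # hypothetical label for an RTL variant
    prior: float              # prior probability assigned to this state
    visits: int = 0           # how often the state has been expanded
    total_reward: float = 0.0 # accumulated reward (e.g. from a PPA score)

    @property
    def mean_reward(self) -> float:
        return self.total_reward / self.visits if self.visits else 0.0

def puct_score(state: DesignState, total_visits: int, c_puct: float = 1.5) -> float:
    # Exploitation term plus an exploration bonus that shrinks with visits.
    bonus = c_puct * state.prior * math.sqrt(total_visits) / (1 + state.visits)
    return state.mean_reward + bonus

def select(pool, c_puct: float = 1.5) -> DesignState:
    total_visits = sum(s.visits for s in pool)
    return max(pool, key=lambda s: puct_score(s, total_visits, c_puct))

pool = [
    DesignState("baseline_adder", prior=0.5, visits=10, total_reward=8.0),
    DesignState("pipelined_adder", prior=0.5),
]
print(select(pool).name)
```

With these example numbers the unvisited state wins because its exploration bonus outweighs the visited state's mean reward; as visits accumulate, the choice shifts toward measured reward.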

2606.05248 2026-06-05 cs.RO

Inverse Manipulation through Symbolic Planning and Residual Operator Learning

通过符号规划与残差算子学习的逆操作

Yigit Yildirim, Giuseppe Rauso, Riccardo Caccavale, Alberto Finzi

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出一种混合框架,结合STRIPS-like符号规划与残差强化学习,实现机器人操作任务的逆操作,在ManiSkill3 PushCube任务中验证了将近似符号逆操作转化为物理可行的逆技能。

详情
Comments
To be presented in PlanRob26
AI中文摘要

逆推机器人任务需要的不仅仅是逆转符号状态转换或回放运动轨迹。在机器人操作任务中,在连续交互动力学下,符号逆计划通常无法完全恢复正向执行的效果。我们提出了一种用于逆操作的混合框架,该框架通过软几何谓词从演示中自动提取STRIPS-like算子,并推导出逆技能目标。对于每个提取的算子,我们构建一个逆恢复目标,该目标保留前提条件、恢复删除效果并否定添加效果。任务规划器首先尝试使用可用的动作原语来满足该目标。未解决的符号谓词随后引出一个残差算子学习问题,通过强化学习(RL)解决。我们在ManiSkill3 PushCube任务上评估了该框架。对于正向推动技能,符号逆操作执行粗略的抓取-放置恢复,而残差Soft Actor-Critic策略则细化立方体姿态以满足剩余的逆谓词。我们的结果表明,谓词导出的残差控制可以将近似的符号逆操作转化为物理上可行的逆技能。

英文摘要

Inverting a robotic task requires more than reversing symbolic state transitions or rewinding motor trajectories. In robot manipulation tasks, symbolic inverse plans often fail to fully restore the effects of forward executions under continuous interaction dynamics. We present a hybrid framework for inverse manipulation that derives inverse-skill objectives from STRIPS-like operators automatically extracted from demonstrations through soft geometric predicates. For each extracted operator, we construct an inverse restoration objective that preserves preconditions, restores delete effects, and negates add effects. A task planner first attempts to satisfy this objective using available action primitives. Unresolved symbolic predicates then induce a residual operator learning problem solved through Reinforcement Learning (RL). We evaluate the framework on the ManiSkill3 PushCube task. For a forward pushing skill, the symbolic inverse performs a coarse pick-and-place restoration, while a residual Soft Actor-Critic policy refines the cube pose to satisfy the remaining inverse predicates. Our results show that predicate-derived residual control can turn an approximate symbolic inverse into a physically grounded inverse skill.
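
The inverse restoration objective described above (preserve preconditions, restore delete effects, negate add effects) can be sketched with plain sets. This is an illustration under assumptions, not the authors' implementation: predicates are treated as hard Boolean strings rather than the soft geometric predicates of the paper, and the predicate names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

def inverse_goal(op: Operator):
    """Return (must_hold, must_not_hold) for the inverse of `op`."""
    must_hold = op.preconditions | op.delete_effects  # preserve + restore
    must_not_hold = op.add_effects - must_hold        # negate add effects
    return must_hold, must_not_hold

def unresolved(state: frozenset, goal):
    """Predicates the symbolic planner has not yet satisfied."""
    must_hold, must_not_hold = goal
    missing = must_hold - state
    lingering = {("not", p) for p in must_not_hold & state}
    return missing | lingering

push = Operator(
    name="push_cube",
    preconditions=frozenset({"cube_at_start", "gripper_free"}),
    add_effects=frozenset({"cube_at_goal"}),
    delete_effects=frozenset({"cube_at_start"}),
)
after_push = frozenset({"gripper_free", "cube_at_goal"})
print(unresolved(after_push, inverse_goal(push)))
```

In the framework described by the abstract, whatever remains in the unresolved set after symbolic planning is what the residual RL policy would be trained to satisfy.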

2606.05247 2026-06-05 cs.LG stat.ML

DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables

DiffSlack: 通过可学习松弛变量在非线性不等式约束下学习

Ziqian Wang, Chenxi Fang, Zhen Zhang

发表机构 * State Key Laboratory of Tribology in Advanced Equipment, Tsinghua University(清华大学先进设备摩擦学国家重点实验室) Beijing Key Laboratory of Transformative High-end Manufacturing Equipment and Technology, Department of Mechanical Engineering, Tsinghua University(清华大学机械工程系变革性高端制造装备与技术北京市重点实验室) Automotive Electronics Business Unit, Hirain Inc.(Hirain公司汽车电子事业部)

AI总结 提出DiffSlack,一种可微投影层,通过可学习松弛变量将非线性不等式约束转化为等式,结合阻尼高斯-牛顿投影实现端到端约束满足,在车辆路径规划中取得更高成功率和几何约束满足度。

详情
AI中文摘要

在神经网络中强制执行非线性不等式约束仍然具有挑战性,尤其是当输出受到许多耦合约束时。现有的硬约束方法通常对约束集施加结构限制,或者为大规模非线性问题引入大量计算开销。在此,我们提出DiffSlack,一种用于非线性不等式约束神经预测的可微投影层。DiffSlack将不等式重新表述为带有可学习松弛变量的等式,这些松弛变量作为增强网络输出的一部分被预测,并为阻尼高斯-牛顿投影提供数据驱动的热启动。投影层将原始预测映射到增强可行流形上,同时保持端到端可微性。两阶段课程进一步稳定训练并改善约束满足。我们在具有200个来自碰撞避免、曲率限制和航点间距的非线性不等式约束的车辆路径规划上评估DiffSlack。与现有的基于学习的基线相比,DiffSlack在相当的推理预算下实现了更高的规划成功率和更强的几何约束满足。消融研究进一步表明,硬投影层降低了对监督质量的敏感性。CARLA中的闭环跟踪和真实车辆实验证实了生成轨迹的可执行性。这些结果表明,DiffSlack为工程应用中将硬不等式约束嵌入神经网络提供了一种实用且可扩展的方法。

英文摘要

Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the constraint set or introduce substantial computational overhead for large-scale nonlinear problems. Here, we propose DiffSlack, a differentiable projection layer for nonlinear inequality-constrained neural prediction. DiffSlack reformulates inequalities as equalities with learnable slack variables, which are predicted as part of the augmented network output and provide a data-driven warm start for damped Gauss-Newton projection. The projection layer maps raw predictions onto the augmented feasible manifold while preserving end-to-end differentiability. A two-stage curriculum further stabilizes training and improves constraint satisfaction. We evaluate DiffSlack on vehicle path planning with 200 nonlinear inequality constraints from collision avoidance, curvature limits, and waypoint spacing. Compared with existing learning-based baselines, DiffSlack achieves a higher planning success rate and stronger geometric constraint satisfaction under a comparable inference budget. Ablation studies further show that the hard projection layer reduces sensitivity to supervision quality. Closed-loop tracking in CARLA and real-world vehicle experiments confirms the executability of the generated trajectories. These results demonstrate that DiffSlack provides a practical and scalable approach to embedding hard inequality constraints into neural networks for engineering applications.
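
The slack reformulation in this abstract can be sketched numerically. The sketch assumes the squared-slack form g(x) + s² = 0 for each inequality g(x) ≤ 0 and a damped Gauss-Newton step of the form z ← z − Jᵀ(JJᵀ + λI)⁻¹h(z); the abstract does not specify either detail. It covers only the forward projection, not the differentiable layer or the learned warm start, and the function name `project` is hypothetical.

```python
import numpy as np

def project(x, s, g, jac_g, damping=1e-3, iters=50, tol=1e-10):
    """Move (x, s) toward the manifold h(x, s) = g(x) + s**2 = 0."""
    x = np.asarray(x, dtype=float).copy()
    s = np.asarray(s, dtype=float).copy()
    for _ in range(iters):
        h = g(x) + s**2                               # residual, shape (m,)
        if np.max(np.abs(h)) < tol:
            break
        # Jacobian of h w.r.t. the augmented variable z = (x, s).
        J = np.hstack([jac_g(x), np.diag(2.0 * s)])   # shape (m, n + m)
        step = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(h.size), h)
        x -= step[: x.size]
        s -= step[x.size :]
    return x, s

# Toy constraint: stay outside the unit circle, g(x) = 1 - ||x||^2 <= 0.
g = lambda x: np.array([1.0 - x @ x])
jac_g = lambda x: (-2.0 * x)[None, :]

# x0 plays the role of a raw network prediction, s0 a predicted slack.
x_proj, s_proj = project([0.2, 0.0], [0.1], g, jac_g)
print(x_proj, g(x_proj))
```

Because s² is non-negative, any point on the augmented manifold satisfies the original inequality, which is why projecting onto an equality manifold is enough.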

2606.05236 2026-06-05 cs.RO cs.LG

A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning

一种新型四元数关节缆驱动冗余机械臂配置及其通过FABRIK和残差强化学习的控制

Tanapath Pornthisan, Thanapat Kemthong, Thanyapisit Kangsathien, Pasut Aranchaiya, Paulo Garcia, Viboon Sangveraphunsiri

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出一种4段8关节四元数关节缆驱动冗余机械臂配置,并利用残差强化学习实现比FABRIK算法高三个数量级的位置和方向精度控制。

详情
AI中文摘要

能够穿越任意空间路径的机械臂,特别是在高度阻塞的工作空间中,在多个行业中备受期待。四元数关节最近赋予了一类特定的机械臂——缆驱动冗余机械臂——超越其先前能力的新功能。具体来说,四元数关节减少了每个自由度所需的电机数量,为更紧凑的解决方案铺平了道路。一个持续的挑战是,四元数关节运动学模型的复杂性给机械臂配置的先验决策带来了困难,并对控制系统提出了更高的计算需求,其非线性放大了由于制造不精确而产生的设计与物理实物之间的所有差异。在这里,我们展示了一个4段、8关节的机械臂可以在更低的硬件成本下实现比现有配置更广阔的工作空间,并且残差强化学习在控制此类机械臂方面优于现有最先进的方法——特别是FABRIK算法。我们的结果表明,这种配置比先前设计更有效地利用工作空间,并且残差强化学习在位置和方向精度上比FABRIK高出三个数量级,实现了对新型4段、8关节机械臂的精确控制。此外,控制实现更简单:我们描述了完整的FABRIK控制过程及相应的学习实现。我们的方法适用于新系统的设计,为设计者提供了开发此类机械臂及新型配置相应控制系统的更多工具。

英文摘要

Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion joints have recently empowered a specific class of robotic arms -- cable-driven redundant manipulators -- beyond their prior capabilities. Specifically, quaternion joints reduce the number of required motors per degree of freedom, paving the way for more compact solutions. An ongoing challenge is that the complexity of the kinematic model of quaternion joints complicates a priori decisions on manipulator configurations and imposes higher computational demands on the control system, while its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods -- specifically, the FABRIK algorithm -- on the control of such a manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and the corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.

2606.05234 2026-06-05 cs.RO cs.LG

OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons

OLIVE: 面向高效自适应外骨骼的在线低秩增量学习

Dong Liu, Yanxuan Yu, Ben Lengerich, Tony Geng, Ying Nian Wu

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Columbia University(哥伦比亚大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Rice University(莱斯大学)

AI总结 提出OLIVE框架,通过低秩残差分解和奖励驱动策略梯度实现外骨骼控制的在线个性化自适应,在多种地形上提升步态平滑度、降低努力并增强稳定性。

详情
AI中文摘要

可穿戴外骨骼系统有望恢复身体障碍者的行动能力,但大多数现有控制器依赖于静态步态策略,缺乏适应动态真实环境或个体用户特征的能力。我们提出OLIVE(Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons),一种参数高效的在线自适应框架,在部署期间持续个性化外骨骼控制。OLIVE将控制策略的自适应组件分解为低秩残差形式 $\Delta W = A_t B_t^\top$,秩 $r \ll \min(d,k)$,将在线更新成本从$\mathcal{O}(dk)$降低到$\mathcal{O}(r(d+k))$,同时保持预训练基础控制器 $W_0$ 的稳定性。参数通过奖励塑造的策略梯度更新,完全由身体传感器反馈(EMG、IMU、振动)驱动,消除了对离线参考轨迹的依赖。门控机制根据上下文状态调节个性化强度,动态秩调度器根据地形复杂度调整更新维度(在简单平坦地形上分配最小容量,在要求高的不平坦地形上扩展到更高秩更新),从而在多种活动中实现稳健性能:平地行走、楼梯导航、斜坡和不平坦地形。在可穿戴平台上的实验表明,OLIVE在步态平滑度、努力减少和运动稳定性上比最强基线分别提高了13、22和15个百分点,在大约1,800步内收敛,端到端延迟为7.4毫秒。我们的代码实现可在https://github.com/FastLM/OLIVE获取。

英文摘要

Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real-world environments or individual user characteristics. We present OLIVE (Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons), a parameter-efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. OLIVE decomposes the adaptive component of the control policy into a low-rank residual form $\Delta W = A_t B_t^\top$ with rank $r \ll \min(d,k)$, reducing online update cost from $\mathcal{O}(dk)$ to $\mathcal{O}(r(d+k))$ while preserving the stability of a pretrained base controller $W_0$. Parameters are updated via a reward-shaped policy gradient driven purely by on-body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity -- allocating minimal capacity on simple flat terrain and expanding to higher-rank updates on demanding uneven surfaces -- enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that OLIVE achieves +13, +22, and +15 percentage-point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within approximately 1,800 walking steps at 7.4 ms end-to-end latency. Our code implementation is available at https://github.com/FastLM/OLIVE.
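
The cost reduction from O(dk) to O(r(d+k)) can be made concrete with a small NumPy sketch of a gated low-rank residual on top of a frozen base matrix. The dimensions, the scalar gate, and the zero initialization of one factor are illustrative assumptions; the abstract does not describe the controller at this level, and the policy-gradient update itself is omitted.

```python
import numpy as np

def update_cost(d: int, k: int, r: int):
    """Trainable entries for a full update versus a rank-r residual."""
    return d * k, r * (d + k)

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                       # hypothetical layer sizes and rank

W0 = rng.standard_normal((d, k))          # frozen pretrained base controller
A = np.zeros((d, r))                      # zero init: residual starts at 0
B = 0.01 * rng.standard_normal((k, r))

def forward(x, gate=1.0):
    """Apply W0 plus the gated residual A @ B.T without forming it densely."""
    return x @ W0 + gate * ((x @ A) @ B.T)

x = rng.standard_normal((1, d))
full, low_rank = update_cost(d, k, r)
print(full, low_rank, forward(x).shape)
```

With these sizes only 384 entries are adapted online instead of 2048, and setting `gate` to zero recovers the base controller exactly.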

2606.05232 2026-06-05 cs.LG cs.AI

Differentiable Efficient Operator Search

可微分高效算子搜索

Xiaohuan Pei, Jiyuan Zhang, Yuanfan Guo, Weiguo Feng, Tao Huang, Cho-Jui Hsieh, Chang Xu

发表机构 * The University of Sydney(悉尼大学) ByteDance(字节跳动) Shanghai Jiao Tong University(上海交通大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出可微分高效算子搜索框架,统一解释多种token缩减算子,通过联合搜索缩减位置、保留数量和算子行为,在预算约束下优化多模态模型性能。

详情
AI中文摘要

高效多模态基础模型通常依赖于手动设计的token缩减算子,如剪枝、合并、池化和自适应重加权。尽管这些算子看起来不同,但我们表明它们可以被解释为共享算子空间的不同区域。基于这一观点,我们引入了高效算子搜索,一个可微分框架,联合搜索在哪里缩减token、保留多少token以及如何处理缩减后的token信息。所提出的搜索空间参数化层激活、保留预算和算子行为,而搜索策略在单边预算和成本约束下优化任务性能。该公式将代表性手工设计基线作为特例恢复,并进一步发现超越孤立手动设计的混合算子。在多模态基准上的实验表明,搜索得到的算子在精度-效率权衡上具有竞争力,特别是在激进的视觉token缩减下。这些结果表明,高效多模态推理可以从手动算子设计重新构建为可微分算子搜索。

英文摘要

Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be interpreted as distinct regimes of a shared operator space. Based on this view, we introduce Efficient Operator Search, a differentiable framework that jointly searches where to reduce tokens, how many tokens to retain, and how reduced token information should be processed. The proposed search space parameterizes layer activation, retention budget, and operator behavior, while the search policy optimizes task performance under one-sided budget and cost constraints. This formulation recovers representative hand-designed baselines as special cases and further discovers hybrid operators beyond isolated manual designs. Experiments on multimodal benchmarks show that the searched operators achieve competitive accuracy-efficiency trade-offs, especially under aggressive visual-token reduction. These results suggest that efficient multimodal inference can be reframed from manual operator design to differentiable operator search.

2606.05219 2026-06-05 cs.LG cs.AI

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

大步长梯度下降恢复多路径深度线性网络中的对称性

Hee-Sung Kim, Sungyoon Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文研究大步长离散梯度下降如何通过边缘稳定性振荡使多路径深度线性网络从对称性破坏转向信号重新分配,从而偏好共享表示而非单路径主导。

详情
Comments
ICML 2026
AI中文摘要

最近对多路径深度线性网络的分析使用梯度流预测了一种“赢家通吃”的专业化,其中路径对称性被破坏,每个特征集中在一个路径中。在这项工作中,我们表明具有大步长的离散梯度下降(GD)讲述了一个不同的故事。我们证明单路径解是尖锐最小值,而跨路径分布信号通过一个随路径数量和深度增加而减小的因子降低了尖锐度。因此,虽然早期训练再现了由GF预测的深度驱动的对称性破坏,但随后在稳定性边缘的振荡覆盖了这一趋势,并将网络驱动到重新平衡阶段,其中信号在路径间重新分布。总之,这些结果阐明了深度如何塑造路径竞争,并解释了大步长GD为何偏好共享表示而非持续的单路径主导。

英文摘要

Recent analyses of multi-pathway Deep Linear Networks use Gradient Flow to predict a "winner-takes-all" specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent (GD) with a large step size tells a different story. We prove that single-path solutions are sharp minima, whereas distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Consequently, while early training reproduces the depth-driven symmetry breaking predicted by GF, oscillations at the Edge of Stability subsequently override this tendency and drive the network into a re-balancing phase, where signals redistribute across pathways. Together, these results clarify how depth shapes pathway competition and explain why large-step GD favors shared representations rather than persistent single-pathway dominance.

2606.05201 2026-06-05 cs.LG

State commitment learning: training language models to distinguish computation from memory

状态承诺学习:训练语言模型区分计算与记忆

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang

发表机构 * Alibaba Group(阿里巴巴集团) Tsinghua University(清华大学)

AI总结 提出状态承诺学习目标及反事实擦除强化学习(CERL)方法,通过训练模型区分应保留的持久状态与可丢弃的临时计算,减少推理对隐藏思维的依赖,在数学、长链逻辑、科学问答和多轮工具使用任务中保持准确率的同时降低依赖。

详情
Comments
17 pages
AI中文摘要

推理语言模型不区分用于计算的token与构成持久状态的token:一旦生成,所有隐藏思维都保留在上下文中并影响未来预测。因此,下游推理可能依赖于不应在后续安全依赖的失败尝试、死胡同和私人草稿。我们将此现象重新定义为一种新的训练目标,即状态承诺学习:训练模型明确区分应作为持久状态提交的信息与可丢弃的临时计算。我们定义了一个反事实准则,即持久状态充分性,使得在隐藏思维被擦除后答案是否仍然可用变得可训练和可测量。然后,我们提出反事实擦除强化学习(CERL),它在相同前缀下评估保留隐藏思维的路径和擦除它们的路径,并仅在擦除路径保持正确时给予奖励。我们还引入了擦除依赖协议,并在数学、长链逻辑、科学问答和多轮工具使用评估中表明,CERL在不牺牲准确率的情况下显著降低了答案对隐藏思维的依赖,始终优于仅正确性强化学习和长答案SFT基线。

英文摘要

Reasoning language models do not distinguish tokens used for computation from tokens that constitute persistent state: once generated, all hidden thoughts remain in context and influence future predictions. As a result, downstream reasoning may depend on failed attempts, dead ends, and private scratch work that should not be safely relied on later. We recast this phenomenon as a new training objective, state commitment learning: training models to explicitly distinguish information that should be committed as persistent state from temporary computation that can be discarded. We define a counterfactual criterion, persistent-state sufficiency, which makes it trainable and measurable whether an answer remains usable after hidden thoughts are erased. We then propose Counterfactual Erasure RL (CERL), which evaluates, under the same prefix, both a path that keeps hidden thoughts and a path that erases them, and gives reward only when the erasure path remains correct. We also introduce the Erasure Dependence Protocol and show across mathematics, long-chain logic, scientific QA, and multi-turn tool-use evaluation that CERL substantially reduces answer dependence on hidden thoughts without sacrificing accuracy, consistently outperforming correctness-only RL and long-answer SFT baselines.
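
The reward rule stated above (reward only when the erasure path remains correct) can be sketched as follows. Exact-match correctness, the record layout, and the particular dependence measure are assumptions made for illustration; the abstract does not define the Erasure Dependence Protocol in detail.

```python
def cerl_reward(keep_answer: str, erase_answer: str, reference: str) -> float:
    """Reward 1.0 only if the answer survives erasing hidden thoughts.

    keep_answer is accepted for symmetry with the two-path evaluation but
    does not affect the reward in this simplified version.
    """
    return 1.0 if erase_answer == reference else 0.0

def erasure_dependence(records) -> float:
    """Share of keep-correct cases that become wrong once thoughts are erased."""
    keep_correct = [r for r in records if r["keep"] == r["ref"]]
    if not keep_correct:
        return 0.0
    broken = sum(r["erase"] != r["ref"] for r in keep_correct)
    return broken / len(keep_correct)

records = [
    {"keep": "4", "erase": "4", "ref": "4"},   # answer survives erasure
    {"keep": "7", "erase": "9", "ref": "7"},   # answer depended on thoughts
]
print(erasure_dependence(records))
```

A lower dependence value means more of the needed information was committed to the persistent state rather than left in discarded scratch work.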

2606.05194 2026-06-05 cs.LG cs.AI cs.CL

Temporal Preference Concepts and their Functions in a Large Language Model

时间偏好概念及其在大语言模型中的功能

Ian Rios-Sialer, Shantanu Darveshi, Shuai Jiang, Avigya Paudel, Anastasiia Pronina, Ipshita Bandyopadhyay, Justin Shenk

发表机构 * AISC(AI Safety Camp) SPAR(Supervised Program for Alignment Research)

AI总结 通过因果定位和激活修补,本文发现大语言模型在中间到上层节点编码时间偏好几何结构,且行为分析表明模型对未来折扣比人类更平缓,但偏好不稳定,可通过引导向量调控。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被部署用于需要在近期收益与长期后果之间权衡的决策,然而关于它们如何在内部表示或解决这些权衡,我们知之甚少。在这项工作中,我们通过因果定位了一个蒸馏LLM(Qwen3-4B-Instruct-2507)中时间偏好的底层子图,通过来自梯度归因和激活修补的汇聚证据识别了中上层节点。我们发现时间跨度的几何结构在预期局部层的残差流中被编码。行为分析表明,未干预的LLM对未来折扣的陡峭程度比人类低几倍,但这种偏好跨上下文不稳定,这促使我们进行显式控制而非隐式依赖训练。最后,我们发现有暗示性证据表明引导向量可以改变时间偏好。我们的工作展示了机械可解释性如何使我们更接近对LLM规划和推理方式的可靠控制。

英文摘要

Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason.

2606.05186 2026-06-05 cs.LG cs.CL

Staged Factorial Screening for Budget-Constrained Micro-Pretraining

预算受限的微预训练中的分阶段因子筛选

Felipe Chavarro Polania

发表机构 * Hewlett Packard Enterprise(惠普企业)

AI总结 针对预算受限的微预训练,提出分阶段分数因子设计方法,通过短时筛选识别高惩罚方向并确认有效锚点,在共享加速器上实现高效配方筛选。

详情
Comments
23 pages, 4 figures
AI中文摘要

预算受限的微预训练通常需要在共享加速器上对许多候选配方进行分诊,然后才能花费更大的搜索预算。我们研究了分阶段分数因子工作流是否能在这种设置中恢复稳定的早期效应结构。在固定的自动研究衍生的单GPU训练循环上,我们运行了613个实验,包括在2、5和10分钟时的试点和后续筛选;5和10分钟时的完整16条件种子重运行;有针对性的种子锚点检查;同主机贪婪和匹配成本随机基线;一个60分钟的桥接包;以及通过24小时的有界Windows A100和Linux L40S锚点延续。总批次、深度和宽度的主要惩罚在短预算时最大,并随预算增加而放松。在预先声明的种子全屏系列中,D、A、B和C在预算内Benjamini-Hochberg校正后,在5和10分钟时保留非零估计,而E则没有。随机搜索可以在这个32条件空间中达到强当前最优,但反复在相同的低惩罚区域,且没有因子归因。60分钟桥接锚点具有最低均值,尽管该包没有将工作流改进与更大桥接模型的能力优势分开。在两个主机上的有界12小时和24小时三锚点延续中,桥接具有最低样本均值,而非桥接顺序保持主机敏感。因此,我们提出了一个有界方法结果:使用短设计筛选来识别高惩罚方向,在重复运行下确认有希望的锚点,并在缩减空间内局部细化。证据支持在24小时内两个主机上的以桥接为中心的推荐,而不是硬件不变的排名或通用超参数优化的优越性。

英文摘要

Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.

2606.05183 2026-06-05 cs.CL cs.AI cs.HC

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

粒度差距:Gemini 模型中谄媚行为的多维纵向审计

Patrick Keough

发表机构 * Independent Researcher(独立研究者)

AI总结 通过多维度分级评估(Likert 0-4),揭示 Gemini 模型在连续尺度上的谄媚行为,发现粗粒度二值指标掩盖了大量社会顺从行为,且代际进步非单调,存在对齐税(谄媚与真实性负相关)。

详情
Comments
16 pages, 9 figures
AI中文摘要

大型语言模型越来越多地被部署为高风险顾问,但标准对齐基准将谄媚视为二值失败模式。我们引入粒度差距:粗粒度二值指标掩盖了大量社会顺从行为,即模型屈服于用户框架、验证可疑前提或软化事实纠正而不产生明显错误输出。我们在三个防护栏条件(控制、简单、协议)下,对跨越 2.0、2.5 和 3.0 代的六个 Gemini 变体在 73 个对抗性提示上进行了评估,得到 8,830 个分级响应。使用经过人类标注者三人组验证的 0-4 Likert 量表(Fleiss kappa = 0.71;与 AI 共识的 Cohen kappa = 0.78;95.9% 二值准确率,100% 特异性),我们将谄媚量化为连续而非二值。出现三个发现。第一,27.2% 的响应包含大量谄媚内容(Likert >= 2.0),22.7% 达到中度或严重水平(>= 3.0),而二值胜率框架仅报告适度的失败率;粗粒度指标仅解释 29% 的分级方差。第二,代际进步是非单调的:Gen 2.5 相对于 Gen 2.0(1.90)和 Gen 3.0(2.01)急剧倒退(平均控制 2.64),且 Gen 2.5 呈现逆缩放(Pro 1.94 比 Flash 1.71 更差),而 Gen 3.0 恢复了标准缩放。第三,我们记录了对齐税:谄媚与真实性之间的 Spearman rho = -0.63,表明社会顺从以事实准确性为代价。自我中心式验证(Egotistical Validation)提示是一个谄媚陷阱(平均 3.27),几乎是不道德提议(Unethical Proposals,1.72)的两倍。简单防护栏在旗舰模型上优于复杂的协议脚手架,但蒸馏后的 Gen 3.0 Flash 反转了这一点,表明小模型可能在结构上需要思维链脚手架。我们发布了数据集和评分标准以支持连续谄媚测量。

英文摘要

Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.

2606.05182 2026-06-05 cs.CL cs.IR

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

LANTERN: 用于长上下文LLM对话的分层存档与时间情节检索网络

Rahul Subramani

发表机构 * Cisco Systems, Inc.(思科系统公司)

AI总结 提出LANTERN,一种轻量级记忆层,通过混合检索主动存档对话轮次并恢复压缩后丢失的细节,无需LLM调用且延迟低于25ms,在94个多轮对话中恢复78.3%的可验证事实,优于MemGPT基线。

详情
AI中文摘要

当对话历史被压缩以适应有限的上下文窗口时,大型语言模型会丢弃关键细节。我们提出了LANTERN(分层存档与时间情节检索网络),一种轻量级记忆层,它主动存档每一轮对话,并通过混合检索在压缩后恢复相关细节——无需任何LLM调用,每轮延迟低于25ms。在94个真实多轮对话(1,894个真实事实,人工验证kappa=0.81)上,LANTERN-Rerank恢复了78.3%因压缩而丢失的可验证事实,显著优于忠实复现的MemGPT的LLM驱动提取与多查询搜索流水线(72.4%;Wilcoxon p<0.0001,95% CI [+3.1, +8.6] pp,d=0.43),且推理成本极低。即使没有重排序器,基础LANTERN在零LLM调用的情况下也能匹配或超越该LLM驱动基线(p=0.005)。当四个生产级LLM使用LANTERN恢复的上下文回答事实性问题时,准确率平均提升8.4个百分点(每个模型单独Wilcoxon p<0.05),表明恢复的上下文在不同模型架构上均有用。我们发布了完整的评估框架——包括配对显著性检验、失败分析、事实类型分层和压缩鲁棒性分析——以支持可重复性和未来工作。

英文摘要

Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25ms of latency per turn. On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p<0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls. When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p<0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.

2606.05181 2026-06-05 cs.CL cs.AI

Multi-Granularity Reasoning for Natural Language Inference

自然语言推理的多粒度推理

Chunling Xi, Di Liang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出多粒度推理网络(MGRN),通过分层语义特征交互模拟人类认知过程,在多个基准上超越强基线模型。

详情
AI中文摘要

自然语言推理(NLI)是自然语言理解中的一项基本任务,需要确定前提和假设之间的逻辑关系。尽管基于Transformer的预训练模型取得了显著成功,但大多数现有方法主要依赖最后一层的token表示,这通常不足以捕捉有效推理所需的复杂分层语义交互。特别是,细粒度的词汇线索、短语组合和更高层次的上下文语义通常在单一表示空间中被纠缠或稀释。为了解决这些限制,我们提出了一种新颖的\emph{多粒度推理网络}(MGRN),它在交互式推理空间中显式利用分层语义特征。所提出的框架模拟了人类语言理解的认知过程,该过程自然地从浅层词汇匹配进展到更深层次的语义抽象和逻辑推理。通过以渐进和结构化的方式整合多个粒度的语义信息,MGRN能够揭示自然语言表达背后的复杂语义关系。在多个公开基准上的大量实验表明,MGRN始终优于强基线模型,验证了所提出方法的有效性和鲁棒性。

英文摘要

Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel \emph{Multi-Granularity Reasoning Network} (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.

2606.05180 2026-06-05 cs.CL cs.AI

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment


Ivo Bueno, Babette Bühler, Philipp Stark, Tim Fütterer, Ulrich Trautwein, Dorottya Demszky, Heather Hill, Enkelejda Kasneci

Affiliations * Technical University of Munich; Munich Center for Machine Learning (MCML); Lund University; University of Tübingen; Stanford Graduate School of Education; Harvard Graduate School of Education

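AI summary: Proposes a framework that combines SHAP attributions with LLM-generated rationales to explain rubric-based scoring models, and evaluates the faithfulness and transferability of these explanations on classroom transcript data.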

Details
Comments
Accepted to Findings of ACL 2026

English abstract

Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.
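
Illustrative sketch (not from the paper): the abstract describes sentence-level Shapley attributions checked with deletion-based tests. The toy example below shows the general idea under loud assumptions. `toy_score` is a hypothetical stand-in for a scoring model, the Shapley values are computed exactly by enumerating sentence subsets (feasible only for a handful of sentences, whereas SHAP tooling typically approximates them), and `deletion_shift` is one simple reading of a deletion test. None of these names come from the authors' implementation.

```python
from itertools import combinations
from math import factorial


def toy_score(sentences):
    """Hypothetical scorer: counts sentences containing a feedback-like cue.
    This is an assumption for illustration, not the paper's model."""
    cues = ("why", "explain")
    return float(sum(any(c in s.lower() for c in cues) for s in sentences))


def shapley_values(sentences, score):
    """Exact Shapley value of each sentence with respect to score()."""
    n = len(sentences)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # size of the coalition that excludes sentence i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [sentences[j] for j in subset]
                with_i = [sentences[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score(with_i) - score(without_i))
    return values


def deletion_shift(sentences, score, index):
    """Change in the score when the sentence at `index` is removed."""
    kept = [s for j, s in enumerate(sentences) if j != index]
    return score(sentences) - score(kept)


segment = [
    "Open your books.",
    "Can you explain how you got that answer?",
    "Sit down please.",
]
attributions = shapley_values(segment, toy_score)
top = max(range(len(segment)), key=lambda j: attributions[j])
top_shift = deletion_shift(segment, toy_score, top)
other_shift = deletion_shift(segment, toy_score, 0)
```

In this toy setting, an attribution counts as faithful when deleting the highest-attributed sentence moves the score more than deleting a low-attributed one. This is the comparison the abstract reports between SHAP and LLM-generated rationales, although the paper's exact protocol may differ.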