arXivDaily: arXiv daily academic digest, updated Monday through Friday

2606.18465 2026-06-18 cs.LG cs.AI New submission

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Truong Xuan Khanh

Affiliations * H&K Research Studio, Clevix LLC

AI summary By fixing the weight norm and varying the output temperature, the paper finds that the grokking delay is determined mainly by the logit scale; the weight norm acts only indirectly, through its effect on the logit scale.

Comments 16 pages, 10 tables and 4 figures. Code and data to reproduce all numbers, tables, and figures: https://github.com/ClevixLab/grokking-logit-scale

Abstract

Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying only an output temperature, we slide the grokking delay across its entire norm-induced range under cross-entropy; matching the effective logit scale back to baseline recovers about 85% of the delay at two moduli. Across a grid of norms and temperatures the delay collapses onto the logit scale alone ($R^2 = 0.97$), with the norm adding 1-2% beyond it. The effect is loss-dependent: under mean-squared error the logit scale is pinned and the norm acts through a different route. A memorization control, a float64 softmax-collapse audit, and a no-LayerNorm transformer point to the same channel. Forking arms from one identical state, the delay follows the held norm value and not the clamp operation, which closes a rescaling-artifact concern. The proximal variable is the logit scale and the softmax saturation it drives; the weight norm is only an upstream handle. All numbers, tables, and figures reproduce from released code and data.
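The link between logit scale and softmax saturation that the abstract describes can be illustrated with a small sketch. This is not the paper's code: the logits, the `scale` values, and the helper name `loss_and_grad_mass` are hypothetical, and `scale` is taken to be the inverse of the output temperature. The sketch only shows that multiplying fixed logits by a larger scale drives the cross-entropy loss and its gradient toward zero on an already-fitted example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad_mass(logits, target, scale):
    """Cross-entropy loss and L1 norm of its gradient w.r.t. the scaled logits.

    `scale` multiplies the logits (assumed here to equal 1 / temperature).
    """
    probs = softmax([scale * z for z in logits])
    loss = -math.log(probs[target])
    # For cross-entropy, d(loss)/d(scaled logit_i) = p_i - 1[i == target].
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, sum(abs(g) for g in grad)

# Hypothetical logits for one memorized training example (class 0 is correct).
logits = [2.0, 0.5, -1.0]
scales = (0.5, 1.0, 4.0, 16.0)
results = {s: loss_and_grad_mass(logits, 0, s) for s in scales}
for s in scales:
    loss, grad_mass = results[s]
    print(f"scale={s:5.1f}  loss={loss:.3e}  grad_mass={grad_mass:.3e}")
```

At the largest scale the softmax is nearly one-hot, so almost no gradient signal remains; this is the saturation regime the abstract names as the proximal variable.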

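2606.18457 2026-06-18 cs.LG New submission

Task-Restricted Symmetries in Recurrent Weight Space

Simon Dräger

Affiliations * Salk Institute for Biological Studies, La Jolla, CA, USA

AI summary Analyzes one-layer tanh RNNs in ordered real Schur coordinates and finds functional redundancy in the recurrent matrix on the task distribution: specific nonnormal Schur couplings can be removed without hurting performance, revealing task-restricted approximate functional invariances.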

Comments 6 pages, 2 figures. Accepted at the ICML 2026 Workshop on Weight-Space Symmetries

Abstract

Recurrent networks can contain substantial functional redundancy in weight space: changing a recurrent matrix may leave the input-output rollout nearly unchanged on a task distribution, while similar-scale changes can destroy the same behavior. We study this redundancy in one-layer tanh RNNs using ordered real Schur coordinates. The Schur form separates spectral blocks from directed nonnormal couplings, giving a diagnostic basis for structured ablations that keep the input and readout maps fixed. In a fixed-length copy task, selected nonnormal Schur couplings can be removed with little loss in some trained solutions, whereas other couplings are necessary for accurate autonomous replay. Across flip-flop, sine generation, and context-dependent integration, the loss-preserving ablation profile varies across tasks and trained solutions. These results identify candidate approximate functional invariances, not universal symmetries of recurrent weight space. Schur-coordinate ablations provide a practical diagnostic for which structured perturbations preserve a trained recurrent solution and which ones disrupt its computation.

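2606.18454 2026-06-18 cs.LG cs.AI New submission

Veriphi: Attack-Guided Neural Network Verification with Dataset-Dependent Training Methods

Pratik Deshmukh, Kartik Arya, Vasili Savin

Affiliations * TU Wien

AI summary Proposes Veriphi, a system that combines fast adversarial attacks with formal bound verification via α,β-CROWN. Experiments show that the effectiveness of a training method depends on the dataset: IBP works on MNIST but fails on CIFAR-10, PGD adversarial training reaches 94% certified accuracy at small perturbations, and verification runs 5x faster.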

Comments 17 Pages, 8 Figures

Abstract

We present Veriphi, a GPU-accelerated neural network verification system that combines fast adversarial attacks with formal bound certification using alpha,beta-CROWN methods. Through systematic experiments on MNIST and CIFAR-10 using three training methodologies (standard, adversarial, certified), we demonstrate that training method effectiveness is fundamentally dataset-dependent. Interval Bound Propagation (IBP) achieves 78% certified accuracy on simple MNIST (784 dimensions) but provides negligible certification performance on the more complex CIFAR-10 dataset, where PGD adversarial training dominates with 94% certification at small perturbations. We achieve 5x verification speedup through attack-guided falsification and scale our approach to production-size models (105.8M parameters) for real-world aerospace logistics optimization. Our results challenge the assumption that certified training universally outperforms adversarial training, showing context matters critically for verification strategy selection.
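For readers unfamiliar with Interval Bound Propagation, the following is a minimal sketch of the general technique on a toy two-layer ReLU network. It is not Veriphi's implementation and does not reproduce α,β-CROWN; the network weights, the input, and the function names (`ibp_affine`, `ibp_relu`, `ibp_certified`) are made up for illustration. An input box of radius `eps` is pushed through each layer, and the prediction is certified only if the true class's lower bound beats every other class's upper bound.

```python
def ibp_affine(lower, upper, weights, bias):
    """Propagate an axis-aligned box through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo, hi = b, b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight keeps the bound order; a negative one swaps it.
            lo += w * (l if w >= 0 else u)
            hi += w * (u if w >= 0 else l)
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def ibp_relu(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

def ibp_certified(x, eps, layers, true_class):
    """True if every input within L-infinity distance eps keeps `true_class` on top."""
    lo = [v - eps for v in x]
    hi = [v + eps for v in x]
    for index, (weights, bias) in enumerate(layers):
        lo, hi = ibp_affine(lo, hi, weights, bias)
        if index < len(layers) - 1:  # no activation after the output layer
            lo, hi = ibp_relu(lo, hi)
    return all(lo[true_class] > hi[j] for j in range(len(lo)) if j != true_class)

# Hypothetical toy network: identity hidden layer, then a difference readout.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
x = [1.0, 0.0]
print(ibp_certified(x, 0.1, layers, true_class=0))  # small box: certified
print(ibp_certified(x, 0.6, layers, true_class=0))  # large box: bounds overlap
```

The check is sound but loose: a `False` result means the bounds overlap, not that an adversarial example exists, which is one reason attack-guided falsification is a natural complement.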

2606.18453 2026-06-18 cs.CL New submission

LLM Parameters for Math Across Languages: Shared or Separate?

Behzad Shomali, Luisa Victor, Tim Selbach, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali, Markus Frey

Affiliations * Lamarr Institute; University of Bonn; Fraunhofer IAIS

AI summary A cross-lingual mechanistic analysis finds that math-related parameters in multilingual LLMs overlap partially across languages, with the overlap concentrated in intermediate layers; English yields the largest parameter set and lower-resource languages yield smaller ones.

Comments 5 pages. Accepted at ACL Student Research Workshop (SRW) 2026. Code: https://github.com/luisavictor/math-across-languages Translated Datasets: https://huggingface.co/math-across-languages Webpage: https://math-across-languages.github.io

Abstract

Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.

2606.18451 2026-06-18 cs.LG New submission

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Ali Asaria, Tony Salomone, Deep Gandhi

Affiliations * Transformer Lab

AI summary Proposes a reproducible VLM-judge protocol for evaluating single-image 3D mesh quality and finds that cheap proxies such as geometry validity and render-CLIP cannot substitute for the VLM judge.

Abstract

Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible VLM-judge evaluation protocol: a fixed 24-view headless render rig, two independent vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree substantially with each other (Cohen's kappa = 0.66), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do not substitute for it. Geometry validity is only a weak signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is at chance. A learned Bradley-Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also bimodal: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator under the conditions tested (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.
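Two ingredients of the protocol above, the order-consistency filter and Cohen's kappa, are simple enough to sketch. The code below is an illustration under assumptions, not the authors' pipeline: the judge is modeled as a hypothetical callable that returns "first" or "second", and the mesh names and quality scores are invented.

```python
from collections import Counter

def order_consistent_verdict(judge, mesh_a, mesh_b):
    """Query both presentation orders; keep the winner only if they agree."""
    verdict_ab = judge(mesh_a, mesh_b)  # mesh_a is shown first
    verdict_ba = judge(mesh_b, mesh_a)  # mesh_b is shown first
    winner_ab = mesh_a if verdict_ab == "first" else mesh_b
    winner_ba = mesh_b if verdict_ba == "first" else mesh_a
    return winner_ab if winner_ab == winner_ba else None

def cohens_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two raters on the same items.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical judges: one tracks quality, one always prefers the first slot.
quality = {"mesh_clean": 0.9, "mesh_holes": 0.2}

def careful_judge(first, second):
    return "first" if quality[first] >= quality[second] else "second"

def position_biased_judge(first, second):
    return "first"

print(order_consistent_verdict(careful_judge, "mesh_clean", "mesh_holes"))
print(order_consistent_verdict(position_biased_judge, "mesh_clean", "mesh_holes"))
print(cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```

A purely position-biased judge contradicts itself when the order is swapped, so its verdict is discarded rather than counted.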

2606.18448 2026-06-18 cs.CL New submission

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

Affiliations * UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab

AI summary Proposes VISUALSKILL, a hierarchical multimodal skill library built by combining documentation with UI exploration; it raises the agent's average score on CUA benchmarks by 15.3 points, and the multimodal skills outperform text-only skills.

Abstract

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.

2606.18444 2026-06-18 cs.LG cs.AI New submission

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar, Sunil Khemka, Piyush Ranjan

Affiliations * Unysis Truist Banks Infinity Tech Group Technical Product; Fairfax, USA; Atlanta, USA; Sunnyvale, USA; Persistent Systems IEEE Vice Chair AeroSpace Chapter; Discover Financial Services; Edison, USA

AI summary Proposes the TMR-GGNN framework, which addresses class imbalance and evolving fraud patterns by modeling heterogeneous entity interactions over temporal windows, building a dynamic multi-relational graph, and using a time-aware attention mechanism and a contrastive-learning decoder, trained with a composite InfoNCE and Focal Loss objective.

Comments 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON), 7 pages

Abstract

In recent years, credit card fraud detection has faced significant challenges due to highly imbalanced data, evolving fraud patterns, and complex relational structures among transaction entities. To address these issues, this research proposes a novel framework called Time-Aware Multi-Relational Guided Graph Neural Network (TMR-GGNN). Specifically, the proposed TMR-GGNN extends the encoder-decoder Graph Neural Network (GNN) architecture by modeling heterogeneous interactions across customers, merchants, devices, and IPs over temporal windows. The TMR-GGNN approach then constructs a dynamic, multi-relational graph and incorporates a time-aware relational attention mechanism within the encoder to adaptively weigh transaction relevance based on temporal proximity and semantic context. The decoder employs a contrastive learning module to distinguish between real and synthesized transaction patterns while improving the model's generalization to rare fraud cases. Additionally, to effectively manage severe class imbalance and emphasize discriminative learning, a composite loss function combining an Information Noise Contrastive Estimation (InfoNCE) based contrastive loss with Focal Loss is introduced. This integration helps improve fraud identification while reducing false negatives.
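The composite objective mentioned above can be sketched from the standard definitions of the two losses. The abstract does not say how the terms are weighted or which embeddings feed the contrastive term, so the additive form, the `contrastive_weight` parameter, and all default values below are assumptions made for illustration only.

```python
import math

def focal_loss(p_fraud, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example; p_fraud is the predicted fraud probability."""
    p_t = p_fraud if label == 1 else 1.0 - p_fraud
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples the model already classifies well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: cross-entropy of picking the positive among positive + negatives."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]

def composite_loss(p_fraud, label, anchor, positive, negatives, contrastive_weight=0.5):
    """Hypothetical additive combination; the weighting is an assumption."""
    contrastive = info_nce(anchor, positive, negatives)
    return focal_loss(p_fraud, label) + contrastive_weight * contrastive

# Toy 2-D embeddings: the positive is aligned with the anchor, the negatives are not.
anchor, positive = [1.0, 0.0], [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
print(info_nce(anchor, positive, negatives))
print(composite_loss(0.1, 1, anchor, positive, negatives))
```

The focal term grows sharply when a true fraud case receives a low predicted probability, which is how it counteracts class imbalance.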

2606.18441 2026-06-18 cs.CV New submission

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

Affiliations * School of Information Science and Engineering, Lanzhou University; Beijing University of Posts and Telecommunications; Cloud and AI BU, Huawei; School of Computing, National University of Singapore

AI summary Proposes CF-GRPO, a process-level reward framework that needs no temporal annotations: it builds a consensus frame prior from intrinsic video cues and uses the Consensus Frame Reward to align the model's frame use with that prior, improving video reasoning performance.

Abstract

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.18439 2026-06-18 cs.CV cs.RO New submission

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

Affiliations * University of Pennsylvania; University of California, Irvine; Nanyang Technological University

AI summary Proposes RegimeVGGT, which removes redundancy through layer-wise U-shaped compression (Saliency-Guided Banded Merging and Selectively Protected K/V Downsampling), achieving a 6.7x speedup while preserving reconstruction quality.

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

Abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18431 2026-06-18 cs.LG cs.DC New submission

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta

Affiliations * Cornell University, Computer Science Department; Cornell University, Electrical and Computer Engineering Department; Cornell University, Operations Research and Information Engineering Department; Microsoft Azure System Research; NVIDIA Corporation

AI summary Addresses the fragility of length-prediction-based scheduling for LLM inference under distribution shift and its weak control of tail latency. Proposes a prediction-free, distribution-aware scheduling framework that applies soft priority boosting driven by lightweight statistical signals, combined with cache-aware preemption; across workloads it reduces P99 TTLT by 35-50% and TTFT by 34-47%.

Journal ref Forty-Third International Conference on Machine Learning (2026)

Abstract

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

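2606.18430 2026-06-18 cs.LG cs.CR New submission

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

Chih-Duo Hong, Yen-Pang Chen, Fang Yu

Affiliations * National Chengchi University

AI summary Proposes a signature filtering module that removes signature tokens that interfere with watermark detection, raising detection rates in weak-signal and low-entropy settings from 8-31% to 78-99% while keeping the false positive rate controllable.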

Abstract

Statistical watermarks help organizations attribute large language model (LLM) outputs, yet existing detectors often struggle when watermark signals are weak, texts are repetitive, or watermarks are edited. We propose signature filtering, a detection-time module that enhances watermark detection without modifying watermark embedding and text generation. It learns a small set of "signature" tokens whose presence makes watermark tests unreliable, and removes these tokens before detection. The signatures are obtained by solving a mixed-integer linear program on a small training set, with constraints that maximize the true positive rate. We additionally derive finite-sample and asymptotic bounds under several attacker models (color-blind, color-adaptive, and distributionally correlated). On four well-known watermark families (Kgw, Sweet, Unigram, Exp), four benchmark corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b, Opt-6.7b, Llama2-13b, Llama3.1-8b, Qwen2.5-14b, Phi-3-medium-14b), 2- and 3-gram signatures raise detection rates in weak-signal and low-entropy settings from 8-31% without filtering to 78-99% with filtering, while keeping false positives controllable and often negligible. In stress tests where we scramble sentences and perturb 25-50% of tokens by dilution, deletions, and substitutions, 2-gram filters for Kgw-style watermarks preserve most of the clean-text detection gains, often matching or outperforming the advanced WinMax watermark detector. Signature filtering thus provides a simple, scalable, and model-agnostic add-on to strengthen watermark-based provenance checks for LLM text in information processing workflows.
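The idea of dropping signature n-grams before a watermark test can be illustrated with a toy example. This is one possible reading of the abstract, not the authors' method: the signature here is hand-picked rather than learned by a mixed-integer linear program, the watermark is a toy fixed green list (in the spirit of a Unigram-style scheme), and the vocabulary, `GREEN`, and `SIGNATURES` are invented. The detector is the usual one-proportion z-test on the green-token count.

```python
import math

def z_score(green_flags, gamma=0.5):
    """One-proportion z-test: is the green-token count above the chance rate gamma?"""
    total = len(green_flags)
    if total == 0:
        raise ValueError("no tokens left to test")
    greens = sum(green_flags)
    return (greens - gamma * total) / math.sqrt(total * gamma * (1.0 - gamma))

def filter_signatures(tokens, signatures):
    """Remove every token covered by an occurrence of a signature n-gram."""
    drop = set()
    for signature in signatures:
        n = len(signature)
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) == tuple(signature):
                drop.update(range(start, start + n))
    return [tok for i, tok in enumerate(tokens) if i not in drop]

# Toy setup: a fixed green list and a repetitive low-entropy 2-gram.
GREEN = {"alpha", "beta", "gamma"}
SIGNATURES = [("def", "main")]
tokens = ["alpha", "def", "main", "beta", "def", "main",
          "gamma", "alpha", "def", "main"]

z_before = z_score([tok in GREEN for tok in tokens])
kept = filter_signatures(tokens, SIGNATURES)
z_after = z_score([tok in GREEN for tok in kept])
print(f"z before filtering: {z_before:.3f}")
print(f"z after filtering:  {z_after:.3f}")
```

In this toy case the forced, repetitive tokens dilute the green fraction; removing them lets the remaining tokens carry the test.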

2606.18429 2026-06-18 cs.CV cs.AI cs.LG New submission

CAOA -- Completion-Assisted Object-CAD Alignment

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

Affiliations * University at Albany

AI summary Proposes CAOA, which combines semantically aware point cloud completion with symmetry-aware relative pose estimation, achieves a 17% accuracy improvement on Scan2CAD, and releases the S2C-Completion dataset.

Comments GitHub: https://github.com/MinhasKamal/CAOA

Journal ref Thirteenth International Conference on 3D Vision (3DV), 2026

Abstract

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.

2606.18426 2026-06-18 cs.RO New submission

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

Affiliations * University of Maryland, College Park

AI summary Proposes VEGA, which reconstructs scene geometry from unlabeled egocentric video to generate obstacle-aware trajectories and trains a flow-matching VLA navigation policy on them; it reduces collisions by 33.0% on VEGA-Bench and improves real-world success by at least 150.0%.

Abstract

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints) and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGA-Bench, while improving success by at least 150.0%, reducing collisions by at least 66.7%, and improving obstacle clearance by at least 60.0% in real-world trials. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.

2606.18420 2026-06-18 cs.LG q-bio.QM stat.ML New submission

Measurement noise limits the advantage of nonlinear models over linear models in biomedical prediction

Marc-Andre Schulz, Kerstin Ritter

Affiliations * Hertie Institute for AI in Brain Health, University of Tübingen; Tübingen AI Center, University of Tübingen; Department of Psychiatry and Neurosciences, Charité – Universitätsmedizin Berlin; Bernstein Center for Computational Neuroscience, Berlin; German Center for Mental Health (DZPG), partner site Tübingen

AI summary The paper argues that, on biomedical tabular data, measurement noise attenuates nonlinear structure so that nonlinear models perform on par with linear ones. It presents an exact excess-risk identity and shows that three conditions (measurement reliability, sample size, and feature representation) must hold together for a nonlinear advantage to appear.

Abstract

On biomedical tabular data, flexible models such as deep networks, gradient-boosted trees, and kernel methods are repeatedly matched or beaten by linear and logistic regression given the same features. The usual reaction is to treat this as a model-side shortfall, to be fixed with more data, a better architecture, or tuning, on the assumption that the nonlinear structure is there and the model has failed to capture it. We argue that these fixes cannot help when the binding limit is the measurement rather than the model, as it frequently is in biomedicine. Additive noise blurs the population-optimal predictor, and because blurring removes a function's fine, rapidly varying detail before its broad shape, it erases nonlinear structure faster than linear structure. A degree-$k$ interaction is attenuated by the $k$-th power of feature reliability, while the linear part is attenuated only once. At the reliabilities typical of biomedical measurement, the nonlinear advantage can vanish even when the underlying biology is strongly nonlinear, and what the noise removes cannot be recovered by a larger cohort or a more flexible model, only by better measurement. The nonlinearity is hidden, not absent, and a tie between linear and flexible models is not by itself a verdict on the biology. These pieces are classical, drawn from measurement-error statistics, psychometrics, and Gaussian analysis, and we assemble them into an exact excess-risk identity. Measurement reliability is one of three conditions, alongside sample size and feature representation, that must align for a flexible model to help, and together they leave only a narrow window that most biomedical tasks fall outside. Across 140 UK Biobank tasks, the gap between flexible and linear models, where it exists, carries the predicted noise signature, and the three conditions can be separated by intervention but not by a benchmark alone.
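The attenuation rule stated in the abstract (a degree-$k$ interaction is attenuated by the $k$-th power of feature reliability) can be made concrete with a few lines of arithmetic. The sketch below simply evaluates that rule; it assumes, for illustration, that every feature in an interaction shares the same reliability, and the reliability values and function names are hypothetical rather than taken from the paper.

```python
def attenuation(reliability, degree):
    """Attenuation factor for a degree-`degree` term at the given feature reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return reliability ** degree

def surviving_share(reliability, max_degree=3):
    """Attenuation factors for degrees 1..max_degree at one reliability level."""
    return {k: attenuation(reliability, k) for k in range(1, max_degree + 1)}

# Hypothetical reliabilities, from well-measured to noisy features.
for reliability in (0.9, 0.7, 0.5):
    shares = surviving_share(reliability)
    row = "  ".join(f"degree {k}: {v:.3f}" for k, v in shares.items())
    print(f"reliability={reliability:.1f}  {row}")
```

Each extra degree multiplies the surviving factor by the reliability once more, so higher-order structure fades fastest as measurement quality drops.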

2606.18406 2026-06-18 cs.CL 新提交

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem: 对话代理中长期记忆的黎曼检索与Fisher引导蒸馏

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Peng Cheng Laboratory(鹏城实验室) Shandong Analysis and Test Center, Qilu University of Technology(齐鲁工业大学山东省分析测试中心) State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs(道地药材品质保障与可持续利用国家重点实验室)

AI总结 提出CoreMem架构,用黎曼检索替代余弦相似度解决高维检索枢纽问题,通过Fisher引导离散令牌蒸馏实现原则性压缩,在8GB显存边缘设备上实现长期记忆对话代理。

Comments 15 pages, 5 figures

详情
AI中文摘要

个性化对话代理需要持续的长期记忆以在多次会话中维持连贯交互。然而,在消费级硬件(例如8 GB VRAM边缘设备)上部署这些能力会引入严重的内存和计算瓶颈。现有系统通常依赖各向同性余弦相似度进行检索,以及启发式规则进行上下文压缩。这些方法缺乏统一的理论基础,经常在高维检索中遭受枢纽问题,并在压缩过程中出现句法碎片化。为克服这些限制,我们提出CoreMem,一种资源高效的边缘-云记忆架构,从根本上由信息几何统一。首先,黎曼检索用局部自适应Fisher-Rao度量替代余弦匹配,通过马氏距离有效惩罚枢纽记忆,并采用O(Ndr) Woodbury加速实现实时搜索。其次,Fisher引导离散令牌蒸馏(FDTD)引入分层句子到令牌压缩机制。它从Fisher信息迹中推导敏感度分数,提供原则性的压缩-KL权衡,并辅以显式结构句法保护。在LOCOMO和LongMemEval-S基准上评估,CoreMem实现了显著的准确率提升,在开放域(+4.51个百分点)和时间(+4.17个百分点)推理上取得实质性增益。大量性能剖析证实,CoreMem在严格的8 GB VRAM预算内无缝运行,成功弥合了资源受限的边缘设备与对有理论依据的终身记忆代理的需求之间的差距。

英文摘要

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
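
The O(Ndr) cost quoted in the abstract above is what a Woodbury-style low-rank inverse would give. The sketch below is an assumption-laden illustration, not CoreMem's implementation: it assumes the metric has the form Sigma = lam * I + U U^T with a d-by-r factor U, which the abstract does not state, and the function name and parameters are hypothetical. It returns squared Mahalanobis distances from one query to N stored vectors without forming any d-by-d matrix.

```python
import numpy as np


def woodbury_sq_mahalanobis(query, memories, U, lam=1.0):
    """Squared Mahalanobis distance under Sigma = lam * I + U @ U.T (assumed form).

    query: (d,), memories: (N, d), U: (d, r). Cost is O(N * d * r).
    """
    diff = memories - query                       # (N, d)
    proj = diff @ U                               # (N, r), the O(N d r) step
    small = lam * np.eye(U.shape[1]) + U.T @ U    # (r, r) system only
    solved = np.linalg.solve(small, proj.T).T     # (N, r)
    # Woodbury: Sigma^{-1} = (I - U (lam I + U^T U)^{-1} U^T) / lam
    plain = np.einsum("nd,nd->n", diff, diff)
    correction = np.einsum("nr,nr->n", proj, solved)
    return (plain - correction) / lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = rng.standard_normal((64, 4))
    mem = rng.standard_normal((1000, 64))
    q = rng.standard_normal(64)
    print(np.argsort(woodbury_sq_mahalanobis(q, mem, U))[:5])  # nearest memories
```

Directions spanned by U are down-weighted in the distance, so memories that look close to many queries along those directions score as less similar. This is one plausible reading of how a learned metric could penalize hubs.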

2606.18394 2026-06-18 cs.CL 新提交

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow: 通过并行树草稿突破推测解码的缩放上限

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

发表机构 * UC San Diego(加州大学圣地亚哥分校) Zhejiang University(浙江大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Nanjing University(南京大学) StepFun(阶跃星辰)

AI总结 提出JetFlow框架,通过因果并行草稿头结合树推测解码,将更大草稿预算转化为更长接受前缀和更高端到端加速,在Qwen3模型上实现最高9.64倍加速。

详情
AI中文摘要

推测解码(SD)通过草拟多个令牌并并行验证来加速自回归大语言模型(LLM),但面临缩放限制:仅当接受率保持较高且草拟开销较低时,增加草稿预算才能提高速度。这一上限难以突破,因为先前基于头的SD方法面临因果-效率困境。自回归草稿器生成路径条件候选,适用于树推测解码且接受长度更高,但其草拟成本随树深度增长。双向块扩散草稿器一次性生成所有位置,但其分支无关的边缘分布可能形成个体合理但相互不一致的树,浪费预算并降低接受率。我们提出JetFlow,一种基于头的SD框架,结合单次前向草拟效率与分支级因果条件。JetFlow在冻结目标模型的融合隐藏状态上训练因果并行草稿头,生成与目标模型自回归分解对齐的候选树。这使得JetFlow能够将更大的草稿预算转换为更长的接受前缀和更高的端到端加速。在密集和MoE Qwen3模型上的数学、编码和聊天基准测试中,JetFlow始终优于双向头和基于树的SD基线。在H100 GPU上,JetFlow在MATH-500上实现高达9.64倍加速,在开放式对话工作负载上实现4.58倍加速,并通过vLLM集成在实际服务负载下进一步降低延迟。我们的代码和模型可在 https://github.com/hao-ai-lab/JetFlow 获取。

英文摘要

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.
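
To make the notion of an accepted prefix in a draft tree concrete, here is a minimal sketch of generic greedy tree verification. It is not JetFlow's draft head or its verification kernel. The nested-dict tree layout, the function name, and the `target_next_token` callable are hypothetical, and a real system would score all tree nodes in one batched forward pass with a tree attention mask rather than in a Python loop.

```python
def longest_accepted_path(draft_tree, target_next_token, prefix):
    """Walk a draft tree and keep tokens that match the target's greedy choice.

    draft_tree: nested dict mapping a candidate token to its subtree.
    target_next_token: callable(list_of_tokens) -> token (stand-in for the
        target model's argmax; an assumption for illustration).
    prefix: tokens already committed before drafting.
    """
    accepted = []
    context = list(prefix)
    node = draft_tree
    while node:                       # stop at a leaf (empty subtree)
        wanted = target_next_token(context)
        if wanted not in node:        # no drafted branch matches the target
            break
        accepted.append(wanted)
        context.append(wanted)
        node = node[wanted]           # descend along the matching branch
    return accepted


if __name__ == "__main__":
    # Toy target model: the next token is always (last token + 1) mod 10.
    toy_target = lambda ctx: (ctx[-1] + 1) % 10
    tree = {1: {2: {3: {}, 9: {}}, 5: {}}, 4: {}}
    print(longest_accepted_path(tree, toy_target, prefix=[0]))
```

A larger tree only helps if its branches stay consistent with what the target would generate along that same path, which is the branch-wise conditioning issue the abstract raises.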

2606.18390 2026-06-18 cs.LG q-bio.QM 新提交

MOLAR: Learning Multimodal Molecular Representations from Noisy Labels

MOLAR: 从噪声标签中学习多模态分子表示

Yingxu Wang, Kunyu Zhang, Nan Yin, Yu Li, Eran Segal

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Zhengzhou University(郑州大学) The Education University of Hong Kong(香港教育大学) The Chinese University of Hong Kong(香港中文大学) Weizmann Institute of Science(魏茨曼科学研究所)

AI总结 提出MOLAR框架,通过分离干净属性推断与标签观测,利用图与文本模态的残差证据,从噪声标签中学习多模态分子表示,在自然噪声和标签翻转基准上优于基线方法。

详情
AI中文摘要

动机:噪声标签是分子属性预测中的常见挑战,因为分子注释通常来自实验测定、人工整理的数据库或弱注释流程,而非直接观测到的干净生物状态。将记录标签视为可靠监督会导致模型记忆损坏的观测并学习误导性的分子证据。在多模态分子表示学习中,图-文本融合或对齐可能放大此问题,从而跨模态传播标签引起的错误。结果:我们提出MOLAR,一个从噪声标签中学习多模态分子表示的噪声感知框架。MOLAR将潜在干净属性推断与记录标签观测分离:图和文本视图为干净属性分布贡献残差证据,一个分类标签观测通道将此分布映射到记录标签用于训练。该公式从模型中推导出后验标签可靠性和模态特定的分子证据。在自然噪声分子基准和受控标签翻转基准上的实验表明,MOLAR始终优于代表性基线。可视化分析进一步表明MOLAR提供了可解释的可靠性和模态证据诊断。

英文摘要

Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean biological states. Treating recorded labels as reliable supervision can cause models to memorize corrupted observations and learn misleading molecular evidence. In multimodal molecular representation learning, this issue can be amplified by graph-text fusion or alignment, which may propagate label-induced errors across modalities. Results: We propose MOLAR, a noise-aware framework for learning multimodal molecular representations from noisy labels. MOLAR separates latent clean-property inference from recorded-label observation: graph and text views contribute residual evidence to a clean-property distribution, and a categorical label-observation channel maps this distribution to recorded labels for training. This formulation derives posterior label reliability and modality-specific molecular evidence from the model. Experiments on naturally noisy molecular benchmarks and controlled label-flipping benchmarks show that MOLAR consistently outperforms representative baselines. Visualization analyses further show that MOLAR provides interpretable reliability and modality-evidence diagnostics.

2606.18389 2026-06-18 cs.CL 新提交

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

想要更好的合成数据?引导它:面向低资源语言生成的激活引导

Jan Cegin, Daniil Gurgurov, Yusser Al Ghussin, Simon Ostermann

发表机构 * Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所) German Research Institute for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

AI总结 提出激活引导作为低资源语言合成数据生成的替代方法,包括语言引导和质量引导,实验表明早期层引导能提升数据多样性和下游模型性能。

Comments 25 pages

详情
AI中文摘要

大型语言模型(LLMs)已成为合成数据生成的有效工具,包括低资源语言,生成的数据可以提升下游任务性能。当前最佳方法通常依赖于目标语言示例的少样本提示,这增加了推理成本,并可能通过词汇锚定降低多样性。在这项工作中,我们研究激活引导作为低资源合成数据生成的替代方案。我们研究了两种引导策略:语言引导,针对语言的语言身份;以及质量引导,通过对比人类撰写和反向翻译的文本表示来捕捉良好形式性。我们在四个开源LLM、多个层和11种类型多样的语言上评估这些方法,通过生成情感和主题分类数据并微调较小的分类器。引导在零样本和少样本提示设置中应用,并与非引导对应方法进行比较。我们的结果表明,早期层的引导一致地提高了生成数据的多样性,同时通常产生更强的下游模型性能,特别是对于低资源语言。

英文摘要

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA 新提交

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero: 通过LLM智能体发现RL后训练的自适应训练策略

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

发表机构 * Amazon(亚马逊)

AI总结 提出LLMZero系统,利用LLM智能体通过树搜索发现多阶段RL后训练的自适应策略,揭示容量参数单调累积、正则化参数振荡的规律,在4个GRPO任务上相对基线提升9%-140%。

详情
AI中文摘要

RL后训练策略依赖于数据集,并揭示了一个反复出现的经验模式:容量参数在阶段间单调累积,而正则化参数主要根据训练动态的变化而振荡。这种区别很重要,因为固定调度将所有参数提交到固定轨迹,因此无法表达正则化必须跟踪的非平稳探索-利用权衡;该原则为多阶段训练提供了可操作的设计规则。我们通过LLMZero发现了这一点,该系统通过树搜索让LLM智能体搜索训练轨迹,诊断每个检查点的病理并提出协调的多参数转换。在4个不同的GRPO任务中,LLMZero发现的策略相对基础模型提升9%到140%,相对网格搜索提升6%到15%,始终优于随机搜索和基于技能的智能体。该结构原则跨任务迁移,解释了为什么发现的策略形式不同但参数动态相似。

英文摘要

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.18385 2026-06-18 cs.AI 新提交

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT:一种可解释的视觉-语言模型框架

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

发表机构 * Vector Institute(向量研究所)

AI总结 提出CaVe-VLM-CoT框架,通过五阶段闭环流水线(提取器、检索器、求解器、引用注入器、验证器)实现证据推理,并引入CaVeScore复合指标评估检索质量、引用忠实度和跨模态基础,在ScienceQA和MMMU上取得性能提升。

详情
AI中文摘要

视觉-语言模型(VLM)仍然容易产生幻觉,生成流畅但在视觉上不忠实的输出。现有的思维链和检索增强方法仅部分解决了这一问题,因为它们既没有强制执行步骤级引用基础,也没有将验证失败路由回检索以进行纠正。我们提出了CaVe-VLM-CoT,一个模块化的基于反思的智能体RAG框架,通过五阶段闭环流水线强制执行证据推理:提取器、检索器、求解器、引用注入器和验证器,其中检测到的无根据声明会触发结构化反馈给提取器以进行针对性重新检索。由于现有框架没有联合衡量检索质量、逐步引用忠实度和跨模态基础,我们提出了一套涵盖所有阶段的23个组件级指标,以CaVeScore为核心,这是一个加权准确性、引用精确率和召回率、归因和证据基础的复合指标。无需任何架构或提示修改,CaVe-VLM-CoT在ScienceQA上达到87.1%的准确率和56.6%的CaVeScore,在MMMU(30个学科)上达到55.2%的准确率和35.7%的CaVeScore。

英文摘要

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

2606.18384 2026-06-18 cs.LG cs.DC 新提交

SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System

SCOPE-FL:一种策略证明的基于链的最优帕累托高效联邦学习系统

Seyed Salar Ghazi, Kaiwen Zhang, Mehdi Feizi, Hans-Arno Jacobsen

发表机构 * École de Technologie Supérieure (ÉTS)(高等技术学院) Ferdowsi University of Mashhad(菲尔多西大学) University of Toronto(多伦多大学)

AI总结 针对分层联邦学习中客户端选择策略缺乏帕累托效率和策略证明性导致整体福利下降的问题,提出SCOPE-FL框架,采用顶级交易循环算法同时保证帕累托最优和策略证明性,并通过区块链智能合约实现奖励分配。

详情
AI中文摘要

分层联邦学习(HFL)能够在分布式设备间实现可扩展的协作模型训练,同时保护数据隐私。然而,现有的HFL客户端选择机制存在根本性的策略低效问题。通过优先考虑稳定性而非帕累托效率(PE),它们产生次优的资源分配,并且缺乏策略证明性(SP),参与者有动机歪曲其真实偏好,这两种失败在实践中都会在帕累托意义上降低系统整体福利。为解决这一问题,我们提出SCOPE-FL(策略证明的基于链的最优帕累托高效联邦学习),一种同步HFL框架,将客户端选择建模为双边学校选择问题,通过顶级交易循环(TTC)算法求解,同时保证PE和SP。对于奖励分配,SCOPE-FL采用基于一轮重建(OR)的可扩展沙普利值近似,确保补偿与每个客户端的贡献成比例。整个机制通过区块链智能合约执行,为SP保证在实践中成立提供了防篡改环境。在MNIST、Fashion-MNIST和CIFAR-10上的综合评估表明,SCOPE-FL在模型准确率、收敛速度和奖励效率方面优于现有最先进方法(包括DA、IAS等),同时通信延迟与DA相当,区块链开销在大规模下显著低于DA。

英文摘要

Hierarchical Federated Learning (HFL) enables scalable collaborative model training across distributed devices while preserving data privacy. However, existing HFL client selection mechanisms suffer from a fundamental strategic inefficiency. By prioritizing stability over Pareto efficiency (PE), they produce suboptimal resource allocations, and without strategy-proofness (SP), participants are incentivized to misrepresent their true preferences; in practice, both failures degrade overall system welfare in the Pareto sense. To address this, we propose SCOPE-FL (Strategy-proof Chain-based Optimal pareto efficient Federated Learning), a synchronous HFL framework that formulates client selection as a two-sided school choice problem solved through the Top Trading Cycle (TTC) algorithm, which simultaneously guarantees PE and SP. For reward distribution, SCOPE-FL employs a scalable Shapley value approximation based on One-Round Reconstruction (OR), ensuring compensation proportional to each client's contribution. The entire mechanism executes via blockchain smart contracts, providing the tamper-proof environment required for the SP guarantees to hold in practice. A comprehensive evaluation on MNIST, Fashion-MNIST, and CIFAR-10 demonstrates that SCOPE-FL outperforms state-of-the-art approaches, including DA, IAS, and other methods, in model accuracy, convergence rate, and reward efficiency, while achieving communication latency comparable to DA and blockchain overhead significantly lower than DA at scale.
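
For readers unfamiliar with the Top Trading Cycle mechanism named in the abstract above, the following is a minimal sketch of textbook school-choice TTC with capacities. It is not the SCOPE-FL implementation. Reading "students" as clients and "schools" as edge servers is an interpretive assumption, the function and variable names are hypothetical, and every school is assumed to rank every student.

```python
def top_trading_cycles(student_prefs, school_priorities, capacities):
    """Textbook TTC for school choice (illustrative sketch).

    student_prefs: dict student -> list of schools, best first.
    school_priorities: dict school -> list of ALL students, highest first.
    capacities: dict school -> number of seats.
    Returns dict student -> school, or None if no acceptable seat is left.
    """
    seats = dict(capacities)
    remaining = set(student_prefs)
    assignment = {}
    while remaining:
        # Each remaining student points at their best school with a free seat.
        target = {}
        for s in sorted(remaining):
            choice = next((c for c in student_prefs[s] if seats.get(c, 0) > 0), None)
            if choice is None:
                assignment[s] = None          # nothing acceptable is left
            else:
                target[s] = choice
        remaining = set(target)
        if not remaining:
            break
        # Each pointed-at school points at its highest-priority remaining student.
        owner = {c: next(s for s in school_priorities[c] if s in remaining)
                 for c in set(target.values())}
        # Follow student -> school -> student pointers until a student repeats.
        path, seen = [], set()
        s = min(remaining)
        while s not in seen:
            seen.add(s)
            path.append(s)
            s = owner[target[s]]
        cycle = path[path.index(s):]
        # Everyone on the cycle receives the seat they point at and leaves.
        for member in cycle:
            assignment[member] = target[member]
            seats[target[member]] -= 1
            remaining.discard(member)
    return assignment


if __name__ == "__main__":
    prefs = {"a": ["Y", "X", "Z"], "b": ["X", "Y", "Z"], "c": ["X", "Y", "Z"]}
    prio = {"X": ["a", "b", "c"], "Y": ["b", "a", "c"], "Z": ["c", "a", "b"]}
    print(top_trading_cycles(prefs, prio, {"X": 1, "Y": 1, "Z": 1}))
```

In the toy run, a and b each hold top priority at the other's preferred school, so they trade along a two-step cycle and both receive their first choice. The sketch removes one cycle per pass, relying on the standard property that the TTC outcome does not depend on the order in which cycles are cleared.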

2606.18383 2026-06-18 cs.LG cs.CL 新提交

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

从稀疏特征到可信代理:认证基于SAE的可解释性

Dibyanayan Bandyopadhyay, Asif Ekbal

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Patna(印度理工学院巴特那分校计算机科学与工程系)

AI总结 提出一种后验泛化框架,通过稀疏代理(SAE重建)认证语言模型,推导期望风险上界,并在GPT-2 Small等模型上验证非平凡界,揭示深层更易认证且特征分解区分语义对齐与统计稀疏性。

详情
AI中文摘要

稀疏自编码器(SAE)越来越多地被用于从语言模型(LM)中提取可解释特征,但一个核心问题仍然存在:基于SAE的解释何时可以被视为底层冻结LM的忠实视图?我们通过一个后验泛化框架来研究这个问题,该框架通过稀疏代理来认证LM,稀疏代理是通过将原生隐藏激活替换为其预训练的SAE重建而获得的。我们的框架使用四个可测量量推导出基础模型期望风险的上界:代理风险、SAE重建差距、概念池不匹配和稀疏复杂度。我们将此证书解释为解释忠实性的操作标准。特别地,非平凡界表明提取的稀疏特征保留了有意义的预测信息,而小的重建和匹配误差表明代理在行为上接近原始模型。实验上,我们展示了在GPT-2 Small、Gemma-2B和Llama-3-8B上,该界在实际样本量下变得非平凡。对Llama-3-8B的详细逐层分析揭示了强烈的深度依赖性,较深层变得更容易认证,这与更强的局部保真度和更弱的下游误差放大相关。最后,通过特征洗牌消融,我们展示了分解区分了真正的语义对齐与单纯的统计稀疏性,为基于SAE的解释何时变得不太可靠提供了有用的诊断。

英文摘要

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

2606.18381 2026-06-18 cs.CL cs.IR 新提交

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

SproutRAG: 基于注意力引导的树搜索与渐进嵌入的长文档RAG

Amirhossein Abaskohi, Issam H. Laradji, Peter West, Giuseppe Carenini

发表机构 * University of British Columbia(不列颠哥伦比亚大学) ServiceNow Research(ServiceNow研究院)

AI总结 提出SproutRAG,通过注意力引导构建句子级分块树,实现多粒度检索,无需额外LLM调用,平均信息效率提升6.1%。

详情
AI中文摘要

检索增强生成(RAG)系统必须平衡检索粒度与上下文连贯性,现有方法通过LLM引导的分块、单级上下文扩展或层次摘要来解决这一挑战。这些方法在索引或检索过程中依赖昂贵的LLM调用,将上下文聚合限制在单一粒度级别,或通过摘要引入信息损失。我们提出SproutRAG,一种注意力引导的层次化RAG框架,通过将句子级块组织成逐渐增大但语义连贯的单元,利用学习到的句子间注意力构建二分块树,从而解决这一权衡。与依赖外部LLM、固定上下文扩展或有损摘要的先前方法不同,SproutRAG学习哪些注意力头和层最能捕捉语义文档结构,实现无需额外LLM调用或压缩摘要的多粒度检索。在检索时,SproutRAG使用层次化束搜索检索多个粒度的候选,捕获超越平面检索的多句子相关性。该框架通过联合目标进行端到端训练,同时改进嵌入和树结构。在涵盖科学、法律和开放域设置的四个基准上的实验表明,SproutRAG在最强基线上平均信息效率(IE)提升6.1%。代码可在 https://github.com/AmirAbaskohi/SproutRAG 获取。

英文摘要

Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available at https://github.com/AmirAbaskohi/SproutRAG.

2606.18375 2026-06-18 cs.RO 新提交

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

PAIWorld: 用于机器人操作的三维一致世界基础模型

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

发表机构 * Institute of AI for Industries, Chinese Academy of Sciences(中国科学院人工智能产业研究院)

AI总结 提出PAIWorld框架,通过几何感知交叉注意力、几何旋转位置编码和潜在3D-REPA蒸馏,解决多视图世界模型的3D不一致问题,在机器人操作基准上取得领先性能。

详情
AI中文摘要

世界基础模型(WFMs)是强大的模拟器,但它们主要运行在单视图设置中,缺乏机器人操作所需的多视图3D一致性。虽然机器人系统依赖多个摄像头(自我中心、眼到手和腕装)进行策略学习,但当前的多视图世界模型只是简单地拼接视图标记,没有显式的几何推理。这导致跨视图物体漂移、深度不一致和纹理错位。我们将这些失败归因于两个缺陷:缺乏显式的视图间通信机制和缺乏3D几何先验。我们认为同时解决这两个问题是必要且充分的。为此,我们提出PAIWorld,一个通过三个核心组件增强扩散变换器世界模型的框架:(1)几何感知交叉注意力块,建立跨视图的显式通路;(2)几何旋转位置编码,将相机射线方向和外部姿态编码到注意力机制中;(3)潜在3D-REPA,从冻结的3D基础模型中蒸馏3D感知特征以确保3D一致性。基于DiT世界基础模型,PAIWorld在机器人操作基准上实现了最先进的多视图3D一致性,在WorldArena排行榜上排名第一,在AgiBot-Challenge2026排行榜上排名第二,同时支持基于模型的规划、世界动作模型和多视图策略后训练等下游应用。

英文摘要

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.

2606.18372 2026-06-18 cs.CL cs.AI 新提交

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

保留还是删除?用于教育对话去标识的完全本地AI级联框架

Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, René F. Kizilcec

发表机构 * Cornell University(康奈尔大学)

AI总结 针对教育对话中课程术语与个人身份信息混淆的问题,提出一种完全本地的级联框架,通过召回优先的联合提议器和上下文感知审查器实现约束性隐私分类,在数学辅导对话上达到0.958的宏F1,优于商业API和纯LLM基线。

详情
AI中文摘要

教育对话是研究中有价值但敏感的资源:捕捉真实学习的同一份转录往往也包含与课程内容纠缠的个人身份信息(PII),其中“Riemann”可能指真实学生或数学概念。现有方法在治理和准确性之间强制权衡。商业大型语言模型(LLM)可以处理这种歧义,但需要将学生数据发送给第三方,而本地命名实体识别(NER)系统保留治理但过度删除课程术语。我们提出一个完全本地的级联框架,将去标识从开放式实体识别重新定义为约束性隐私分类。一个召回优先的联合提议器结合两个轻量级编码器和确定性规则,过度生成候选跨度;然后一个上下文感知审查器利用周围对话和说话者角色对每个候选做出二元的保留/删除决策。我们在两个大型平台的数学辅导转录上评估了三种审查器配置,与同系列纯LLM基线和商业API进行比较。最强的本地配置达到0.958宏F1,而同系列纯LLM基线为0.767,商业API为0.706,同时完全在单个笔记本电脑上运行。在针对课程-人名歧义的挑战集上,相同配置仅下降0.03 F1,而较小审查器下降0.19至0.25。这些结果表明,对于教育去标识,问题表述比模型规模更重要。

英文摘要

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.

2606.18367 2026-06-18 cs.LG 新提交

Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting

时间序列基础模型基准是否隐藏了依赖于状态的失败?来自交通速度预测的证据

Yingshuo Wang, Xian Sun, Lingdong Kong, Wei Gao, Yanhang Li, Zhichao Fan, Zexin Zhuang

发表机构 * University of California, Berkeley(加州大学伯克利分校) Duke University(杜克大学) National University of Singapore(新加坡国立大学) Northeastern University(东北大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Southern Methodist University(南卫理公会大学)

AI总结 本文提出状态分层评估方法,发现时间序列基础模型在交通状态转换时准确率和预测区间覆盖率显著下降,并提出了双峰混合增强方法以改善转换状态覆盖。

Comments 5 pages, 2 figures. Accepted at the Workshop on Forecasting as a New Frontier of Intelligence, ICML 2026

详情
AI中文摘要

标准基准使用聚合指标评估时间序列基础模型(TSFMs),但这可能掩盖关键运行状态下的严重失败。我们引入了状态分层评估,并将其应用于两个标准交通速度基准上的三个TSFMs。交通在自由流和拥堵状态之间表现出突然的状态切换,在转换期间产生双峰速度分布。当我们按交通状态分层时,准确率和预测区间覆盖率在转换期间急剧下降:转换状态的MAE达到11 mph(而总体为3 mph),90%预测区间的经验覆盖率低至55%。这些失败在聚合指标中不可见,因为自由流观测主导了样本。一个简单的历史条件基线(从每个传感器的训练分布中采样)实现了比任何TSFM更好的转换覆盖率,但总体准确率差得多。我们提出了双峰混合增强(BMA),一种后处理方法,将TSFM预测与历史分布知识相结合,在保持TSFM准确率的同时接近历史基线的转换覆盖率。我们的结果表明,TSFM基准应纳入状态感知评估,以揭示聚合指标隐藏的失败。

英文摘要

Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes. We introduce regime-stratified evaluation and apply it to three TSFMs on two standard traffic speed benchmarks. Traffic exhibits abrupt regime switching between free-flow and congested states, producing bimodal speed distributions during transitions. When we stratify by traffic regime, both accuracy and prediction-interval coverage degrade sharply during transitions: transition-regime MAE reaches 11 mph (versus 3 mph overall), and empirical coverage of 90% prediction intervals drops as low as 55%. These failures are invisible in aggregate metrics because free-flow observations dominate the sample. A simple historical conditional baseline (sampling from per-sensor training distributions) achieves better transition coverage than any TSFM, but has far worse overall accuracy. We propose bimodal mixture augmentation (BMA), a post-hoc method that combines TSFM forecasts with historical distributional knowledge, approaching the historical baseline's transition coverage while preserving the TSFM's accuracy. Our results suggest that TSFM benchmarks should incorporate regime-aware evaluation to surface failures that aggregate metrics hide.
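
The regime-stratified evaluation described in the abstract above amounts to computing the same metrics per regime rather than only over the pooled sample. The sketch below shows one way to do that for MAE and prediction-interval coverage. How regimes are labeled (for example, by speed thresholds) is not stated in the abstract, so the `regimes` array is assumed to be given, and the function name is hypothetical.

```python
import numpy as np


def regime_stratified_report(y_true, y_pred, lower, upper, regimes):
    """Return MAE, interval coverage, and sample count overall and per regime.

    lower/upper: bounds of the prediction interval for each observation.
    regimes: one label per observation (labeling rule is an assumption).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    regimes = np.asarray(regimes)

    abs_err = np.abs(y_true - y_pred)
    covered = (y_true >= lower) & (y_true <= upper)

    # One boolean mask for the pooled sample plus one per regime label.
    masks = {"overall": np.ones(len(y_true), dtype=bool)}
    for label in np.unique(regimes):
        masks[str(label)] = regimes == label

    return {
        name: {
            "mae": float(abs_err[mask].mean()),
            "coverage": float(covered[mask].mean()),
            "n": int(mask.sum()),
        }
        for name, mask in masks.items()
    }


if __name__ == "__main__":
    report = regime_stratified_report(
        y_true=[60, 62, 30, 45],
        y_pred=[61, 60, 40, 35],
        lower=[55, 55, 35, 30],
        upper=[65, 65, 50, 40],
        regimes=["free_flow", "free_flow", "transition", "transition"],
    )
    for name, stats in report.items():
        print(name, stats)
```

When free-flow observations vastly outnumber transitions, the pooled row is dominated by the easy regime. The per-regime rows are what expose the kind of failure the abstract describes.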

2606.18363 2026-06-18 cs.RO cs.AI 新提交

Guava: An Effective and Universal Harness for Embodied Manipulation

Guava: 一种有效且通用的具身操作工具框架

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

发表机构 * University of Maryland College Park(马里兰大学帕克分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Waterloo(滑铁卢大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) University of Pennsylvania(宾夕法尼亚大学) Amazon FAR(亚马逊 FAR)

AI总结 提出Guava框架,通过迭代感知-推理-行动循环、语义动作抽象和多模态观测三大关键设计,将具身操作能力蒸馏到4B开源模型中,在仿真和真实环境中性能媲美前沿专有模型。

详情
AI中文摘要

在大规模视觉-语言数据上训练的语言模型已展现出作为具身智能体的强大潜力。通过具身工具使用来驾驭模型,为端到端的视觉-语言-行动系统提供了一种有前景的替代方案,它将高层推理与外部模块(用于感知、规划和控制)相结合。然而,对于具身操作而言,什么构成了有效的工具框架,以及这种框架能在多大程度上解锁广泛推理模型的具身能力,仍不清楚。在这项工作中,我们提出了Guava,一个通过系统探索智能体工作流、动作空间和观测空间的设计空间而开发的具身工具使用框架。我们的研究确定了有效具身智能体的三个关键要素:迭代感知-推理-行动循环、语义动作抽象和多模态观测。为了理解这些设计原则是否对小型模型也具有普适性,我们开发了一个端到端的训练流程,利用完全在仿真中收集的不到2000条轨迹,将具身操作能力蒸馏到一个4B开源模型中。在仿真和真实环境中的实验结果表明,其性能与前沿专有模型相当,同时展现出对未见物体、新指令和长时域任务的强大泛化能力。结果表明,一个精心设计的框架可以作为具身操作的可扩展、模型无关的接口,使紧凑的开源模型在极少的训练数据下展现出强大的涌现具身能力。

英文摘要

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles extend even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

2606.18328 2026-06-18 cs.RO 新提交

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

恢复、发现、规划:从机器人失败中学习技能与概念

Bowen Li, Mayank Mishra, Y. Isabel Liu, Stone Tao, Nishanth Kumar, Alexander G. Gray, Ruwan Wickramarachchi, Jonathan Francis, Sebastian Scherer, Tom Silver

发表机构 * CMU(卡内基梅隆大学) Princeton(普林斯顿大学) AI2(艾伦人工智能研究所) MIT(麻省理工学院) Centaur AI Bosch Center for AI(博世人工智能中心)

AI总结 提出ReSYNC方法,通过技能学习与概念发现的交替过程,从失败恢复经验中逐步构建抽象谓词,实现全局失败避免和长期规划,性能提升超50%。

Comments 9 pages, 6 figures. Website: https://jaraxxus-me.github.io/ReSYNC/

Abstract

Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.

2606.18327 2026-06-18 cs.LG cs.AI New submission

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas

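Affiliations * MIT CSAIL (MIT Computer Science and Artificial Intelligence Laboratory)

AI summary Proposes Self-CTRL, which uses reinforcement learning to optimize the consistency between a language model's self-explanations and its behavior, substantially improving consistency and safety on probabilistic reasoning and constitutional AI tasks.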

Comments 34 pages, 12 figures, includes appendices

Abstract

Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally measured latent biases from $R^2=0.24$ to $R^2=0.64$ on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests, improving the refusal predictions of a third-party auditor model from $36\%$ to $92\%$. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from $15.0\%$ to $0.5\%$ without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.
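As a rough illustration of the $R^2$ figures quoted in this abstract, the sketch below computes a squared Pearson correlation between self-reported and behaviorally measured biases. The abstract does not say which $R^2$ definition the authors use, so the squared-Pearson choice is an assumption, and the function name `r_squared` and the data values are hypothetical.

```python
def r_squared(reported, measured):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(reported)
    if n != len(measured) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_r = sum(reported) / n
    mean_m = sum(measured) / n
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(reported, measured))
    var_r = sum((r - mean_r) ** 2 for r in reported)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_r == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_r * var_m)


# Made-up numbers for illustration only; they are not results from the paper.
self_reported = [0.10, 0.30, 0.50, 0.70, 0.90]  # bias the model says it has
measured_bias = [0.15, 0.20, 0.55, 0.60, 0.95]  # bias estimated from its samples
score = r_squared(self_reported, measured_bias)
print(f"R^2 = {score:.2f}")
```

A value near 1 would mean the model's self-reports track its measured behavior closely, and a value near 0 would mean they carry little information about it.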

2606.18326 2026-06-18 cs.LG New submission

Neural Network Implementation of the Renormalization Group for Fault Diagnosis with Class Imbalance

Evgeny Nikulchev, Dmitry Ilin

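Affiliations * MIREA – Russian Technological University

AI summary Proposes RGNet, a neural network architecture based on the renormalization group concept that handles class imbalance and multidimensional noise through hierarchical coarse-graining of the feature space; its effectiveness is validated on the AI4I dataset.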

Comments 8 pages

Abstract

The application of machine learning models in practical tasks faces challenges such as class imbalance and multidimensional noise. This paper proposes RGNet, a neural network architecture based on the concept of the renormalization group (RG), for hierarchical coarse-graining of the feature space. The model sequentially compresses the input dimensionality and concatenates all scales before classification, allowing it to capture both local details and global patterns. The notion of RG-flows is introduced: interpretable low-dimensional representations whose visualization via t-SNE reveals a discrete curvilinear structure, confirming the effectiveness of coarse-graining. Experimental results are presented on the imbalanced AI4I dataset. The results demonstrate that RGNet is a universal, interpretable, and competitive solution for fault prediction in applications with imbalanced classes.
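The "compress sequentially, then concatenate all scales" idea described in this abstract can be sketched in a few lines of plain Python. This is only a minimal sketch under stated assumptions, not the authors' RGNet: the layer widths, the ReLU nonlinearity, the random initialization, the inclusion of the raw input among the concatenated scales, and the helper names `dense`, `init_layer`, and `multiscale_features` are all hypothetical.

```python
import random


def dense(x, weights, biases):
    """Affine map followed by ReLU: one coarse-graining step (assumed form)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer mapping n_in to n_out features."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def multiscale_features(x, layers):
    """Compress x step by step and concatenate every scale, input included."""
    scales = [list(x)]
    for weights, biases in layers:
        scales.append(dense(scales[-1], weights, biases))
    return [value for scale in scales for value in scale]


rng = random.Random(0)
widths = [8, 4, 2]  # hypothetical sizes: 8 input features, then 4, then 2
layers = [init_layer(a, b, rng) for a, b in zip(widths, widths[1:])]
x = [rng.gauss(0.0, 1.0) for _ in range(widths[0])]
features = multiscale_features(x, layers)
print(len(features))  # 8 + 4 + 2 = 14
```

A classifier head would take `features` as its input, so it sees the fine-grained values and the coarser summaries together.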