arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.21602 2026-05-26 cs.AI cs.SE

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Dylan Feng, Pragya Srivastava, Anca Dragan, Cassidy Laidlaw

AI summary: Addressing safety and alignment failures of large language models in out-of-distribution situations, the paper proposes the MOOD benchmark and shows that combining guard models with OOD detectors improves monitoring recall.

Abstract

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem. We release the code and data for our experiments publicly, and you can find the relevant links here: https://github.com/Dylan102938/mood-bench.
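A minimal sketch of the monitor idea described above: a Mahalanobis-distance OOD detector fitted on in-distribution features and combined with a guard-model score. The OR-style combination rule, the 99th-percentile threshold, the guard threshold, and the synthetic 4-dimensional features are all assumptions made for illustration; the abstract does not say how the released MOOD code combines the two signals.

```python
import numpy as np

def fit_gaussian(train_feats):
    """Estimate the mean and a regularized inverse covariance of in-distribution features."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis(feats, mu, cov_inv):
    """Mahalanobis distance of each row of feats from the fitted Gaussian."""
    diff = feats - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

def monitor_flags(guard_scores, ood_scores, guard_thresh, ood_thresh):
    """Assumed combination rule: flag if the guard model OR the OOD detector fires."""
    return (guard_scores >= guard_thresh) | (ood_scores >= ood_thresh)

# Synthetic stand-in for features extracted from the restricted training set.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
mu, cov_inv = fit_gaussian(train)

# Threshold chosen as the 99th percentile of training distances (an assumption).
ood_thresh = np.quantile(mahalanobis(train, mu, cov_inv), 0.99)

# One typical input and one far-away input; the guard model misses both.
test_feats = np.array([[0.0, 0.0, 0.0, 0.0], [8.0, 8.0, 8.0, 8.0]])
guard_scores = np.array([0.1, 0.2])
flags = monitor_flags(guard_scores, mahalanobis(test_feats, mu, cov_inv),
                      guard_thresh=0.5, ood_thresh=ood_thresh)
```

In this toy setup the second input is flagged only because of its distance from the training distribution, which is the kind of recall gain the abstract attributes to adding OOD detection.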

2605.20787 2026-05-26 cs.CV

Findings of the Counter Turing Test: AI-Generated Image Detection

Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

AI summary: Through the Counter Turing Test competition at the Defactify 4.0 workshop, this paper evaluates how well various detection methods distinguish AI-generated from real images and identify the specific generative model; detection accuracy is high, but model identification remains challenging.

Comments: Defactify4 @AAAI 2025

Abstract

The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To support both tasks, we employed the MS COCOAI dataset, a benchmark of 96000 real and synthetic images generated by five state-of-the-art models alongside real images from MS COCO. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score > 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms.

2605.20772 2026-05-26 cs.CV

VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering

Jiayi Chen, Benteng Ma, Zehui Liao, Winston Chong, Yasmeen George, Jianfei Cai

AI summary: Proposes VIHD, which calibrates semantic entropy through Visual Dependency Probing and Visual Intervention Decoding to detect hallucinated responses from medical multimodal large language models.

Comments: Early accepted by MICCAI 2026. This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

Abstract

While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination detection. VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal. Extensive experiments on three medical VQA benchmarks with two medical MLLMs demonstrate that VIHD consistently outperforms state-of-the-art methods, underscoring the importance of fine-grained visual dependency for hallucination detection. The code will be available at https://github.com/Jiayi-Chen-AU/VIHD
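For readers unfamiliar with semantic entropy, the quantity that VIHD calibrates, the sketch below computes the plain (uncalibrated) version: sample several answers, group them by meaning, and take the entropy of the cluster frequencies. The cluster labels here are given by hand; how answers are clustered, and how VIHD's visual-token masking changes the distribution, are not shown and are not implied by this code.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical cluster labels for four sampled answers to the same question.
consistent = ["A", "A", "A", "A"]   # all samples agree in meaning
scattered = ["A", "B", "C", "D"]    # every sample means something different

low = semantic_entropy(consistent)
high = semantic_entropy(scattered)
```

Higher entropy means the sampled answers disagree in meaning, which is commonly read as a sign of a less reliable response.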

2605.20761 2026-05-26 cs.CL

Findings of the Counter Turing Test: AI-Generated Text Detection

Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

AI summary: Through the Counter Turing Test (CT2) shared tasks, this paper evaluates AI-generated text detection techniques; binary classification performs strongly (F1 = 1.0000), while model attribution is more challenging (best F1 = 0.9531). It also analyzes the strengths and weaknesses of fine-tuned Transformers, ensemble learning, and other methods.

Comments: Defactify4 @AAAI 2025

Abstract

The growing capability of large language models to produce fluent, contextually coherent text has created mounting pressure on the systems and institutions responsible for ensuring the authenticity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.

2605.20749 2026-05-26 cs.LG cs.AI

The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Peisong Wen, Qingming Huang

AI summary: Through neural tangent kernel analysis, finds that Gated Linear Units (GLU) accelerate optimization convergence by reshaping the kernel spectrum and reducing the condition number, rather than mainly reducing the generalization gap.

Comments: Accepted by ICML 2026

Abstract

Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap. The code is available at: https://github.com/Zemdalk/GLU-NTK.
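Why a smaller condition number speeds up training can be seen in a toy kernel-regime calculation: under gradient descent with a fixed kernel, the residual along each eigen-direction shrinks by a factor of (1 - lr * eigenvalue) per step. The sketch below compares two made-up spectra; the eigenvalues are illustrative assumptions, not values from the paper, and no GLU network is actually built.

```python
import numpy as np

def residual_norm_after(eigvals, steps):
    """Residual norm after `steps` of kernel gradient descent, starting from a
    unit residual along every eigen-direction, with step size 1 / lambda_max."""
    eigvals = np.asarray(eigvals, dtype=float)
    lr = 1.0 / eigvals.max()
    factors = (1.0 - lr * eigvals) ** steps
    return np.linalg.norm(factors)

wide = [1.0, 0.01]      # condition number 100
compact = [1.0, 0.5]    # condition number 2

kappa_wide = np.linalg.cond(np.diag(wide))
kappa_compact = np.linalg.cond(np.diag(compact))

slow = residual_norm_after(wide, steps=50)
fast = residual_norm_after(compact, steps=50)
```

With the same number of steps, the compact spectrum leaves a far smaller residual, because its smallest eigenvalue is not dwarfed by its largest.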

2605.19739 2026-05-26 cs.CV

FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models

Yi Sun, Zhiqi Zhang, Xinhao Zhong, Yimin Zhou, Shuoyang Sun, Bin Chen, Shu-Tao Xia, Ke Xu

AI summary: Proposes FlowErase-RL, the first GRPO-based framework, which recasts concept erasure as a reward optimization problem through a dynamic dual-path reward mechanism, suppressing target concepts while preserving generative fidelity and achieving state-of-the-art erasure performance and robustness.

Abstract

Recent advances in flow matching models have significantly improved text-to-image generation quality, but also introduce growing safety risks due to the generation of harmful or undesirable content. Existing concept erasure methods are either inference-time interventions with limited effectiveness or rely on supervised fine-tuning (SFT), which requires precisely aligned data and struggles with scalability and multi-concept settings. In this paper, we propose FlowErase-RL, the first GRPO-based framework for concept erasure in flow matching models. We reformulate concept erasure as a reward optimization problem and introduce a dynamic dual-path reward mechanism that jointly optimizes (i) a Concept Erasure (CE) reward to suppress target concepts and (ii) a Non-target Space (NS) reward to preserve generative fidelity. The two reward paths are adaptively balanced during training via a performance-driven switching strategy, enabling stable optimization without explicit supervision. Extensive experiments on nudity, object, and artistic style erasure demonstrate that our method achieves state-of-the-art erasure performance while maintaining strong image quality and semantic alignment. Moreover, it exhibits robust resistance to adversarial attacks and scales effectively to multi-concept scenarios. Our results establish a new paradigm for safe and controllable generation in flow matching models.
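GRPO (Group Relative Policy Optimization) scores each sample relative to the other samples generated for the same prompt, so no separate value network is needed. The sketch below shows only that group-relative normalization step; the reward values are placeholders, and the paper's CE and NS rewards and its switching strategy are not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Placeholder rewards for four images generated from one prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group average get a positive advantage and are reinforced; samples below it are pushed down.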

2605.18797 2026-05-26 cs.LG cs.AI

Simply Stabilizing the Loop via Fully Looped Transformer

Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang

AI summary: To address the training instability of Looped Transformers as the number of iterations grows, proposes the Fully Looped Transformer, whose two parameter-free modifications (a fully looped architecture and attention injection) stabilize training up to 12 loop iterations and improve downstream-task performance by up to 13.2%.

Abstract

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.
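The basic looping idea, reusing one set of weights for a variable number of passes, can be sketched in a few lines. The block below is a toy residual layer standing in for a Transformer block (an assumption for brevity); it illustrates only the plain looped baseline, not the Fully Looped Architecture or Attention Injection proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))  # the single shared weight matrix

def block(x):
    """Toy stand-in for a Transformer block: a residual connection around a nonlinearity."""
    return x + np.tanh(x @ W)

def looped_forward(x, n_loops):
    """Apply the same block n_loops times; the parameter count does not depend on n_loops."""
    for _ in range(n_loops):
        x = block(x)
    return x

x = rng.normal(size=(2, d))
y4 = looped_forward(x, 4)     # cheaper inference
y12 = looped_forward(x, 12)   # more test-time compute, same weights
```

Changing `n_loops` at inference changes the compute spent without touching `W`, which is the performance versus test-time compute trade-off the abstract refers to.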

2605.18746 2026-05-26 cs.CV cs.AI cs.CL cs.LG cs.RO

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

AI summary: Proposes the ESI-BENCH benchmark, which evaluates embodied spatial intelligence in OmniGibson environments through active exploration (perception, locomotion, manipulation). Active exploration substantially outperforms passive methods, failures stem mainly from action blindness rather than weak perception, and models show a metacognitive gap.

Comments: https://esi-bench.github.io/

Abstract

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

2605.18172 2026-05-26 cs.AI

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

Jun-Yu Pan, Yansen Wang, Enze Zhang, Bao-Liang Lu, Wei-Long Zheng, Dongsheng Li

AI summary: Proposes the Generative Visual Grounding (GVG) framework, which uses an EEG-to-image generative model as a visual translator to give multimodal large language models structured visual context, improving understanding of non-visual EEG and clinical-state interpretation.

Abstract

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.

2605.17937 2026-05-26 cs.CL cs.AI

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

AI summary: Proposes BacktestBench, the first large-scale benchmark for automated quantitative backtesting, with 18,246 question-answering pairs, and designs the multi-agent baseline AutoBacktest, which coordinates a Summarizer, a Retriever, and a Coder to turn natural-language strategies into reproducible backtests.

Comments: This paper has been accepted by KDD 2026 (Datasets and Benchmarks Track)

Abstract

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.

2605.17730 2026-05-26 cs.LG cs.AI

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

Fan Zhang, Shijun Chen, Hua Wang

AI summary: To address the response lag of the Direct-Mapping paradigm at turning points caused by distribution shifts and regime changes, proposes the L-Drive framework, which introduces a latent context to characterize high-level dynamics and uses gating to modulate increment representations, improving adaptation to changing segments; patch-shared relative positional basis functions strengthen intra-segment structural modeling, giving a better balance between forecasting accuracy and computational efficiency.

Abstract

Mainstream methods for multivariate time-series forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from history to the future in the observation space to fit value-level dependencies. However, real-world systems often undergo distribution shifts and regime changes. In such cases, a unified mapping can exhibit response lag around turning points, causing error accumulation within the switching window and reducing forecasting reliability. To address this issue, we propose L-Drive, a change-aware forecasting framework. L-Drive introduces a Latent-Context to explicitly characterize high-level dynamics evolving over time and uses gating to modulate increment representations. This provides more timely change cues and improves adaptation to changing segments. In addition, it incorporates patch-shared relative positional basis functions to strengthen intra-segment structural modeling and reduce overfitting caused by absolute-position memorization. Extensive experiments validate the effectiveness of L-Drive and show a better overall trade-off between forecasting accuracy and computational efficiency.

2605.17537 2026-05-26 cs.AI

Self-supervised Hierarchical Visual Reasoning with World Model

Yuanfei Xu, Lin Liu, Wengang Zhou, Mingxiao Feng, Houqiang Li

AI summary: Proposes ResDreamer, a hierarchical world model that learns residual representations in a self-supervised manner for efficient visual reasoning, achieving state-of-the-art sample and parameter efficiency in 3D open-world environments.

Abstract

3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments. The code is accessible at https://github.com/XuYuanFei01/ResDreamer.

2605.17287 2026-05-26 cs.CV

LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation

Jun Ma, Zhenye Yang, Ruichen Zhou, Pei Zhang, Huan Li, Jinpeng Chen

AI summary: Proposes the LISA framework, which combines frequency-domain priors with vision-language knowledge and, through a dual-domain fusion mechanism and a training-time disentanglement strategy, delivers robust driver gaze estimation with state-of-the-art performance under occlusion and lighting variation.

Comments: 9 pages, 5 figures, 3 tables

Abstract

Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a Language-guided Interference-aware Spatial-Frequency Attention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder and orthogonal regularization, we explicitly separate gaze features from appearance interference. Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness against occlusions and lighting variations. The code repository is available at https://github.com/Mason-bupt/LISA.
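The observation that the amplitude spectrum is stable under spatial perturbations has a clean idealized case: a circular shift of an image changes only the phase of its Fourier transform, never the amplitude. The sketch below checks this with NumPy on a random array; the circular shift is an assumption chosen because it makes the invariance exact, whereas real occlusion or lighting changes would only leave the amplitude approximately stable.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((16, 16))                          # stand-in for a grayscale image
shifted = np.roll(img, shift=(3, 5), axis=(0, 1))   # circular spatial shift

# Amplitude spectra of the original and the shifted image.
amp = np.abs(np.fft.fft2(img))
amp_shifted = np.abs(np.fft.fft2(shifted))

amplitude_unchanged = np.allclose(amp, amp_shifted)
pixels_unchanged = np.allclose(img, shifted)
```

The pixel arrays differ while the amplitude spectra match, which is why frequency-domain amplitude can serve as a more stable cue than raw spatial features.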

2605.17234 2026-05-26 cs.LG

Active Budget Allocation for Efficient Scaling Law Estimation via Surrogate-Guided Pruning

Viktoria Schram, Markus Hiller, Daniel Beck, Trevor Cohn

AI summary: Proposes an active budget allocation method that combines Successive Halving with parametric and non-parametric surrogate models, obtaining a better loss-compute frontier and accurate scaling-law estimates while cutting compute cost substantially (savings of up to 98.7%).

Comments: Accepted at ICML 2026

Abstract

Predicting model performance at larger scales enables the design of training strategies and architectures tailored to specific performance targets. Empirical scaling law research identifies functional forms to aid this prediction task. These describe the relationship between loss and compute using a loss-compute frontier defined by learning curves. Due to the empirical nature of this approach, the computational burden is substantial, making strategic resource allocation essential - yet it remains surprisingly underexplored. In this work, we address this shortcoming by exploring the suitability of Successive Halving (SH) and SH combined with parametric and non-parametric surrogate models. In addition to enabling a more systematic allocation of a given compute budget, our findings show that SH paired with surrogate models yields a set of learning curves that includes one with a lower loss-compute value than what naive uniform allocation or an SH-only approach can obtain. Our experiments demonstrate mean relative improvements of up to 2.84% and 5.47% on real-world and synthetic learning curve datasets. This strategic resource allocation enables us to obtain accurate scaling laws at significantly reduced computational costs, saving up to 98.7% over the traditional exhaustive approach.
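Successive Halving, the allocation scheme the paper builds on, can be sketched as follows: train every candidate on a small budget, keep the best fraction, and repeat with a larger budget. The toy learning curve `loss_at`, the candidate configurations, and the cost accounting (each round is charged budget times survivors, with no checkpoint reuse) are illustrative assumptions; the surrogate models the paper adds on top of SH are not shown.

```python
def successive_halving(configs, loss_at, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round while multiplying the budget by eta.
    Returns the surviving config and the total budget spent."""
    survivors = list(configs)
    budget = min_budget
    spent = 0
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: loss_at(c, budget))
        spent += budget * len(survivors)
        survivors = ranked[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], spent

def loss_at(config, budget):
    """Hypothetical power-law learning curve: an irreducible floor plus a decaying term."""
    return config["floor"] + 1.0 / budget

# Eight hypothetical candidates that differ only in their loss floor.
configs = [{"floor": f} for f in (0.8, 0.3, 0.5, 0.1, 0.7, 0.2, 0.6, 0.4)]
best, spent = successive_halving(configs, loss_at)
```

Most of the budget goes to candidates that survive early rounds, so poor candidates are dropped after only a short, cheap run.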

2605.16953 2026-05-26 cs.AI cs.CL

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

Shuqi Zhu, Yi Zhong, Ziyi Ye, Bangde Du, Yujia Zhou, Qingyao Ai, Yiqun Liu

AI summary: Through an EEG experiment, studies differences in human neural dynamics when processing hallucinated versus non-hallucinated content generated by a multimodal large language model, and shows that misjudged hallucinations fail to trigger the standard neurocognitive fact-verification pathway.

Abstract

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.

2605.16023 2026-05-26 cs.CL cs.LG

Judge Circuits

Judge Circuits

Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Selin Kahvecioglu, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

AI总结 本研究利用位置感知边归因修补(PEAP)因果分析Gemma-3、Qwen2.5和Llama-3的内部机制,发现结构化理解和开放式偏好任务中的判断共享一个稀疏、泛化的潜在评估子图,并通过解耦抽象判断与输出格式,揭示了格式诱导不一致性的机制原因。

Comments 39 pages

详情
AI中文摘要

LLM-as-a-judge已成为大规模评估模型输出的主导范式,然而同一模型在其输出格式变化时(例如,1-5评分与真/假标签)会系统地给出不同的分数。现有对这些格式诱导不一致性的诊断停留在输入输出层面。利用位置感知边归因修补(PEAP),我们因果地研究了Gemma-3、Qwen2.5和Llama-3的内部机制。我们发现,跨结构化理解和开放式偏好任务的判断共享一个稀疏、泛化的潜在评估子图,位于中后期多层感知器(MLPs)中;将其零消融会破坏判断,同时保留架构模块化模型中的世界知识。通过结构上解耦抽象判断与输出格式,我们为我们研究的开放权重模型上的格式诱导不一致性提供了机制解释:在共享主干中计算的连续判断信号通过脆弱、格式特定的终端分支映射,使得格式无关的偏好能够在请求的输出格式下游被隔离。我们的发现意味着跨格式的基准级可靠性比较部分测量的是格式化器几何形状而非评估质量。

英文摘要

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

2605.14769 2026-05-26 cs.LG

Composable Crystals: Controllable Materials Discovery via Concept Learning

可组合晶体:通过概念学习实现可控材料发现

Nian Liu, Yuwei Zeng, Ryoji Kubo, Nikita Kazeev, Stephen Gregory Dale, Artem Maevskiy, Pengru Huang, Thomas Laurent, Kostya S. Novoselov, Xavier Bresson

AI总结 提出基于概念组合的晶体生成框架,利用向量量化变分自编码器自动发现可重用晶体概念,通过概念重组实现可控的新晶体探索,在MP-20和Alex-MP-20上V.S.U.N指标提升最高53.2%和51.7%。

详情
AI中文摘要

从头晶体生成是材料发现中的核心任务,旨在生成同时有效、稳定、独特且新颖的晶体。现有方法主要依赖黑盒随机采样,对生成结构如何超越观测分布的控制有限。本文提出了一种基于概念的组合式晶体生成框架。我们训练了一个向量量化变分自编码器,自动发现一组可重用的晶体概念,这些概念作为引导生成的构建块。这些学习到的概念在局部原子环境和全局对称模式上自然表现出可解释性,并能泛化到不同分布的晶体。通过重组这些概念,我们的框架能够可控地探索训练分布之外的新颖晶体,而非仅依赖无约束的随机采样。为进一步提高组合效率,我们引入了一个组合生成器,并使用模型自身生成的高质量样本对其进行迭代优化。最终的概念组合用于条件化下游晶体生成。在MP-20和Alex-MP-20上的数值实验表明,分别组合概念使基础模型在V.S.U.N指标上提升高达53.2%和51.7%,尤其在新颖性方面增益显著。

英文摘要

De novo crystal generation, a central task in materials discovery, aims to generate crystals that are simultaneously valid, stable, unique, and novel. Existing methods mainly rely on black-box stochastic sampling, providing limited control over how generated structures move beyond the observed distribution. In this paper, we introduce a concept-based compositional framework for crystal generation. We train a vector-quantized variational autoencoder to automatically discover a shared set of reusable crystal concepts, which serve as building blocks for guided generation. These learned concepts are naturally interpretable in terms of both local atomic environments and global symmetry patterns, and generalize to crystals from different distributions. By recombining such concepts, our framework enables controllable exploration of novel crystals beyond the training distribution, rather than relying solely on unconstrained random sampling. To further improve composition efficiency, we introduce a composition generator and iteratively refine it using high-quality samples generated by the model itself. The resulting concept compositions are then used to condition downstream crystal generation. Numerical experiments on MP-20 and Alex-MP-20 show that composing concepts improves the base model by up to 53.2% and 51.7%, respectively, on the V.S.U.N. metric, with particular gains in novelty.

2605.14759 2026-05-26 cs.LG

Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement

Crys-JEPA:通过嵌入筛选和生成精炼加速晶体发现

Nian Liu, Nikita Kazeev, Stephen Gregory Dale, Artem Maevskiy, Yuwei Zeng, Ryoji Kubo, Pengru Huang, Thomas Laurent, Yann LeCun, Kostya S. Novoselov, Xavier Bresson

AI总结 提出Crys-JEPA联合嵌入预测架构,通过能量感知的潜在空间和筛选-精炼流程,解决晶体生成中稳定性和新颖性的冲突,在MP-20和Alex-MP-20数据集上V.S.U.N.指标分别提升53.8%和72.7%。

详情
AI中文摘要

从头晶体生成旨在发现不仅真实而且稳定和新颖的材料。然而,大多数现有生成模型被训练为最大化观测晶体的似然,这鼓励样本接近已知材料,但不一定与发现中重要的标准一致。我们的实证分析表明,当前晶体生成模型在稳定性和新颖性之间存在明显冲突:接近观测分布的样本倾向于保持稳定性但提供有限的新颖性,而远离分布的样本通常迅速失去稳定性。这表明发现既稳定又新颖晶体的有用区域极其狭窄。为了突破这一限制,我们引入了Crys-JEPA,一种用于晶体的联合嵌入预测架构,它学习一个能量感知的潜在空间,保留形成能差异。在这个空间中,稳定性评估可以重新表述为基于嵌入的与可访问训练晶体的比较,减少了对昂贵能量评估和特定任务外部参考的依赖。基于Crys-JEPA,我们进一步开发了一个筛选-精炼流程,识别有前景的生成晶体并重新引入它们以精炼生成模型。在MP-20和Alex-MP-20数据集上,我们在V.S.U.N.指标上分别比基线提升了53.8%和72.7%。

英文摘要

De novo crystal generation seeks to discover materials that are not merely realistic, but also stable and novel. However, most existing generative models are trained to maximize the likelihood of observed crystals, which encourages samples to stay close to known materials but does not necessarily align with the criteria that matter in discovery. Our empirical analysis shows that current crystal generative models exhibit a clear conflict between stability and novelty: samples near the observed distribution tend to retain stability but offer limited novelty, whereas samples farther from it often lose stability rapidly. This suggests that the useful region for discovering crystals that are both stable and novel is extremely narrow. To move beyond this limitation, we introduce Crys-JEPA, a joint embedding predictive architecture for crystals that learns an energy-aware latent space preserving formation-energy differences. In this space, stability assessment can be reformulated as an embedding-based comparison against accessible training crystals, reducing the reliance on expensive energy evaluation and task-specific external references. Building on Crys-JEPA, we further develop a screening-and-refinement pipeline that identifies promising generated crystals and reintroduces them to refine the generative model. On the MP-20 and Alex-MP-20 datasets, we achieve improvements over baselines of up to 53.8% and 72.7%, respectively, on the V.S.U.N. metric.

2605.13643 2026-05-26 cs.CL

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

前缀教导,后缀消退:强到弱在线策略蒸馏中的局部可教性崩溃

Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, Jieping Ye

AI总结 本文发现强到弱在线策略蒸馏中,教师反馈在生成轨迹的后缀部分缺乏局部对比度,导致“局部可教性崩溃”,并提出基于变化点检测的截断规则来优化监督区域,实验表明该方法优于全轨迹蒸馏。

详情
AI中文摘要

在线策略蒸馏(OPD)利用来自更强教师的密集反馈,在学生模型自身的生成轨迹上训练学生模型。先前文献表明,只要教师反馈可用,监督完整响应令牌序列应能单调提升性能。然而,我们证明这一假设在强到弱OPD设置中有时不成立。虽然生成轨迹的后缀部分可能仍存在非零的师生优势,但它们通常缺乏使密集反馈有效优先学生学习所需的局部对比度。我们将这种失败模式称为局部可教性崩溃。由此得出的原则很简单:监督应集中在教师反馈仍具有判别性的轨迹区域,而非均匀覆盖整个响应。我们通过一种轨迹特定的释放规则来操作这一原则。该规则测量教师相对于学生前K个候选集的边际,将该边际在NLTK分词的句子片段上聚合,并在检测到BIC风格的下行变化点时截断密集OPD监督。使用Qwen3模型系列在强到弱蒸馏任务上的实验结果表明,该释放规则在不同学生规模下的五个域内基准上始终优于标准全轨迹OPD。此外,与基线蒸馏方法相比,我们的方法在域外任务上更好地保留了模型能力。这些结果表明,有效的强到弱OPD需要评估教师指导的可用性及其局部效用,确保生成的反馈保持可教性。

英文摘要

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
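
The final step of the release rule, detecting a downward change point in per-segment margins, can be sketched as follows. This is an illustrative sketch only: the abstract does not give the exact criterion, so the piecewise-constant Gaussian model, the parameter counts used in the BIC penalty, and the names `downward_change_point` and `min_seg` are assumptions made for this example.

```python
import math
from typing import List, Optional


def _sse(xs: List[float]) -> float:
    """Sum of squared deviations from the mean."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)


def _bic(rss: float, n: int, k: int) -> float:
    """Gaussian BIC up to an additive constant: n*log(rss/n) + k*log(n)."""
    return n * math.log(max(rss, 1e-12) / n) + k * math.log(n)


def downward_change_point(margins: List[float], min_seg: int = 2) -> Optional[int]:
    """Return the index where the segment-level margin drops, or None.

    Compares a one-mean model against every two-mean split and keeps the
    split with the lowest BIC, considering only splits whose right-hand
    mean is lower than the left-hand mean (a downward change).
    """
    n = len(margins)
    if n < 2 * min_seg:
        return None
    best_bic = _bic(_sse(margins), n, 1)  # baseline: a single mean
    best_idx = None
    for i in range(min_seg, n - min_seg + 1):
        left, right = margins[:i], margins[i:]
        if sum(right) / len(right) >= sum(left) / len(left):
            continue  # not a downward shift
        cand = _bic(_sse(left) + _sse(right), n, 3)  # two means + location
        if cand < best_bic:
            best_bic, best_idx = cand, i
    return best_idx
```

Under this sketch, dense supervision would be kept for the segments before the returned index and released afterwards. A `None` result means the whole trajectory stays supervised.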

2605.13438 2026-05-26 cs.AI cs.CL

CogniFold: Always-On Proactive Memory via Cognitive Folding

CogniFold: 通过认知折叠实现始终在线的主动记忆

Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Xinliang Zhou

AI总结 提出CogniFold,一种受大脑启发的主动记忆系统,通过将互补学习系统扩展为三层(海马体、新皮层、前额叶意图层)并利用图拓扑自组织,实现事件流的持续认知结构涌现,在认知评估和常规记忆基准上均表现优异。

Comments Code is available at https://github.com/OpenNorve/CogniFold

详情
AI中文摘要

现有的智能体记忆主要仍是被动反应式和基于检索的,缺乏自主将经验组织成持久认知结构的能力。为了迈向真正自主的智能体,我们引入了CogniFold,一种受大脑启发的“始终在线”智能体记忆,专为下一代主动助手设计。CogniFold持续将碎片化事件流折叠成自涌现的认知结构,从传入事件和积累的知识中逐步引导出更高层次的认知。我们通过将互补学习系统(CLS)理论从两层(海马体、新皮层)扩展到三层,增加了一个前额叶意图层来奠定基础。模仿前额叶皮层作为意图控制和决策制定的中心,CogniFold通过图拓扑自组织实现这一点:认知结构在事件流下主动组装,语义相似时合并,过时时衰减,通过联想回忆重新链接,并在概念簇密度超过阈值时浮现意图。我们使用CogEval-Bench评估结构形成,证明CogniFold独特地产生了符合认知期望和概念涌现的记忆结构。此外,在跨越五个认知领域的7个广泛覆盖的基准测试中,我们验证了CogniFold在常规记忆基准上同时表现出稳健的性能。

英文摘要

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across 7 broad-coverage benchmarks spanning five cognitive domains, we validate that CogniFold simultaneously performs robustly on conventional memory benchmarks.

2605.12964 2026-05-26 cs.CV

Asymmetric Flow Models

非对称流模型

Hansheng Chen, Jan Ackermann, Minseo Kim, Gordon Wetzstein, Leonidas Guibas

AI总结 提出非对称流建模(AsymFlow),通过秩非对称速度参数化将噪声预测限制在低秩子空间,同时保持数据预测全维,从而在高维空间中实现高效的流生成,在ImageNet 256×256上取得领先的1.57 FID,并首次提供将预训练潜在流模型微调为像素空间模型的途径。

Comments Code: https://github.com/Lakonik/LakonLab Webpage: https://hanshengchen.com/asymflow

详情
AI中文摘要

高维空间中的基于流的生成是困难的,因为即使数据具有强低秩结构,速度预测也需要建模高维噪声。我们提出非对称流建模(AsymFlow),一种秩非对称速度参数化,将噪声预测限制在低秩子空间,同时保持数据预测全维。通过这种非对称预测,AsymFlow在不改变网络架构或训练/采样过程的情况下,解析地恢复全维速度。在ImageNet 256×256上,AsymFlow取得了领先的1.57 FID,大幅优于先前的DiT/JiT类像素扩散模型。AsymFlow还首次提供了将预训练潜在流模型微调为像素空间模型的途径:将低秩像素子空间与潜在空间对齐,得到无缝初始化,保留潜在模型的高级语义和结构,因此微调主要改善低级不匹配,而非重新学习像素生成。我们展示了从FLUX.2 klein 9B微调得到的像素AsymFlow模型在像素空间文本到图像生成中建立了新的最先进水平,在HPSv3、DPG-Bench和GenEval上击败了其潜在基础模型,并在定性上显示出显著改善的视觉真实感。

英文摘要

Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256$\times$256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.

2605.12906 2026-05-26 cs.LG cs.AI

Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

数据难度与LLM微调中的泛化-外推权衡

Siyuan Liu, Tinghong Chen, Xinghan Li, Yifei Wang, Jingzhao Zhang

AI总结 本文通过实证和理论分析,研究了监督微调中数据难度对模型行为的影响,发现数据难度与数据量共同决定泛化与外推之间的权衡,并存在最优难度随数据量增加而向更难数据偏移的规律。

Comments Accepted to ICML 2026

详情
AI中文摘要

监督微调(SFT)期间的数据选择可以显著改变大型语言模型(LLMs)的行为。尽管已有工作研究了基于困惑度、难度或长度等启发式方法选择数据的效果,但报告的结果往往不一致或依赖于上下文。在这项工作中,我们从实证和理论角度系统地研究了数据难度在微调中的作用,并发现不存在普遍最优的难度水平;相反,其有效性取决于数据集大小。我们表明,对于固定的数据预算,SFT存在一个最优的数据难度,并且随着数据预算的增加,该最优难度向更难的数据偏移。为了解释这一现象,我们进行了受控的合成实验,揭示了一个简单的底层机制:分布内泛化差距与外推差距之间的相互作用。我们通过使用PAC-Bayesian泛化界限的理论分析进一步支持了这一机制。总的来说,我们的结果阐明了数据大小和难度如何共同影响SFT中泛化与外推之间的权衡,为在特定模型和数据条件下基于难度的数据选择提供了指导。

英文摘要

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

2605.12649 2026-05-26 cs.CV

DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery

DIVER: 通过表达性语义恢复深入挖掘蒸馏数据

Qianxin Xia, Zhiyong Shu, Wenbo Jiang, Jiawei Du, Jielei Wang, Guoming Lu

AI总结 提出双阶段蒸馏框架DIVER,利用预训练扩散模型通过语义继承、引导和融合恢复蒸馏数据的表达性语义,提升跨架构泛化能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

数据集蒸馏旨在从原始数据集中合成一个紧凑的代理数据集,该数据集不可读或非原始,以保护隐私并实现高效学习。然而,先前的方法通常采用单阶段蒸馏范式,该范式会学习过度适应先验架构的特定模式,从而抑制语义表达并导致跨异构架构的性能下降。为了解决这个问题,我们提出了一种新颖的双阶段蒸馏框架,称为DIVER,它利用预训练的扩散模型通过表达性语义恢复深入挖掘蒸馏数据,整个过程包括语义继承、引导和融合。语义继承将抽象蒸馏图像的高级语义蒸馏到潜在空间中,以过滤掉架构特定的“噪声”并保留内在语义。此外,语义引导通过指导反向过程来改善原始语义的保留。最后,语义融合被设计为仅在反向过程的具体阶段提供语义引导,防止语义模糊和伪影,同时保持引导信息。大量实验验证了DIVER在改进经典蒸馏技术和显著提升跨架构泛化方面的有效性和效率,在ImageNet(256×256)上仅需与原始DiT相当的处理时间,且仅使用4 GB GPU内存。

英文摘要

Dataset distillation aims to synthesize, from the original dataset, a compact proxy dataset that is unreadable or non-raw, for privacy protection and highly efficient learning. However, previous approaches typically adopt a single-stage distillation paradigm, which tends to learn specific patterns that overfit to a prior architecture, consequently suppressing the expression of semantics and leading to performance degradation across heterogeneous architectures. To address this issue, we propose a novel dual-stage distillation framework called DIVER, which leverages a pre-trained diffusion model to dive deeper into DIstilled data Via Expressive semantic Recovery, an entire process of semantic inheritance, guidance, and fusion. Semantic inheritance distills high-level semantics of abstract distilled images into the latent space to filter out architecture-specific "noise" and retain the intrinsic semantics. Furthermore, semantic guidance improves the preservation of the original semantics by directing the reverse procedure. Finally, semantic fusion is designed to provide semantic guidance only during the concrete phase of the reverse process, preventing semantic ambiguity and artifacts while maintaining the guidance information. Extensive experiments validate the effectiveness and efficiency of DIVER in improving classical distillation techniques and significantly improving cross-architecture generalization, requiring processing time comparable to raw DiT on ImageNet (256$\times$256) with only 4 GB of GPU memory usage.

2605.12374 2026-05-26 cs.CV cs.AI cs.LG

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

填补GAP:多模态大语言模型中视觉推理的粒度对齐范式

Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

AI总结 提出GAP(粒度对齐范式),通过特征级、上下文级和能力引导级对齐,解决多模态大语言模型中视觉潜在推理的特征空间不匹配问题,提升感知与推理性能。

详情
AI中文摘要

视觉潜在推理让多模态大语言模型(MLLM)以连续令牌形式创建中间视觉证据,避免外部工具或图像生成器。然而,现有方法通常遵循输出即输入的潜在范式,产生不稳定的收益。我们识别出特征空间不匹配是导致这种不稳定的证据:主流的视觉潜在模型建立在预归一化MLLM上,重用解码器隐藏状态作为预测的潜在输入,尽管这些状态与模型训练时消耗的输入嵌入处于截然不同的范数范围(Xie et al., 2025; Li et al., 2026; Team et al., 2026)。这种不匹配可能使直接潜在反馈不可靠。受此诊断启发,我们提出GAP,一种用于视觉潜在建模的粒度对齐范式。GAP在三个层面对齐视觉潜在推理:特征级对齐通过轻量级PCA对齐潜在头将解码器输出映射为输入兼容的视觉潜在;上下文级对齐通过可检查的辅助视觉监督锚定潜在目标;能力引导对齐选择性地将潜在监督分配给基础MLLM难以处理的示例。在Qwen2.5-VL 7B上,所得模型在我们监督变体中实现了最佳平均聚合感知和推理性能。推理时干预探测进一步表明,生成的潜在提供了任务相关的视觉信号,而不仅仅是增加令牌槽位。

英文摘要

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (Xie et al., 2025; Li et al., 2026; Team et al., 2026). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

2605.10913 2026-05-26 cs.AI cs.PL cs.SE

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Shepherd: 一个为元代理提供形式化执行迹的运行时基座

Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D Manning, Weiyan Shi

AI总结 提出Shepherd,一个基于函数式编程的Python运行时基座,将代理执行作为一等对象,通过类似Git的执行迹支持元代理的检查、分叉和重放,在三个用例中显著提升性能。

Comments 50 pages, 22 figures, 14 tables

详情
AI中文摘要

随着LLM代理系统承担更复杂的任务,它们越来越依赖元代理:对其他代理进行操作的高阶代理,就像管理者监督员工一样。无论元代理做什么:协调代理、在执行前停止风险动作、或修复失败的运行,都需要在运行时操纵代理执行。现有的代理基座使得这变得困难:它们只给元代理提供纯文本记录和环境快照,要求元代理构建自己的工具来重建和编排执行状态。因此,我们引入了Shepherd,一个基于函数式编程原则的Python基座,其中代理的执行本身是一个一等对象,元代理可以检查和转换它。每个模型调用、工具调用和环境变化都成为类似Git的执行迹中的一个结构化事件,任何过去的状态都可以被分叉(比docker commit快5倍)并重放。三个示例用例展示了Shepherd的多功能性:(1)一个监督代理防止并行编码代理之间的冲突,将CooperBench的性能从28.8%提升到54.7%;(2)一个反事实优化器通过提出编辑并从行为改变点重放运行来修复代理工作流,在TerminalBench-2上比MetaHarness低58%的挂钟时间;(3)一个元代理在展开期间选择分叉点以改进长程代理强化学习中的信用分配,在TerminalBench-2上将GRPO的增益翻倍。我们开源Shepherd,以通过原则性和高效的代理执行操作赋能未来的元代理。

英文摘要

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, much as managers supervise employees. Whatever a meta-agent does, whether coordinating agents, halting risky actions before execution, or repairing failed runs, it must manipulate agentic execution at runtime. Existing agentic substrates make this hard: they give meta-agents only plain transcripts and environment snapshots, requiring them to build their own tooling to reconstruct and orchestrate execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git-like execution trace, where any past state can be forked (5x faster than docker commit) and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on TerminalBench-2 with 58% lower wall-clock time; (3) a meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's gains on TerminalBench-2. We open-source Shepherd to empower future meta-agents with principled and efficient operations over agentic execution.
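
To make the idea of a Git-like execution trace concrete, the sketch below models a trace as an immutable, parent-linked chain of events, so that forking from any past state is just appending to an older node. This is a hypothetical illustration and not Shepherd's API: `Event`, `append`, `history`, and `replay` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One structured event; `parent` links to the previous event, like a commit."""
    kind: str  # e.g. "model_call", "tool_call", "env_change"
    payload: Any
    parent: Optional["Event"] = None


def append(head: Optional[Event], kind: str, payload: Any) -> Event:
    """Return a new head; the existing chain is never mutated."""
    return Event(kind, payload, head)


def history(head: Optional[Event]) -> Tuple[Event, ...]:
    """Walk parent links and return events from oldest to newest."""
    out = []
    while head is not None:
        out.append(head)
        head = head.parent
    return tuple(reversed(out))


def replay(head: Optional[Event], step: Callable[[Any, Event], Any], state: Any) -> Any:
    """Rebuild a state by folding `step` over the trace from the start."""
    for event in history(head):
        state = step(state, event)
    return state


# Usage: two branches share the prefix root -> a, then diverge.
root = append(None, "env_change", 1)
a = append(root, "tool_call", 2)
main = append(a, "tool_call", 3)
fork = append(a, "tool_call", 10)  # fork from the earlier state `a`
```

Because events are immutable and branches share their common prefix, a fork in this sketch copies nothing. The design follows the functional-programming framing in the abstract, but it says nothing about how Shepherd itself snapshots environments.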

2605.09270 2026-05-26 cs.LG cs.AI

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

记忆定理而非实例:通过数学推理探究SFT泛化

Ruiying Peng, Mengyu Yang, Jing Lei, Xiaohui Li, Xueyu Wu, Xinlei Chen

AI总结 针对监督微调(SFT)损害推理泛化的问题,提出Theorem-SFT方法,通过显式定理应用训练,在多个基准上取得显著提升,并揭示前馈层是推理规则的主要存储位置。

详情
AI中文摘要

监督微调(SFT)广泛用于任务特定适配,但近期工作表明它会系统性地削弱推理泛化。我们认为根本原因不在于记忆本身,而在于其目标:标准SFT驱动模型利用并记忆问题-答案对中的虚假表面相关性,使其对表面输入变化脆弱。为解决此问题,我们提出Theorem-SFT,通过教授模型规则如何被调用而非答案看起来像什么,将监督重新导向显式定理应用。Theorem-SFT在多个基准和模型家族上取得一致提升:在MATH上(LLaMA3.2-3B-Instruct)提升8.8%,在GeoQA上(Qwen2.5-VL-7B-Instruct)提升20.27%,无需特定模态的重新训练。仅微调MLP层即可达到全层性能,表明前馈组件是推理规则的主要存储位置。我们的发现重新定义了争论:泛化失败并非源于记忆机制本身,而是源于记忆了错误的归纳目标。

英文摘要

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches full-layer fine-tuning performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.

2605.09223 2026-05-26 cs.CV

CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding

CREST: 曲率调节的事件中心采样用于高效长视频理解

Mehrajul Abadin Miraj, Abdul Mohaimen Al Radi, Shariful Islam Rayhan, Md. Tanvir Alam, Ismat Rahman, Yu Tian, Md Mosaddek Khan

AI总结 提出一种无训练帧选择方法CREST,利用查询-帧相关性的时间几何(局部曲率)来指导采样,在固定预算下实现高效长视频理解。

详情
AI中文摘要

从长视频中选择信息帧是一个组合问题,现有方法要么通过高效启发式方法处理,但未显式建模查询条件的时间结构,要么通过多阶段检索流水线处理,但预处理成本高。我们提出CREST,一种基于查询-帧相关性的时间几何的无训练帧选择方法。CREST基于观察:相关性随时间表现出结构化的局部变化——显著事件周围曲率陡峭,冗余段区域平坦。通过使用局部曲率指导选择,CREST在短暂决定性事件和缓慢演变的证据之间更有效地分配固定帧预算。在固定主干网络和帧预算下,CREST在LongVideoBench和VideoMME上比轻量级相关性-覆盖基线AKS获得更高准确率,同时保留了更强多阶段检索流水线MIRA的93-95%准确率,而预处理成本仅为后者的3-4%(代码和实现细节包含在补充材料中,将在接收后公开发布)。在时间帧选择的诊断基准TempRel上,CREST比AKS相对提高6.88%。成对LLM-as-a-judge评估进一步表明,CREST选择的帧产生更连贯的帧条件描述,在两个基准上胜率分别为60.58%和54.50%。这些结果表明,局部时间几何为长视频帧选择提供了简单高效的基础。

英文摘要

Selecting informative frames from long videos is a combinatorial problem that existing methods address either through efficient heuristics without explicit modeling of query-conditioned temporal structure, or through multi-stage retrieval pipelines with substantial preprocessing cost. We propose CREST, a training-free frame selection method grounded in the temporal geometry of query-frame relevance. CREST is based on the observation that relevance over time exhibits structured local variation: sharp curvature around salient events and flatter regions in redundant segments. By using local curvature to guide selection, CREST allocates a fixed frame budget more effectively across brief decisive events and slowly evolving evidence. Under a fixed backbone and frame budget, CREST achieves higher accuracy than AKS, a lightweight relevance-coverage baseline, on LongVideoBench and VideoMME, while retaining 93-95% of the accuracy of MIRA, a stronger multi-stage retrieval pipeline, at only 3-4% of its preprocessing cost. (Code and implementation details are included in the supplementary material and will be released publicly upon acceptance.) On TempRel, our diagnostic benchmark for temporal frame selection, CREST achieves a 6.88% relative improvement over AKS. Pairwise LLM-as-a-judge evaluation further shows that CREST-selected frames yield more coherent frame-conditioned descriptions, with win rates of 60.58% and 54.50% on the two benchmarks. These results show that local temporal geometry provides a simple and efficient basis for long-video frame selection.
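
The intuition that curvature of the relevance signal marks salient events can be illustrated with a discrete second difference. The sketch below is an assumption-laden toy, not CREST's selection rule: the scoring formula `relevance + alpha * |curvature|`, the edge padding, and the names `curvature_guided_selection` and `alpha` are all hypothetical.

```python
import numpy as np


def curvature_guided_selection(relevance, budget, alpha=1.0):
    """Pick `budget` frame indices, favoring high relevance and high local curvature.

    Curvature is approximated by the absolute discrete second difference
    r[t-1] - 2*r[t] + r[t+1], with the sequence edge-padded at both ends.
    """
    r = np.asarray(relevance, dtype=float)
    padded = np.pad(r, 1, mode="edge")
    curvature = np.abs(padded[:-2] - 2.0 * padded[1:-1] + padded[2:])
    score = r + alpha * curvature  # hypothetical combination of the two signals
    top = np.argsort(-score, kind="stable")[:budget]
    return sorted(int(i) for i in top)


# Usage: a single sharp relevance spike at frame 3 among flat, redundant frames.
frames = curvature_guided_selection(
    [0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1], budget=3
)
```

In this toy case the budget concentrates on the spike and its two neighbors (frames 2, 3, and 4), while the flat region receives no frames. A fuller rule would presumably also reserve part of the budget for slowly evolving evidence, which this sketch does not attempt.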

2605.07607 2026-05-26 cs.CV

FS-I2P: A Hierarchical Focus-Sweep Registration Network with Dynamically Allocated Depth

FS-I2P:一种具有动态分配深度的分层聚焦扫描配准网络

Zhixin Cheng, Yujia Chen, Xujing Tao, Bohao Liao, Xiaotian Yin, Baoqun Yin, Tianzhu Zhang

AI总结 提出一种基于聚焦-扫描范式的分层交互模块和动态层分配策略,用于解决图像到点云配准中的尺度模糊和注意力漂移问题,在RGB-D Scenes V2和7-Scenes数据集上达到最优性能。

详情
AI中文摘要

图像到点云的配准常常受到视角变化、跨模态差异和重复纹理的挑战,这些因素会导致尺度模糊,进而产生错误的对应关系。最近的无检测方法通过利用多尺度特征和基于Transformer的交互来缓解这一问题。然而,它们仍然存在跨层的注意力漂移和层内不一致性,阻碍了精确配准。受人类行为启发,我们提出了一种“聚焦-扫描”范式,并在基于SSM的框架内开发了分层聚焦-扫描交互模块,以增强多层次跨模态特征关联。此外,我们引入了一种动态层分配策略,自适应地确定迭代深度,以更好地利用几何约束并提高匹配鲁棒性。在两个基准数据集RGB-D Scenes V2和7-Scenes上的大量实验和消融研究表明,我们的方法达到了最先进的性能。

英文摘要

Image-to-point cloud registration is often challenged by viewpoint changes, cross-modal discrepancies, and repetitive textures, which induce scale ambiguity and consequently lead to erroneous correspondences. Recent detection-free methods alleviate this issue by leveraging multi-scale features and transformer-based interactions. However, they still suffer from attention drift across layers and intra-scale inconsistencies, hindering precise registration. Inspired by human behavior, we propose a "Focus-Sweep" paradigm and develop a Hierarchical Focus-Sweep Interaction Module within an SSM-based framework to enhance multi-level cross-modal feature association. In addition, we introduce a Dynamic Layer Allocation Strategy that adaptively determines the iteration depth to better exploit geometric constraints and improve matching robustness. Extensive experiments and ablations on two benchmarks, RGB-D Scenes V2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance.

2605.07233 2026-05-26 cs.LG cs.CR stat.ML

Modulated learning for private and distributed regression with just a single sample per client device

调制学习:每个客户端设备仅有一个样本的私有分布式回归

Praneeth Vepakomma, Amirhossein Reisizadeh, Samuel Horváth, Munther A. Dahleh

AI总结 针对每个客户端仅有一个样本的分布式学习场景,提出一种通过注入校准噪声并共享后处理表示来实现隐私保护的全局模型学习方法,在期望上匹配非私有中心化梯度更新。

Comments 30 pages

详情
AI中文摘要

本文聚焦于从大量设备中学习的问题,每个设备仅持有一个数据样本。这种每客户端一个样本的设置存在于多个实际应用中,包括从健身追踪器、数据/应用使用聚合器、可穿戴传感设备和日常事件监测器等学习。当客户端只有一个样本时,标准的联邦学习范式会失效,因为基于单个点的局部更新远非有用,尤其是在模型系数估计的早期轮次中。这种效用进一步被每轮添加的隐私诱导噪声削弱。本文针对这一问题,使此类客户端能够协作贡献,有效学习全局模型,同时不泄露其数据隐私。所提出的方法在每个客户端注入一个精心校准的噪声扰动来变换样本,然后共享经过后处理的表示给服务器。服务器聚合这些表示,处理得到无偏梯度更新,该更新在期望上匹配非私有中心化梯度,同时保护数据隐私。这种方法不同于传统的私有联邦学习,其中通信负载涉及模型系数而非私有变换的数据样本。该方法使数据极其有限的设备能够协作学习准确、保护隐私的模型,无需大量本地数据集或牺牲个体隐私。

英文摘要

This work focuses on the question of learning from a large number of devices, each holding only a single sample of data. Several real-world applications fit this one-sample-per-client setup, including learning from fitness trackers, data/app usage aggregators, body-worn sensing devices, and daily event monitors, to name a few. When a client has only one sample, the standard federated learning paradigm breaks down, as a local update based on that single point is far from useful, especially in the earlier rounds of estimating the model coefficients. This utility is further weakened by the privacy-inducing noise applied at every round. This work addresses this problem, enabling such clients to collaboratively and effectively learn a global model without leaking the privacy of their data. The proposed approach injects a single, carefully calibrated noisy perturbation to transform the sample at each client, followed by a post-processed representation which is shared with the server. These representations, aggregated at the server, are processed to obtain an unbiased gradient update that in expectation matches the non-private centralized gradient while preserving data privacy. This approach is different from traditional private federated learning, where the communication payloads involve model coefficients as opposed to privately transformed data samples. This method enables devices with extremely limited data to collaborate and learn accurate, privacy-preserving models without requiring large local datasets or sacrificing individual privacy.
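
One way a server can recover an unbiased gradient from once-perturbed samples is shown below for linear regression with squared loss. This is a generic errors-in-variables style sketch and only an assumption about how such debiasing might work; it is not the paper's mechanism and carries no formal privacy guarantee. The names `client_message`, `server_gradient`, and `sigma` are hypothetical.

```python
import numpy as np


def client_message(x, y, sigma, rng):
    """Perturb the single local sample once, then share second-moment statistics."""
    x_t = x + rng.normal(0.0, sigma, size=x.shape)  # noisy features
    y_t = y + rng.normal(0.0, sigma)                # noisy label
    return np.outer(x_t, x_t), x_t * y_t            # post-processed representation


def server_gradient(messages, w, sigma):
    """Aggregate client statistics into an unbiased squared-loss gradient.

    Since E[x_t x_t^T] = x x^T + sigma^2 I and E[x_t y_t] = x y for independent
    zero-mean noise, subtracting sigma^2 I removes the bias in expectation.
    """
    d = w.shape[0]
    a = np.mean([m[0] for m in messages], axis=0) - sigma**2 * np.eye(d)
    b = np.mean([m[1] for m in messages], axis=0)
    return a @ w - b  # estimate of mean(x x^T) w - mean(x y)
```

The server never sees raw samples in this sketch, and the noise is injected once per client rather than at every round. The accuracy of the aggregate improves as the number of clients grows.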

2605.04906 2026-05-26 cs.AI

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

Strat-Reasoner:在多智能体游戏中增强大语言模型的战略推理能力

Yidong He, Yutao Lai, Pengxu Yang, Jiarui Gan, Jiexin Wang, Yi Cai, Mengchen Zhao

AI总结 提出Strat-Reasoner框架,通过递归推理范式和集中式思维链比较模块,结合混合优势与组相对强化学习,提升大语言模型在多智能体游戏中的战略推理能力。

AI中文摘要

虽然大语言模型(LLMs)在某些推理任务中表现出色,但在最终结果取决于所有智能体联合策略的多智能体游戏中,它们却难以应对。在多智能体游戏中,其他智能体的非平稳性给推理过程的评估和多个推理步骤上的信用分配带来了重大挑战。现有的单智能体强化学习(RL)方法及其多智能体扩展未能解决这些挑战,因为它们没有将其他智能体纳入推理过程。在这项工作中,我们提出了Strat-Reasoner,一种新颖的基于强化学习的框架,旨在提升LLMs在多智能体游戏中的战略推理能力。我们引入了一种新颖的递归推理范式,其中智能体的推理也整合了其他智能体的推理过程。为了为中间推理序列提供有效的奖励信号,我们采用了一个集中的思维链(CoT)比较模块来评估推理质量。最后,我们计算了一个准确的混合优势,并开发了一种组相对强化学习方法以优化LLM策略。实验结果表明,Strat-Reasoner显著提升了底层LLMs的战略能力,在各种多智能体游戏中平均性能提升了22.1%。代码已公开在https://github.com/ydhe1012/Strat-Reasoner。

英文摘要

While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents poses significant challenges for evaluating the reasoning process and for assigning credit over multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges because they do not incorporate other agents in the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games. We introduce a novel recursive reasoning paradigm in which an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves the strategic abilities of the underlying LLMs, achieving an average performance improvement of 22.1% across various multi-agent games. Code is publicly available at https://github.com/ydhe1012/Strat-Reasoner.