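arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.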

2605.18172 2026-05-26 cs.AI

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

Jun-Yu Pan, Yansen Wang, Enze Zhang, Bao-Liang Lu, Wei-Long Zheng, Dongsheng Li

Affiliations: Shanghai Jiao Tong University; Microsoft Research Asia

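AI summary: Proposes the Generative Visual Grounding (GVG) framework, which uses an EEG-to-image generative model as a visual translator to give multimodal large language models structured visual context, improving understanding of non-visual EEG and interpretation of clinical states.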

Abstract

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.

2605.17937 2026-05-26 cs.CL cs.AI

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

Affiliations: Beijing Normal University; Elmleaf Ltd.

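AI summary: Introduces BacktestBench, the first large-scale benchmark for automated quantitative backtesting, with 18,246 question-answering pairs, and designs AutoBacktest, a multi-agent baseline that coordinates a Summarizer, a Retriever, and a Coder to turn natural-language strategies into reproducible backtests.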

Comments: This paper has been accepted by KDD 2026 (Datasets and Benchmarks Track)

DOI: 10.1145/3770855.3817460

Abstract

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.
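To make the "metrics calculation" task category a little more concrete, here is a minimal sketch of two common backtest metrics computed from an equity curve. It is not taken from BacktestBench or AutoBacktest; the function names and the sample numbers are made up for illustration, and the benchmark's own metric definitions may differ.

```python
from typing import Sequence


def total_return(equity: Sequence[float]) -> float:
    """Fractional return from the first to the last equity value."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a positive fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                     # highest value seen so far
        worst = max(worst, (peak - value) / peak)   # decline relative to that peak
    return worst


# Hypothetical equity curve (account value after each period).
equity_curve = [100.0, 120.0, 90.0, 110.0]
print(total_return(equity_curve), max_drawdown(equity_curve))
```

With this curve the total return is about 10% and the maximum drawdown is 25% (the drop from 120 to 90).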

2605.17730 2026-05-26 cs.LG cs.AI

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

Fan Zhang, Shijun Chen, Hua Wang

Affiliations: Business University, Yantai, Shandong, China; Ludong University, Yantai, Shandong, China

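AI summary: To address the response lag that the direct-mapping paradigm shows at turning points under distribution shifts and regime changes, proposes the L-Drive framework, which introduces a latent context to characterize high-level dynamics and uses gating to modulate increment representations, improving adaptation to changing segments; it also adopts patch-shared relative positional basis functions to strengthen intra-segment structural modeling, achieving a better trade-off between forecasting accuracy and computational efficiency.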

Abstract

Mainstream methods for multivariate time-series forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from history to the future in the observation space to fit value-level dependencies. However, real-world systems often undergo distribution shifts and regime changes. In such cases, a unified mapping can exhibit response lag around turning points, causing error accumulation within the switching window and reducing forecasting reliability. To address this issue, we propose L-Drive, a change-aware forecasting framework. L-Drive introduces a Latent-Context to explicitly characterize high-level dynamics evolving over time, and uses gating to modulate increment representations. This provides more timely change cues and improves adaptation to changing segments. In addition, it incorporates patch-shared relative positional basis functions to strengthen intra-segment structural modeling and reduce overfitting caused by absolute-position memorization. Extensive experiments validate the effectiveness of L-Drive and show a better overall trade-off between forecasting accuracy and computational efficiency.

2605.17537 2026-05-26 cs.AI

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

Shuqi Zhu, Yi Zhong, Ziyi Ye, Bangde Du, Yujia Zhou, Qingyao Ai, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing, China; Institute of Trustworthy Embodied AI, Fudan University, Shanghai, China

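AI summary: Through an EEG experiment, studies how human neural dynamics differ when processing hallucinated versus non-hallucinated content generated by a multimodal large language model, and finds that misjudged hallucinations fail to trigger the standard neurocognitive fact verification pathway.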

Abstract

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.

2605.16023 2026-05-26 cs.CL cs.LG

Judge Circuits

Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Selin Kahvecioglu, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

Affiliations: Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data; German Research Center for Artificial Intelligence (DFKI); Fraunhofer Heinrich Hertz Institute; Marburg University; Centre for European Research in Trusted AI (CERTAIN)

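AI summary: Uses Position-aware Edge Attribution Patching (PEAP) to causally analyze the internal mechanisms of Gemma-3, Qwen2.5, and Llama-3, finding that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph; by decoupling abstract judging from output formatting, it gives a mechanistic account of format-induced inconsistency.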

Comments: 39 pages

Abstract

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

2605.14769 2026-05-26 cs.LG

Composable Crystals: Controllable Materials Discovery via Concept Learning

Nian Liu, Yuwei Zeng, Ryoji Kubo, Nikita Kazeev, Stephen Gregory Dale, Artem Maevskiy, Pengru Huang, Thomas Laurent, Kostya S. Novoselov, Xavier Bresson

Affiliations: National University of Singapore; Loyola Marymount University

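AI summary: Proposes a concept-composition framework for crystal generation that uses a vector-quantized variational autoencoder to automatically discover reusable crystal concepts and recombines them for controllable exploration of novel crystals, improving the V.S.U.N. metric by up to 53.2% on MP-20 and 51.7% on Alex-MP-20.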

Abstract

De novo crystal generation, a central task in materials discovery, aims to generate crystals that are simultaneously valid, stable, unique, and novel. Existing methods mainly rely on black-box stochastic sampling, providing limited control over how generated structures move beyond the observed distribution. In this paper, we introduce a concept-based compositional framework for crystal generation. We train a vector-quantized variational autoencoder to automatically discover a shared set of reusable crystal concepts, which serve as building blocks for guided generation. These learned concepts naturally exhibit interpretability from both local atomic environments and global symmetry patterns, and generalize to crystals from different distributions. By recombining such concepts, our framework enables controllable exploration of novel crystals beyond the training distribution, rather than relying solely on unconstrained random sampling. To further improve composition efficiency, we introduce a composition generator and iteratively refine it using high-quality samples generated by the model itself. The resulting concept compositions are then used to condition downstream crystal generation. Numerical experiments on MP-20 and Alex-MP-20 show that composing concepts improves the base model by up to 53.2% and 51.7%, respectively, on the V.S.U.N. metric, with particular gains in novelty.
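The "discrete concepts" idea rests on the vector-quantization step of a VQ-VAE: a continuous embedding is snapped to its nearest entry in a learned codebook. The sketch below shows only that generic lookup, in plain Python; the codebook values and the `quantize` helper are hypothetical, and nothing here reflects the paper's encoder, codebook size, or training procedure.

```python
from typing import List, Sequence, Tuple


def quantize(vector: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, Sequence[float]]:
    """Return (index, entry) of the codebook entry nearest to `vector` in squared L2 distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook with three "concepts".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
concept_id, concept_vec = quantize((0.9, 0.2), codebook)
print(concept_id, concept_vec)
```

Here the input (0.9, 0.2) is closest to the second entry, so it is assigned concept index 1; recombining concepts then amounts to choosing which indices to condition generation on.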

2605.14759 2026-05-26 cs.LG

Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement

Nian Liu, Nikita Kazeev, Stephen Gregory Dale, Artem Maevskiy, Yuwei Zeng, Ryoji Kubo, Pengru Huang, Thomas Laurent, Yann LeCun, Kostya S. Novoselov, Xavier Bresson

Affiliations: National University of Singapore; Loyola Marymount University; New York University; AMI

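AI summary: Proposes Crys-JEPA, a joint embedding predictive architecture that uses an energy-aware latent space and a screening-and-refinement pipeline to address the conflict between stability and novelty in crystal generation, improving the V.S.U.N. metric by up to 53.8% on MP-20 and 72.7% on Alex-MP-20.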

Abstract

De novo crystal generation seeks to discover materials that are not merely realistic, but also stable and novel. However, most existing generative models are trained to maximize the likelihood of observed crystals, which encourages samples to stay close to known materials yet not necessarily align with the criteria that matter in discovery. Our empirical analysis shows that current crystal generative models exhibit a clear conflict between stability and novelty: samples near the observed distribution tend to retain stability but offer limited novelty, whereas samples farther from it often lose stability rapidly. This suggests that the useful region for discovering crystals that are both stable and novel is extremely narrow. To move beyond this limitation, we introduce Crys-JEPA, a joint embedding predictive architecture for crystals that learns an energy-aware latent space preserving formation-energy differences. In this space, stability assessment can be reformulated as an embedding-based comparison against accessible training crystals, reducing the reliance on expensive energy evaluation and task-specific external references. Building on Crys-JEPA, we further develop a screening-and-refinement pipeline that identifies promising generated crystals and reintroduces them to refine the generative model. On the MP-20 and Alex-MP-20 datasets, we achieve improvements over baselines of up to 53.8% and 72.7%, respectively, on the V.S.U.N. metric.

2605.13643 2026-05-26 cs.CL

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

Affiliations: Qwen Large Model Application Team, Alibaba; University of Waterloo; Vector Institute; Zhejiang University

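AI summary: Proposes GAP (Granular Alignment Paradigm), which uses feature-level, context-level, and capacity-guided alignment to address the feature-space mismatch in visual latent reasoning for multimodal large language models, improving perception and reasoning performance.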

Abstract

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (Xie et al., 2025; Li et al., 2026; Team et al., 2026). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

2605.10913 2026-05-26 cs.AI cs.PL cs.SE

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D Manning, Weiyan Shi

Affiliations: Northeastern University; Stanford University

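AI summary: Presents Shepherd, a Python runtime substrate grounded in functional programming that treats agent execution as a first-class object and lets meta-agents inspect, fork, and replay it through a Git-like execution trace, with substantial performance gains in three use cases.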

Comments: 50 pages, 22 figures, 14 tables

Abstract

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, much as managers supervise employees. Whatever a meta-agent does, whether coordinating agents, halting risky actions before execution, or repairing failed runs, it must manipulate agentic execution at runtime. Existing agentic substrates make this hard: they give meta-agents only plain transcripts and environment snapshots, requiring them to build their own tooling to reconstruct and orchestrate execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git-like execution trace, where any past state can be forked 5x faster than docker commit and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on TerminalBench-2 with 58% lower wall-clock time; (3) a meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's gains on TerminalBench-2. We open-source Shepherd to empower future meta-agents with principled and efficient operations over agentic execution.
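One way to picture a "Git-like execution trace" is an immutable, parent-linked chain of events, where forking is just keeping a reference to an earlier node. The sketch below illustrates that general idea only. The `Event` class and the `append` and `history` helpers are hypothetical names, not Shepherd's API, and real replay would also need to restore environment state, which this toy omits.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One immutable step in a trace, linked to the step before it."""
    kind: str                       # e.g. "model_call", "tool_call", "env_change"
    payload: str
    parent: Optional["Event"] = None


def append(head: Optional[Event], kind: str, payload: str) -> Event:
    """Return a new head; the existing chain is shared, never mutated."""
    return Event(kind, payload, head)


def history(head: Optional[Event]) -> List[Tuple[str, str]]:
    """Walk parent links back to the root and return events oldest-first."""
    events = []
    node = head
    while node is not None:
        events.append((node.kind, node.payload))
        node = node.parent
    return list(reversed(events))


root = append(None, "model_call", "plan")
step = append(root, "tool_call", "ls")
main = append(step, "env_change", "file written")
# Forking from a past state is O(1): start a new branch at `step`.
branch = append(step, "tool_call", "cat README")
print(history(main))
print(history(branch))
```

Because events are frozen and branches share their common prefix, a meta-agent could inspect either branch without disturbing the other, which is the property the Git analogy points at.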

2605.09270 2026-05-26 cs.LG cs.AI

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

Ruiying Peng, Mengyu Yang, Jing Lei, Xiaohui Li, Xueyu Wu, Xinlei Chen

Affiliations: Tsinghua Shenzhen International Graduate School; Huawei Technologies

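AI summary: To address supervised fine-tuning (SFT) undermining reasoning generalization, proposes Theorem-SFT, which trains on explicit theorem application, yields consistent gains across several benchmarks, and points to feed-forward layers as the primary locus of reasoning rules.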

Abstract

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches full-layers performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: Generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.

2605.09223 2026-05-26 cs.CV

CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding

Mehrajul Abadin Miraj, Abdul Mohaimen Al Radi, Shariful Islam Rayhan, Md. Tanvir Alam, Ismat Rahman, Yu Tian, Md Mosaddek Khan

Affiliations: Dept. of CSE, University of Dhaka; Dept. of CSE, University of Central Florida

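AI summary: Proposes CREST, a training-free frame selection method that uses the temporal geometry (local curvature) of query-frame relevance to guide sampling, enabling efficient long-video understanding under a fixed frame budget.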

Abstract

Selecting informative frames from long videos is a combinatorial problem that existing methods address either through efficient heuristics without explicit modeling of query-conditioned temporal structure, or through multi-stage retrieval pipelines with substantial preprocessing cost. We propose CREST, a training-free frame selection method grounded in the temporal geometry of query-frame relevance. CREST is based on the observation that relevance over time exhibits structured local variation: sharp curvature around salient events and flatter regions in redundant segments. By using local curvature to guide selection, CREST allocates a fixed frame budget more effectively across brief decisive events and slowly evolving evidence. Under a fixed backbone and frame budget, CREST achieves higher accuracy than AKS, a lightweight relevance-coverage baseline, on LongVideoBench and VideoMME, while retaining 93-95% of the accuracy of MIRA, a stronger multi-stage retrieval pipeline, at only 3-4% of its preprocessing cost. On TempRel, our diagnostic benchmark for temporal frame selection, CREST achieves a 6.88% relative improvement over AKS. Pairwise LLM-as-a-judge evaluation further shows that CREST-selected frames yield more coherent frame-conditioned descriptions, with win rates of 60.58% and 54.50% on the two benchmarks. These results show that local temporal geometry provides a simple and efficient basis for long-video frame selection. Code and implementation details are included in the supplementary material and will be released publicly upon acceptance.
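The abstract does not spell out how curvature is computed or how the budget is split, so the following is only a guess at the simplest version of the idea: treat per-frame relevance as a 1-D signal, estimate local curvature with a second-order finite difference, and spend the frame budget on the highest-curvature positions. The function names and the toy relevance scores are assumptions, not CREST's actual procedure.

```python
from typing import List, Sequence


def local_curvature(scores: Sequence[float]) -> List[float]:
    """Absolute second-order finite difference of the relevance signal; endpoints get 0."""
    n = len(scores)
    curvature = [0.0] * n
    for t in range(1, n - 1):
        curvature[t] = abs(scores[t - 1] - 2 * scores[t] + scores[t + 1])
    return curvature


def select_frames(scores: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` frame indices with the largest curvature, in temporal order."""
    curvature = local_curvature(scores)
    # Python's sort is stable, so ties keep their earlier-frame-first order.
    ranked = sorted(range(len(scores)), key=lambda t: curvature[t], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical query-frame relevance: flat, with one brief salient event at frame 3.
relevance = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(select_frames(relevance, budget=3))
```

On this toy signal the selected frames cluster around the event (indices 2, 3, 4), while the flat, redundant stretches receive none of the budget. A practical method would presumably also reserve some budget for coverage of slowly evolving evidence, as the abstract suggests.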

2605.07607 2026-05-26 cs.CV

FS-I2P:A Hierarchical Focus-Sweep Registration Network with Dynamically Allocated Depth

Zhixin Cheng, Yujia Chen, Xujing Tao, Bohao Liao, Xiaotian Yin, Baoqun Yin, Tianzhu Zhang

Affiliations: School of Information Science and Technology, University of Science and Technology of China; School of Computer Science and Information Engineering, Hefei University of Technology; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory; Institute of Advanced Technology, University of Science and Technology of China

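AI summary: Proposes a hierarchical interaction module based on a Focus-Sweep paradigm, together with a dynamic layer allocation strategy, to address scale ambiguity and attention drift in image-to-point-cloud registration, reaching state-of-the-art performance on RGB-D Scenes V2 and 7-Scenes.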

Abstract

Image-to-point cloud registration is often challenged by viewpoint changes, cross-modal discrepancies, and repetitive textures, which induce scale ambiguity and consequently lead to erroneous correspondences. Recent detection-free methods alleviate this issue by leveraging multi-scale features and transformer-based interactions. However, they still suffer from attention drift across layers and intra-scale inconsistencies, hindering precise registration. Inspired by human behavior, we propose a "Focus-Sweep" paradigm and develop a Hierarchical Focus-Sweep Interaction Module within an SSM-based framework to enhance multi-level cross-modal feature association. In addition, we introduce a Dynamic Layer Allocation Strategy that adaptively determines the iteration depth to better exploit geometric constraints and improve matching robustness. Extensive experiments and ablations on two benchmarks, RGB-D Scenes V2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance.

2605.07233 2026-05-26 cs.LG cs.CR stat.ML

Modulated learning for private and distributed regression with just a single sample per client device

Praneeth Vepakomma, Amirhossein Reisizadeh, Samuel Horváth, Munther A. Dahleh

Affiliations: MIT

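AI summary: For distributed learning where each client holds only a single sample, proposes a privacy-preserving method for learning a global model in which clients inject calibrated noise and share post-processed representations, yielding a gradient update that matches the non-private centralized gradient in expectation.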

Comments: 30 pages

Abstract

This work focuses on the question of learning from a large number of devices, each holding only a single sample of data. Several real-world applications fit this one-sample-per-client setup, including learning from fitness trackers, data/app usage aggregators, body-worn sensing devices, and daily event monitors, to name a few. When a client has only one sample, the standard federated learning paradigm breaks down, as a local update based on that single point is far from useful, especially in the earlier rounds of estimating the model coefficients. This utility is further weakened by the privacy-inducing noise applied at every round. This work addresses this problem, enabling such clients to collaboratively and effectively learn a global model without leaking the privacy of their data. The proposed approach injects a single, carefully calibrated noisy perturbation to transform the sample at each client, followed by a post-processed representation which is shared with the server. These representations, aggregated at the server, are processed to obtain an unbiased gradient update that in expectation matches the non-private centralized gradient while preserving data privacy. This approach differs from traditional private federated learning, where the communication payloads involve model coefficients as opposed to privately transformed data samples. This method enables devices with extremely limited data to collaborate and learn accurate, privacy-preserving models without requiring large local datasets or sacrificing individual privacy.

2605.04906 2026-05-26 cs.AI

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei

Affiliations: James Watt School of Engineering, University of Glasgow

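AI summary: Proposes the ESIA framework, which models spatiotemporal interactions with a conditional random field and an energy function and predicts pedestrian intention using structural consistency constraints and a simulated annealing algorithm, reaching state-of-the-art performance on standard benchmarks with improved interpretability.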

Comments: 13 pages, 6 figures, 3 tables

Abstract

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
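The combination of unary potentials, pairwise potentials, and a simulated-annealing search seeded from the unary predictions can be sketched generically as follows. This is a toy under stated assumptions, not the paper's U-SSA algorithm: the disagreement penalty used as the pairwise term, the linear cooling schedule, and all names and numbers are hypothetical, and the structural consistency terms are left out.

```python
import math
import random
from typing import List, Sequence, Tuple


def energy(labels: Sequence[int], unary: Sequence[Sequence[float]],
           edges: Sequence[Tuple[int, int]], pair_weight: float) -> float:
    """Global energy: unary cost of each node's label plus a penalty per disagreeing edge."""
    total = sum(unary[i][labels[i]] for i in range(len(labels)))
    total += sum(pair_weight for i, j in edges if labels[i] != labels[j])
    return total


def unary_seeded_annealing(unary, edges, pair_weight=1.0, steps=500, t0=1.0, seed=0):
    """Simulated annealing that starts from each node's lowest-cost unary label."""
    rng = random.Random(seed)
    labels = [min(range(len(u)), key=lambda k: u[k]) for u in unary]  # unary seed
    current = energy(labels, unary, edges, pair_weight)
    best, best_energy = labels[:], current
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9   # linear cooling (an assumption)
        i = rng.randrange(len(labels))
        proposal = labels[:]
        proposal[i] = rng.randrange(len(unary[i]))     # relabel one node at random
        candidate = energy(proposal, unary, edges, pair_weight)
        # Always accept improvements; accept worse moves with Metropolis probability.
        if candidate <= current or rng.random() < math.exp((current - candidate) / temperature):
            labels, current = proposal, candidate
            if current < best_energy:
                best, best_energy = labels[:], current
    return best, best_energy


# Three pedestrians in a chain, two labels each (0 = not crossing, 1 = crossing); costs are made up.
unary_costs = [[0.1, 0.9], [0.6, 0.4], [0.2, 0.8]]
chain_edges = [(0, 1), (1, 2)]
best_labels, best_value = unary_seeded_annealing(unary_costs, chain_edges)
print(best_labels, best_value)
```

In this toy scene the unary seed is [0, 1, 0] with energy 2.7, whereas the globally consistent labeling [0, 0, 0] has energy 0.9, which shows how pairwise terms can override a locally preferred but scene-inconsistent label.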

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

Self-supervised Hierarchical Visual Reasoning with World Model

LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation

Active Budget Allocation for Efficient Scaling Law Estimation via Surrogate-Guided Pruning

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

Judge Circuits

Composable Crystals: Controllable Materials Discovery via Concept Learning

Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Asymmetric Flow Models

Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

DIVER:Diving Deeper into Distilled Data via Expressive Semantic Recovery

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding

FS-I2P:A Hierarchical Focus-Sweep Registration Network with Dynamically Allocated Depth

Modulated learning for private and distributed regression with just a single sample per client device

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation

Efficient Preference Poisoning Attack on Offline RLHF

Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect

Rethinking LLM Ensembling from the Perspective of Mixture Models

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction