arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24680 2026-05-26 cs.LG

Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data

Tomer Lavi, Bracha Shapira, Nadav Rappoport

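AI summary: Proposes the Trajectory-based Difficulty Score (TDS), which estimates per-instance difficulty by analyzing the tree-by-tree cumulative prediction trajectories of gradient-boosted trees. It outperforms existing baselines on classification while remaining competitive on regression, and supports applications such as active learning, selective prediction, and conformal prediction.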

Abstract

Gradient-boosted trees achieve strong performance on tabular data, yet often leave a long tail of poorly predicted instances. We introduce a Trajectory-based Difficulty Score (TDS), an instance-level difficulty estimator for boosted ensembles derived from per-tree cumulative prediction trajectories. For each instance, we compute interpretable trajectory descriptors (e.g., variance, oscillation peaks, sign switches, and tail stability) and train a lightweight regression model to predict held-out loss. An empirical CDF calibrates the resulting signal into a score in $[0,1]$ that supports ranking hard cases. Across diverse tabular benchmarks and ensemble sizes, TDS exhibits strong rank correlation with error and outperforms established instance-hardness and uncertainty baselines on classification, while remaining competitive on regression. We then show how a single difficulty signal improves multiple data mining workflows: difficulty-driven active learning for label-efficient training, difficulty-thresholded selective prediction for improved risk-coverage trade-offs, and TDS-stratified (Mondrian) conformal prediction for more uniform conditional coverage. Finally, clustering high-TDS instances using SHAP attributions reveals coherent failure modes characterized by compact feature-value ranges, supporting error analysis and targeted data acquisition.
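
The descriptor and calibration steps described in this abstract might look roughly like the sketch below. It is a minimal standard-library illustration, not the authors' implementation: the function names (`trajectory_descriptors`, `make_ecdf`), the exact definitions of each descriptor, the `tail` window, and all numbers are assumptions. The lightweight regressor is omitted, so a list of hypothetical predicted losses stands in for its output.

```python
from bisect import bisect_right
from statistics import pvariance

def trajectory_descriptors(trajectory, tail=3):
    """Summarise one instance's cumulative prediction after each tree.

    The descriptor definitions here are illustrative assumptions.
    """
    steps = [b - a for a, b in zip(trajectory, trajectory[1:])]
    # Sign switch: the cumulative score crosses zero between two trees.
    sign_switches = sum(1 for a, b in zip(trajectory, trajectory[1:]) if a * b < 0)
    # Oscillation peak: two consecutive updates point in opposite directions.
    peaks = sum(1 for a, b in zip(steps, steps[1:]) if a * b < 0)
    # Tail stability: spread of the last `tail` cumulative predictions.
    tail_vals = trajectory[-tail:]
    return {
        "variance": pvariance(trajectory),
        "oscillation_peaks": peaks,
        "sign_switches": sign_switches,
        "tail_range": max(tail_vals) - min(tail_vals),
    }

def make_ecdf(reference_signals):
    """Return a function mapping a raw signal to its empirical CDF value in [0, 1]."""
    ordered = sorted(reference_signals)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

# Hypothetical trajectories: one settles quickly, one keeps flipping sign.
easy = [0.5, 0.9, 1.2, 1.4, 1.5, 1.5]
hard = [0.5, -0.3, 0.4, -0.2, 0.3, -0.1]
d_easy = trajectory_descriptors(easy)
d_hard = trajectory_descriptors(hard)

# Stand-in for the regressor's predicted held-out losses on a calibration set.
calibration_losses = [0.05, 0.10, 0.20, 0.40, 0.80]
score = make_ecdf(calibration_losses)
```

Because the empirical CDF only uses ranks, the resulting score supports ordering hard cases without assuming any particular distribution for the raw signal.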

2605.24679 2026-05-26 cs.CV

MindAdapter: Few-Shot Parameter-Efficient Residual Calibration of Cross-Subject Brain-to-Visual Decoding Models

Jiaxiang Liu, Jiawei Du, Xupeng Chen, Guoqi Li, Jiang Cai, Simon Fong, Mingkun Xu

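AI summary: Proposes MindAdapter, a framework that achieves few-shot, parameter-efficient calibration for cross-subject brain-to-visual decoding through decoupled linear-residual cascade alignment and a topology-anchored dual-stream manifold constraint.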

Comments Accepted to KDD 2026 (AI4Sciences Track). 15 pages, 7 figures

Abstract

Cross-subject brain-to-visual decoding remains a core challenge in brain-computer interfaces due to severe inter-individual variability that induces systematic subject-specific functional misalignment. To address this issue, we propose MindAdapter, a parameter-efficient few-shot calibration framework for pretrained brain-to-visual decoding models. MindAdapter adopts a decoupled linear-residual cascade alignment paradigm by freezing a pretrained explicit brain functional alignment backbone (coarse) and introducing a lightweight nonlinear residual adapter (fine), thereby disentangling global cross-subject correspondence from subject-specific residual corrections for fine-grained spatial and semantic calibration. To further preserve global representational stability, we design a topology-anchored dual-stream manifold constraint, where a small set of shared stimuli serves as topological pins with voxel-level paired supervision, while a semantic stream enforces consistency through a frozen vision-language decoder on unpaired brain data. Together, MindAdapter efficiently injects subject-specific corrections while maintaining the global representational geometry learned during pretraining. Experiments on the Natural Scenes Dataset (NSD) demonstrate that MindAdapter substantially improves cross-subject visual reconstruction and retrieval accuracy using only a few shared stimuli, offering a practical and data-efficient solution for personalized brain-to-visual decoding.

2605.24675 2026-05-26 cs.CV cs.AI

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

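AI summary: To address the visual representation gap in Web image translation, proposes the VaaWIT framework, which uses a Dual-Stream Attention Module and a Visual-Aware Adapter to let an LLM dynamically fuse fine-grained visual features. It outperforms open-source models on multiple benchmarks and approaches the performance of closed-source models.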

Comments Accepted by KDD 2026

Abstract

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

2605.24674 2026-05-26 cs.CV

Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

Yan Li, Lin Liu, Xiaopeng Zhang, Qi Tian

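AI summary: To address undifferentiated conditioning signals and insufficient supervision of cross-attention in instruction-based video editing, proposes the RVEDiT framework, which uses Granularity-Routed Token Conditioning and Reference-Anchored Attention Alignment to achieve coarse-to-fine editing and regularize internal reasoning.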

Abstract

Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

2605.24667 2026-05-26 cs.AI cs.LG

When Mean CE Fails: Median CE Can Better Track Language Model Quality

Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang

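AI summary: Finds that median cross-entropy reflects a language model's task performance during training better than mean cross-entropy, and recommends reporting several percentile cross-entropy values in evaluation.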

Comments 20 pages

Abstract

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
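
The recommendation to report percentile summaries alongside the mean can be illustrated with a short sketch. The per-token values below are hypothetical and chosen only to show how a heavier tail can raise the mean while the median falls. The function name `ce_summary` and the choice of the 90th and 99th percentiles are assumptions, not the paper's protocol.

```python
from statistics import mean, median, quantiles

def ce_summary(per_token_ce):
    """Mean plus a few percentile summaries of per-token cross-entropy."""
    cuts = quantiles(per_token_ce, n=100, method="inclusive")  # 99 cut points
    return {
        "mean": mean(per_token_ce),
        "median": median(per_token_ce),
        "p90": cuts[89],
        "p99": cuts[98],
    }

# Hypothetical per-token CE values (nats) for two checkpoints.
# Checkpoint B is sharper on most tokens but far worse on a few.
checkpoint_a = [1.0] * 90 + [3.0] * 10
checkpoint_b = [0.5] * 90 + [9.0] * 10

a, b = ce_summary(checkpoint_a), ce_summary(checkpoint_b)

# True when mean and median pick different checkpoints as "better".
disagree = (a["mean"] < b["mean"]) != (a["median"] < b["median"])
```

Here the mean prefers checkpoint A and the median prefers checkpoint B, which is the kind of disagreement the abstract suggests treating as a low-cost diagnostic.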

2605.24661 2026-05-26 cs.AI cs.CL

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

Ali Şenol, Garima Agrawal, Huan Liu

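AI summary: Proposes a behavior-based multi-dimensional framework that evaluates LLM reasoning quality along six dimensions (correctness, consistency, robustness, logical coherence, efficiency, and stability), reveals behaviors that accuracy alone cannot expose, and supports deployment decisions.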

Abstract

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
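
A ranking inversion of the kind described here can be reproduced with a small sketch. Everything numeric is hypothetical: the two model profiles, the weight vectors, and the weighted-average aggregation are assumptions for illustration, not the paper's scores or its Q_bal formula. Only the six dimension codes come from the abstract.

```python
from itertools import combinations

DIMENSIONS = ["CQ", "CS", "RS", "LS", "ES", "SS"]

# Six dimensions give 15 unordered pairs, the denominator of the 11/15 figure.
pairs = list(combinations(DIMENSIONS, 2))

def weighted_score(profile, weights):
    """Weighted average of per-dimension scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(profile[d] * weights[d] for d in DIMENSIONS) / total

def rank(profiles, weights):
    """Model names sorted from best to worst under the given weighting."""
    return sorted(profiles, key=lambda m: weighted_score(profiles[m], weights), reverse=True)

# Hypothetical profiles: model_x is more accurate, model_y reasons more coherently.
profiles = {
    "model_x": {"CQ": 0.90, "CS": 0.70, "RS": 0.70, "LS": 0.50, "ES": 0.70, "SS": 0.70},
    "model_y": {"CQ": 0.80, "CS": 0.70, "RS": 0.70, "LS": 0.80, "ES": 0.70, "SS": 0.70},
}

# Hypothetical weightings emphasising correctness or logical coherence.
accuracy_first = {"CQ": 5, "CS": 1, "RS": 1, "LS": 1, "ES": 1, "SS": 1}
coherence_first = {"CQ": 1, "CS": 1, "RS": 1, "LS": 5, "ES": 1, "SS": 1}
```

Calling `rank(profiles, accuracy_first)` puts `model_x` first, while `rank(profiles, coherence_first)` puts `model_y` first, so a single accuracy number would hide the reversal.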

2605.24659 2026-05-26 cs.LG

IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization

Zixuan Chen, Jiaxiang Chen, Li Luo, Ke Xu, Xiaoxiang Huang, Tanfeng Sun, Xinghao Jiang

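AI summary: Proposes the IterInject framework, which iteratively optimizes adversarial payloads with a rule-based diagnoser and an LLM optimizer to mount indirect prompt injection attacks on LLM agents. It substantially outperforms existing methods on multiple benchmarks and a real-world system, and reveals an attention-mediated threshold mechanism.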

Comments Submitted to EMNLP 2026

Abstract

LLM-based agents are increasingly deployed for complex tasks requiring planning, tool use, and interaction with external services. Their reliance on untrusted external content exposes them to indirect prompt injection (IPI), in which adversarial instructions embedded in retrieved data hijack agent behavior. Existing attacks rely on static payloads that cannot adapt to agent-specific defenses; even recent adaptive methods lack structured feedback to guide optimization. We introduce IterInject, a feedback-guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule-based diagnoser produces structured outcome labels with behavioral descriptions, and an LLM-based optimizer refines payloads conditioned on the full optimization history. A synthesis step generates new disguise seeds from failure patterns, enabling the strategy space to self-evolve. On AgentDojo and InjectAgent, IterInject substantially outperforms static baselines and existing adaptive methods across four victim models. Extension experiments on Claude Code, a production-grade coding agent with layered defenses, show that optimized payloads achieve full success on 5 of 9 targets; even those that resist full exploitation exhibit measurable improvement from iterative refinement. We further present a mechanistic analysis of IPI, identifying an attention-mediated threshold mechanism in mid-to-late layers; three causal interventions validate this finding and point to concrete defense directions.

2605.24658 2026-05-26 cs.LG

WLNO: Wavelet-Laplace Neural Operator for Solving Partial Differential Equations

Muhammad Abid, Arth Sojitra, Omer San

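AI summary: Proposes WLNO, which fuses Haar wavelet multi-scale spatial decomposition with the pole-residue formulation of the Laplace Neural Operator. It outperforms LNO on five benchmark PDE problems and is especially effective on problems with strong spatial multi-scale structure.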

Abstract

This work introduces the Wavelet-Laplace Neural Operator (WLNO), a novel neural operator that fuses Haar wavelet multi-scale spatial decomposition with the Laplace-domain pole-residue formulation of the Laplace Neural Operator (LNO). While LNO captures transient and steady-state dynamics through learnable system poles and residues, it lacks an explicit mechanism for extracting spatially localized multi-scale features inherent in complex PDE solutions. WLNO addresses this by augmenting the LNO core with a parallel single-level Haar discrete wavelet transform (DWT) branch that decomposes the lifted feature map into four frequency subbands, approximation (LL), horizontal detail (LH), vertical detail (HL), and diagonal detail (HH), and applies independent learned $1\times1$ convolutions to each subband before reconstruction via the inverse DWT. The two branches are fused through a learnable sigmoid-gated weight $\alpha_\mathrm{wav}$, initialized to give the wavelet branch a small initial contribution, allowing the model to adaptively balance Laplace-domain dynamics against spatial multi-scale features throughout training. WLNO is evaluated against LNO on five benchmark PDE problems using identical hyperparameters, training data, and evaluation protocols: the diffusion equation, the Burgers equation, the reaction-diffusion system, Darcy flow, and the two-dimensional Navier-Stokes equation. WLNO consistently outperforms LNO on all five problems, with the most pronounced improvement on problems with strong spatial multi-scale structure, such as the Burgers equation with sharp shock fronts and the Navier-Stokes equation with coherent vortical structures, while remaining consistent across smoother and elliptic problems. These results demonstrate that wavelet-based multi-scale spatial decomposition is a principled and effective complement to Laplace-domain operator learning.
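
The wavelet branch can be sketched in a few lines for a single-channel grid. This is a standard-library illustration under several assumptions, not the authors' code. With one channel, each $1\times1$ convolution reduces to a scalar weight per subband. The additive fusion `laplace + sigmoid(alpha_wav) * wavelet` is a guess at how the gate is applied. The LH/HL naming follows one common convention, and libraries differ on it. All function names are hypothetical.

```python
import math

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of an even-sized grid (list of rows)."""
    h, w = len(x) // 2, len(x[0]) // 2
    ll, lh, hl, hh = ([[0.0] * w for _ in range(h)] for _ in range(4))
    for i in range(h):
        for j in range(w):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # approximation
            lh[i][j] = (a + b - c - d) / 2  # difference between rows
            hl[i][j] = (a - b + c - d) / 2  # difference between columns
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild each 2x2 block from its four coefficients."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            p, q, r, s = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (p + q + r + s) / 2
            x[2 * i][2 * j + 1] = (p + q - r - s) / 2
            x[2 * i + 1][2 * j] = (p - q + r - s) / 2
            x[2 * i + 1][2 * j + 1] = (p - q - r + s) / 2
    return x

def wavelet_branch(x, weights):
    """Scale each subband by its own weight (a 1x1 conv on one channel), then invert."""
    bands = haar_dwt2(x)
    scaled = [[[wt * v for v in row] for row in band] for band, wt in zip(bands, weights)]
    return haar_idwt2(*scaled)

def fuse(laplace_out, wavelet_out, alpha_wav):
    """Assumed fusion: add the wavelet branch scaled by sigmoid(alpha_wav)."""
    gate = 1.0 / (1.0 + math.exp(-alpha_wav))
    return [[l + gate * v for l, v in zip(lr, wr)] for lr, wr in zip(laplace_out, wavelet_out)]

field = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 7.0, 6.0, 5.0],
]
# With all subband weights equal to 1 the branch is the identity map.
identity = wavelet_branch(field, (1.0, 1.0, 1.0, 1.0))
```

A strongly negative initial `alpha_wav` (for example -3, giving a gate near 0.05) would match the abstract's description of a small initial wavelet contribution, though the actual initial value is not stated.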

2605.24657 2026-05-26 cs.AI cs.SE

Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

Simon Dennis, Kevin Shabahang, Hao Guo, Rivaan Patil

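AI summary: To address the inability of inference-only LLM deployments to persist user knowledge, proposes consolidating interaction knowledge into model weights through nightly reflection, synthesis, and LoRA fine-tuning. Experiments show a 43.6-percentage-point gain in knowledge retention over cascading compaction.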

Comments 15 pages

Abstract

Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corrections, and project context, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into model weights via reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 +/- 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), while consolidation retains 80.4 +/- 1.3% -- a 43.6 pp gain (paired t(9) = 14.8, p < 0.001) that more than doubles what compaction preserves, with the largest gains on procedural corrections (36.3% -> 74.6%) and episodic project facts (31.5% -> 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51) while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.
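
The headline figures can be checked with simple arithmetic. The sketch below recomputes the percentage-point gain and the ratio from the numbers in the abstract. It also places both methods on the floor-to-ceiling range the abstract mentions. That normalization is an illustrative assumption (the abstract reports only the raw percentages), and the function name is hypothetical.

```python
def normalized_retention(retained, floor, ceiling):
    """Fraction of the floor-to-ceiling range that a method recovers."""
    return (retained - floor) / (ceiling - floor)

compaction, consolidation = 36.8, 80.4  # % knowledge retained (from the abstract)
floor, ceiling = 11.8, 90.1             # no-context floor and full-context ceiling

gain_pp = consolidation - compaction    # percentage-point gain
ratio = consolidation / compaction      # how many times compaction's retention

compaction_norm = normalized_retention(compaction, floor, ceiling)
consolidation_norm = normalized_retention(consolidation, floor, ceiling)
```

The gain works out to 43.6 percentage points and the ratio to roughly 2.18, consistent with the "more than doubles" claim. On the normalized scale, compaction recovers about 32% of the available range and consolidation about 88%.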

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang

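AI summary: Proposes AVBench, which delivers automated, accurate evaluation of audio-video generation through fine-grained human-centric metrics and specialized evaluators trained with preference learning.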

Abstract

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage, with only a few coarse-grained benchmarks for human-related scenarios and relying on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
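
The difference between a discrete judgment and a confidence-derived score can be shown with a small sketch. It assumes the evaluator exposes logits for a "yes" and a "no" answer and that the score is the softmax probability of "yes". Both are assumptions for illustration, since the abstract does not specify how confidence is extracted. The function names and logit values are hypothetical.

```python
import math

def yes_probability(logit_yes, logit_no):
    """Softmax over the two answer logits, returned as P("yes")."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def discrete_judgment(logit_yes, logit_no):
    """VQA-style answer: keeps only which option scored higher."""
    return "yes" if logit_yes >= logit_no else "no"

# Hypothetical evaluator logits for two clips on the same yes/no question,
# e.g. "Is the speech synchronized with the lip motion?"
clip_a = (2.0, 1.8)   # barely "yes"
clip_b = (4.0, -1.0)  # confidently "yes"

score_a = yes_probability(*clip_a)
score_b = yes_probability(*clip_b)
```

Both clips receive the same discrete answer, but the continuous scores (about 0.55 and 0.99) separate a marginal case from a clear one, which is what makes such a signal usable for ranking or filtering.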

2605.24647 2026-05-26 cs.CL

Know You Before You Speak: User-State Modeling for LLM Personalization in Multi-Turn Conversation

Jiani Luo, Xiaoyan Zhao, Yang Zhang, Shuyi Miao, Bingbing Xu, Stefan Konigorski, Tat-Seng Chua

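AI summary: Proposes PUMA, a framework grounded in the Free Energy Principle that uses an explicit user-state model and an action-selection mechanism to shift personalized dialogue from passive memory retrieval to model-based decision-making over user evolution, improving long-horizon dialogue outcomes on healthcare counseling benchmarks.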

Comments 30 pages, 3 figures

Abstract

Personalized dialogue requires more than recalling explicit user histories: systems also need to infer hidden user states that evolve through interaction and shape appropriate response strategies. Existing memory- and profile-based methods primarily reuse observable user information, offering limited support for modeling user-state dynamics or selecting actions based on how they shape future user states. We propose PUMA (Prospective User-state Modeling for Action selection), a framework grounded in the Free Energy Principle (FEP) that formulates personalization as decision-making under partial observability, centered on an explicit user state model that captures latent user states and their action-conditioned dynamics. At each turn, PUMA maintains a belief over the user's hidden state, refines the user state model for observation generation and action-conditioned state transition, and selects dialogue actions by minimizing expected free energy, balancing epistemic and pragmatic objectives under a unified criterion. This formulation shifts personalization from passive memory retrieval to model-based decision-making over user evolution. We instantiate PUMA on healthcare-oriented counseling and motivational interviewing benchmarks with latent state annotations for rigorous evaluation. Experiments show that PUMA improves long-horizon dialogue outcomes while maintaining strong response quality, and a cross-dataset study demonstrates more reliable user-state estimation and next-state prediction.

2605.24643 2026-05-26 cs.RO cs.SY eess.SY

Towards Low-Gravity Planetary Exploration using Reinforcement Learning for Walking, Jumping, and In-flight Attitude Control

Jørgen Anker Olsen, Kostas Alexis

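AI summary: Uses reinforcement learning to develop walking, vertical-jumping, forward-jumping, and in-flight attitude control policies for a quadruped robot in Mars's low gravity, enabling it to clear obstacles and land safely. Simulation and experiments validate the policies.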

Comments 16 pages, 16 figures

Abstract

This paper presents reinforcement learning (RL) policies for dynamic quadrupedal locomotion in planetary exploration scenarios. Building on a task-optimized quadruped with a 5-bar leg design, we develop RL policies for walking, vertical jumping, forward jumping, and in-flight attitude control, explicitly tailored to the reduced gravity on Mars. These policies jointly enable such robots to overcome obstacles larger than themselves through coordinated jumping and precise in-flight reorientation for safe landings. We demonstrate Sim2Real transfer of the attitude control policy on the Olympus quadruped through single-axis reorientation tests, while all locomotion policies are validated in simulation. A complete Mars exploration mission scenario demonstrates coordinated policy deployment across challenging terrain. Experimental results show 90° attitude reorientation in 2.6 seconds, with simulations demonstrating 3.1 meter vertical jumps and 3.9 meter forward jumps under Martian gravity conditions. Supplementary video: https://www.youtube.com/watch?v=qlSJ3P87A4A

2605.24642 2026-05-26 cs.CV cs.RO

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone

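AI summary: Quantifies the "geometric gap" between vision-language-action models (VLAs) and geometric foundation models (GFMs) through linear probing, compares three architectures for injecting geometric information, and studies how non-architectural factors affect geometric VLA performance.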

Abstract

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

2605.24639 2026-05-26 cs.CV cs.AI

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

Ruihao Xu, Yong Liu, Yansong Tang, Sule Bai, Xubing Ye, Bingyao Yu, Yutao Guo, Jiwen Lu, Jie Zhou

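AI summary: Proposes the DisDop framework, which systematically distills multi-level domain priors from remote sensing foundation models (RemoteCLIP and DINOv3) into a lightweight detector, achieving state-of-the-art performance in open-vocabulary aerial object detection.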

Abstract

With the widespread adoption of drones in recent years, object detection in aerial images has attracted increasing attention, especially open-vocabulary aerial detection, which is not restricted to predefined categories. Because drone-view images are scarce and differ significantly from natural images, directly applying vanilla open-vocabulary detection methods designed for natural scenes rarely yields satisfactory results. Some studies transfer knowledge from pre-trained models by using lightweight networks or generating pseudo labels, but they tend to rely on models trained on natural images, neglecting the potential of foundation models specifically tailored for remote sensing and aerial imagery. To address this limitation, we propose DisDop, a unified framework that systematically distills multi-level domain priors from remote sensing foundation models (e.g., RemoteCLIP and DINOv3) into a lightweight detector. Specifically, we first distill visual priors through a teacher fusion strategy that combines RemoteCLIP's cross-modal alignment capability with DINOv3's fine-grained local feature extraction ability, transferring their complementary strengths to the detector's backbone. Second, we distill textual priors embedded in RemoteCLIP's text encoder by explicitly modeling inter-category semantic relationships, while incorporating global contextual priors to enhance local feature representation for small objects. Through this multi-level prior distillation framework, DisDop achieves new state-of-the-art performance on open-vocabulary aerial detection benchmarks. Extensive ablation analysis also demonstrates the rationality and effectiveness of the proposed modules.

2605.24635 2026-05-26 cs.CL

HiMed: Incentivizing Hindi Reasoning in Medical LLMs

HiMed: 激励医疗大语言模型中的印地语推理

Dingfeng Jiang, Han Yan, Chenze Ma, Amit Kumar Jaiswal, Ang Li, Yunxiang Jiang, Xinlei Xiong, Juhao Liang, Hongru Xiao, Xiang Li, Fan Bu, Jiale Han, Ruchir Gupta, Prayag Tiwari, Benyou Wang

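AI summary: To address the poor Hindi performance of medical LLMs, the paper introduces HiMed, a Hindi medical reasoning corpus and benchmark, and trains HiMed-8B with a decaying scaffolding reward, significantly improving Hindi medical reasoning and narrowing the English-Hindi accuracy gap.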

AI中文摘要

医疗大语言模型有望减少医疗保健差距,但印地语仍然严重代表性不足。尽管医疗大语言模型在高资源语言中表现出色,但其性能在印地语中急剧下降,尤其是在印度医学体系方面。我们认为,稳健的跨语言医疗迁移需要印地语推理。为此,我们引入了HiMed,一个涵盖西方和印度医学的印地语推理医疗语料库和基准套件。我们进一步通过设计衰减支架奖励,提出了HiMed-8B,一个印地语医疗推理大语言模型。大量实验表明,印地语医疗推理性能得到提升,英印准确率差距缩小。消融研究验证了每个训练阶段和奖励组件的贡献。所有数据和代码均可在GitHub上获取:https://github.com/FreedomIntelligence/HiMed。

英文摘要

Medical large language models hold promise for reducing healthcare disparities, yet Hindi remains severely underrepresented. While medical LLMs excel in high-resource languages, their performance degrades sharply in Hindi, particularly on Indian systems of medicine. We argue that robust cross-lingual medical transfer requires Hindi reasoning. To this end, we introduce HiMed, a Hindi reasoning medical corpus and benchmark suite covering both Western and Indian medicine. We further propose HiMed-8B, a Hindi-form medical reasoning LLM, through the design of a decaying scaffolding reward. Extensive experiments demonstrate improved Hindi medical reasoning performance and a reduced English-Hindi accuracy gap. Ablation studies validate the contribution of each training stage and reward component. All data and code are available on GitHub: https://github.com/FreedomIntelligence/HiMed.

2605.24631 2026-05-26 cs.LG cs.AI cs.CV

Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion

超越生成先验:JEPA引导扩散的少数采样

Sol Park, Soobin Um

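AI summary: Proposes a diffusion sampling framework guided by JEPA, a class of world models, made computationally efficient through approximation strategies; it improves the fidelity and semantic validity of minority samples in unconditional, class-conditional, and text-to-image generation.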

Comments ICML 2026, 21 pages, 9 figures

AI中文摘要

少数采样旨在数据流形上生成低密度实例,在医学诊断、异常检测和创意AI等应用中具有核心重要性。然而,现有方法相对于从训练数据中学习的生成先验来定义少数样本,将稀有性限制在可能无法很好反映现实世界语义的模型特定概念中。在这项工作中,我们提出了一种以世界为中心的少数采样视角,该视角相对于现实世界先验而非生成器诱导的密度来定义稀有性。为此,我们引入了JEPA引导,一种由联合嵌入预测架构(JEPA)引导的扩散采样框架——JEPA是一类编码广泛、语义丰富表示的世界模型。JEPA引导将扩散轨迹导向JEPA隐含密度下的低密度区域,从而使生成的少数样本与现实世界的语义稀有性对齐。为了使JEPA引导在计算上实用,我们开发了带有理论误差界限的原则性近似策略,显著降低了引导计算的开销。在无条件、类别条件和文本到图像生成上的大量实验表明,JEPA引导持续提高了少数样本的保真度和语义有效性,在捕捉现实世界的稀有性概念方面优于以生成器为中心的基线。代码可在https://github.com/soobin-um/jepa-guidance获取。

英文摘要

Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA) -- a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity. Code is available at https://github.com/soobin-um/jepa-guidance.

2605.24630 2026-05-26 cs.CV

DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion

DexSIM: 具有统一因果视频扩散的实时灵巧仿真

Adam Lee

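AI summary: Proposes DexSIM, which uses two-stage training (bidirectional video diffusion followed by autoregressive roll-out training) to achieve real-time, long-horizon-consistent simulation of dexterous manipulation, outperforming the baseline on pixel similarity, motion fidelity, and hand projection accuracy.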

Comments World Model @ ICLR 2026

AI中文摘要

视频扩散模型的最新进展已实现对物理世界的大规模仿真,但手部物体交互的仿真研究较少。我们提出DexSIM,一个用于实时灵巧操作仿真的灵巧仿真框架。以往利用视频扩散和3D重建的工作侧重于导航,而灵巧操作虽在创建交互式仿真体验和为机器人生成合成数据方面有广泛应用,但进展有限。现有方法缺乏实时交互性、长期空间一致性和记忆。我们为DexSIM提出两阶段训练框架。首先,通过在手部动作轨迹和视频的统一特征空间中进行联合嵌入,训练一个双向视频扩散模型。我们利用高斯热图手部编码实现更准确的手部表示。然后,我们进行基于滚动的自回归训练,将更新的空间缓存作为注意力汇点用于空间记忆,从而提高了长期一致性和3D感知灵巧操作仿真。DexSIM在像素和语义相似度、运动保真度和手部投影精度上优于基线。它还支持手部运动迁移等新应用,并以15.24 FPS的帧率实现实时交互。

英文摘要

Recent progress in video diffusion models has enabled extensive simulation of the physical world, yet simulation of hand-object interaction has been less explored. We propose DexSIM, a framework for simulating dexterous manipulation in real time. Previous works that use video diffusion and 3D reconstruction focus on navigation; progress on dexterous manipulation has been limited, even though it has extensive applications in creating interactive experiences with the simulated world and in generating synthetic data for robotics. Existing methods lack real-time interactivity, long-term spatial consistency, and memory. We propose a two-stage training framework for DexSIM. First, we train a bidirectional video diffusion model by jointly embedding the hand action trajectory and the video in a unified feature space, using a Gaussian heatmap hand encoding for a more accurate hand representation. Then, we conduct roll-out-based autoregressive training with an updated spatial cache serving as an attention sink for spatial memory, which improves long-term consistency and 3D-aware dexterous manipulation simulation. DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy. It also enables new applications such as hand motion transfer and runs interactively in real time at 15.24 FPS.

2605.24625 2026-05-26 cs.CV

ULF-Synth: Physics-Guided Ultra-Low-Field MRI Enhancement for Pediatric Neuroimaging

ULF-Synth:用于儿科神经影像的物理引导超低场MRI增强

Toufiq Musah, Salvatore Calcagno, Federica Proietto Salanitri, Xiaomeng Li, Maruf Adewole, Marawan Elbatel

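AI summary: Proposes ULF-Synth, which synthesizes realistic ultra-low-field images from high-field MRI and uses a spatial-frequency domain objective to enhance ultra-low-field MRI without real paired data, improving structural similarity and diagnostic acceptability.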

Comments 10 pages, 2 figures, 3 tables

AI中文摘要

超低场(ULF)MRI提供了便携且可及的神经影像,但与高场(HF)系统相比,存在信噪比降低和空间分辨率有限的问题。获取配对的ULF-HF数据进行监督增强通常很困难,尤其是在资源有限的环境中。我们提出了ULF-Synth框架,它结合了:(i)基于采集的从HF体积合成逼真ULF图像的方法,以创建大规模配对训练数据;(ii)优先恢复高频解剖细节的空间-频率域目标。该公式与架构无关,在编码器-解码器、对抗性和基于扩散的翻译模型中一致地提高了结构相似性和感知保真度。当仅使用合成数据训练时,所得模型有效泛化到真实的64mT ULF采集,改善了下游多类脑分割,并在盲法读者研究中获得了更高的放射科医生偏好和诊断可接受性。这些发现表明,合成配对监督提供了一种实用且可扩展的途径来增强ULF MRI,而无需真实的配对采集。代码、模型和数据集:https://github.com/toufiqmusah/ULF-Synth

英文摘要

Ultra-low-field (ULF) MRI offers portable and accessible neuroimaging but suffers from reduced signal-to-noise ratio and limited spatial resolution compared to high-field (HF) systems. Acquiring paired ULF-HF data for supervised enhancement is often difficult, particularly in resource-limited settings. We introduce ULF-Synth, a framework that combines (i) acquisition-based synthesis of realistic ULF images from HF volumes to create large-scale paired training data, and (ii) a spatial-frequency domain objective that prioritizes recovery of high-frequency anatomical detail. This formulation is architecture-agnostic, consistently improving structural similarity and perceptual fidelity across encoder-decoder, adversarial, and diffusion-based translation models. When trained exclusively on synthetic data, the resulting models generalize effectively to real 64 mT ULF acquisitions, improving downstream multiclass brain segmentation and achieving higher radiologist preference and diagnostic acceptability in a blinded reader study. These findings demonstrate that synthetic paired supervision provides a practical and scalable pathway for enhancing ULF MRI without requiring real paired acquisitions. Code, models, and dataset: https://github.com/toufiqmusah/ULF-Synth

2605.24624 2026-05-26 cs.CV

Vision-Language Binding in In-Context Image Generation

上下文图像生成中的视觉-语言绑定

Chris Ge, Rohit Gandikota, Antonio Torralba, Tamar Rott Shaham

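AI summary: Uses causal interventions to reveal an implicit cross-modal binding between text tokens and the reference image in FLUX.2, and localizes the binding to the padding tokens of the text sequence.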

Comments 35 pages, 19 figures

AI中文摘要

上下文图像生成模型(如FLUX.2)接收文本提示和可选的参考图像作为输出的视觉条件。在内部,所有三个输入——文本、参考图像和噪声令牌——被连接并通过单个注意力流处理,其中所有令牌可以相互关注。这留下了参考信息如何通过模型流动以产生输出图像的问题。我们展示了文本令牌与参考图像之间出现隐式跨模态绑定:在前向传播过程中,文本令牌吸收视觉参考内容,并且这些吸收的内容因果地影响生成的输出。我们通过三种因果干预方法揭示了FLUX.2中的这种绑定:T2I Lens,通过文本到图像路径解码中间文本令牌激活;Attention Knockout,切断特定的注意力边;以及I2I-to-I2I Patching,在编辑运行之间复制文本令牌激活。在包括SUN397和DreamBench++数据集以及在线收集的图像在内的2875个编辑任务中,我们观察到一致的分工:参考图像的属性(如颜色、风格和场景设置)首先被写入文本令牌,然后由文本令牌携带到生成的图像中;像素精确的属性(如特定面孔或实例身份)绕过文本令牌,通过图像到图像注意力直接从参考图像流向生成的图像。我们进一步将参考-文本绑定定位到文本序列的填充令牌。这些结果表明,多模态DiT中的文本令牌不仅仅是提示持有者,而是参考图像内容的结构化通道。更广泛地说,它们表明即使在统一注意力的多模态生成模型中,令牌模态也决定了条件信息如何在网络中表示和路由。

英文摘要

In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs -- text, reference image, and the noise tokens -- are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges; and I2I-to-I2I Patching, which copies text token activations between editing runs. Across 2,875 editing tasks on various images, including SUN397 and DreamBench++ datasets and images collected online, we observe a consistent division of labor: properties of the reference image, like color, style, and scene setting, are first written into the text tokens, which carry them to the generated image; pixel-exact properties like a specific face or instance identity bypass the text tokens and flow directly from reference to image through image-to-image attention. We further localize the reference-text binding to the padding tokens of the text sequence. These results show that text tokens in a multimodal DiT are not just prompt holders, but a structured channel for reference image content. More broadly, they suggest that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed across the network.
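To make the idea of severing attention edges concrete, here is a minimal sketch of a knockout on a toy attention matrix. It is not the authors' implementation and does not touch FLUX.2; the function name `attention_weights` and the `(query, key)` edge format are assumptions made for illustration. A severed edge has its logit set to negative infinity before the softmax, so it receives zero weight and the remaining weights are renormalized.

```python
import math

def attention_weights(scores, knocked_out=frozenset()):
    """Row-wise softmax over scores[query][key], with selected edges severed.

    knocked_out is a set of (query, key) index pairs (hypothetical format).
    Assumes every query keeps at least one un-severed key.
    """
    weights = []
    for q, row in enumerate(scores):
        # A severed edge gets a logit of -inf, i.e. zero weight after softmax.
        masked = [-math.inf if (q, k) in knocked_out else s
                  for k, s in enumerate(row)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Two tokens with uniform scores; sever the edge from query 0 to key 1.
scores = [[0.0, 0.0], [0.0, 0.0]]
baseline = attention_weights(scores)
knocked = attention_weights(scores, knocked_out={(0, 1)})
```

In this toy case, `baseline` gives every edge weight 0.5, while `knocked` routes all of query 0's attention to key 0 and leaves query 1 unchanged. Comparing model outputs with and without such a mask is the general shape of a knockout experiment.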

2605.24622 2026-05-26 cs.RO cs.CV

PoseRefer: Pathway-Local Parameters for Semantically Grounded Reference Resolution

PoseRefer: 用于语义基础指代消解的通路-局部参数

Anna Deichler

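AI summary: Proposes PoseRefer, an architecture that decouples the pose and text pathways and uses frozen MiniLM category embeddings, reaching 31.9% top-1 accuracy on the MM-Conv dataset and showing that fusion accuracy can be confounded by category-representation artifacts.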

Comments ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

AI中文摘要

一个机器人解析“把杯子放在那个上面”必须融合手势、语言和场景几何,然而3D基础基准测试仅部分捕获了这一情况:描述是事后编写的,手势是模板化的,或者指向是为相机摆拍的。MM-Conv从二元VR交互中捕获自然的伴随语音手势,同时包含全身动作捕捉和3D场景图。我们使用它来评估姿态-语言融合,采用解耦的后期融合架构,其中姿态和文本通路不共享任何学习参数。这两个选择共同使得通过受控消融更容易隔离类别、姿态和文本的贡献。使用冻结的MiniLM类别嵌入的融合在每种指代类型上都超过了仅姿态和最佳文本通路,达到31.9%的top-1。学习到的标量门根据文本通路是否有类别访问权限而在相反策略之间切换。这是一个可靠性诊断:除非通路在架构上解耦,否则语义基础系统的融合准确性声明与类别表示伪影无法区分。

英文摘要

A robot resolving "put the cup on that one" must fuse gesture, language, and scene geometry, yet 3D grounding benchmarks only partially capture this regime: descriptions are written post-hoc, gestures are templated, or pointing is staged for the camera. MM-Conv captures natural co-speech gesture from dyadic VR interaction alongside full-body motion capture and 3D scene graphs. We use it to evaluate pose-language fusion with a decoupled late-fusion architecture in which pose and text pathways share no learned parameters. The two choices together make category, pose, and text contributions easier to isolate through controlled ablations. Fusion with frozen MiniLM category embeddings exceeds pose alone and the best text-only pathway on every reference type, reaching 31.9% top-1. The learned scalar gate flips between opposing policies depending on whether the text pathway has category access. This is a reliability diagnostic: fusion-accuracy claims for semantic grounding systems are indistinguishable from category-representation artifacts unless pathways are architecturally decoupled.

2605.24621 2026-05-26 cs.CV cs.AI cs.LG

Phase-Aware Wavelet-Based-Scattering Encoder-Decoder for Dense Predictions

相位感知的基于小波散射的编解码器用于密集预测

Ghassen Marrakchi, Basarab Matei

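AI summary: Proposes a phase-aware scattering encoder-decoder that recovers spatial structure by explicitly preserving phase information in skip connections, and evaluates the benefit of phase for dense prediction on image denoising and, in a preliminary study, skin lesion segmentation.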

Comments 21 pages, 16 figures, 10 tables

AI中文摘要

散射变换实现了Lipschitz稳定性和平移不变性,但密集预测任务需要保留在全局平均中丢失的空间结构。我们提出了相位感知散射编解码器,通过在跳跃连接中显式保留相位来恢复这些信息。在图像去噪(BSD68)上,打破平移不变性使PSNR提高了+2.17 dB;相位保留额外增加了+1.03 dB。一种新颖的空间洗牌消融实验(惩罚-1.26 dB)表明相位编码了位置依赖的结构。我们在第二个密集预测任务(ISIC皮肤病变分割)上进行了初步的可扩展性研究,完整的交叉验证正在进行中。这项工作推进了原则性的小波-深度学习集成,展示了相位信息如何在像素级预测中补充散射的稳定性-表达性权衡。

英文摘要

Scattering transforms achieve Lipschitz stability and translation invariance, but dense prediction tasks require preserving spatial structure lost in global averaging. We propose Phase-Aware Scattering Encoder-Decoder, which restores this information by explicitly preserving phase in skip connections. On image denoising (BSD68), breaking translation invariance improves PSNR by $+2.17$ dB; phase preservation adds $+1.03$ dB. A novel spatial shuffling ablation ($-1.26$ dB penalty) demonstrates that phase encodes location-dependent structure. We conduct a preliminary extensibility study on a second dense prediction task (ISIC skin lesion segmentation), with full cross-validation as ongoing work. This work advances principled wavelet-deep learning integration, showing how phase information complements scattering's stability-expressiveness trade-off in pixel-level prediction.
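For readers less used to decibel figures, the sketch below shows what a PSNR gain means in terms of mean squared error, using the standard definition PSNR = 10 log10(peak^2 / MSE). The helper names and the example MSE of 0.01 are illustrative assumptions, not values from the paper.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (gain_db / 10.0)

# A +2.17 dB gain corresponds to dividing the MSE by roughly 1.65.
ratio = mse_ratio_for_gain(2.17)

# Illustrative starting point: MSE 0.01 on images scaled to [0, 1] is 20 dB.
before = psnr(0.01)
after = psnr(0.01 / ratio)
```

Because the scale is logarithmic, the reported gains add in dB but multiply in MSE terms.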

2605.24614 2026-05-26 cs.CL cs.AI cs.LG

Measuring the Depth of LLM Unlearning via Activation Patching

通过激活修补测量大语言模型遗忘的深度

Jaeung Lee, Dohyun Kim, Jaemin Jo

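AI summary: Proposes the Unlearning Depth Score (UDS), which quantifies the mechanistic depth of unlearning via activation patching and achieves the highest faithfulness and robustness in a meta-evaluation on 150 unlearned models.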

Comments 18 pages

AI中文摘要

大语言模型遗忘已成为隐私保护和人工智能安全的关键事后机制,但审计目标知识是否真正被擦除仍然具有挑战性。现有的输出级指标无法检测到这些知识是否仍可从内部表示中恢复。最近的白盒研究揭示了此类残留知识,但通常依赖于辅助训练或数据集特定调整,缺乏可推广的指标。为解决这些限制,我们提出遗忘深度评分(UDS),一种通过激活修补量化遗忘机制深度的指标。UDS首先使用保留模型基线识别编码目标知识的层,然后在0-1尺度上测量遗忘模型中该知识被擦除的程度。在跨越8种方法的150个遗忘模型上的20个指标的元评估中,UDS实现了最高的忠实性和鲁棒性,证实了我们的因果方法是遗忘评估中最可靠的。案例研究进一步揭示,白盒指标可能在层级别上不一致,并且擦除深度因示例而异。我们提供了将UDS集成到现有基准测试框架并简化评估流程的指南。代码和数据可在https://github.com/gnueaj/unlearning-depth-score获取。

英文摘要

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score

2605.24613 2026-05-26 cs.CL cs.AI cs.SE

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

Guarded Repair: 面向危害感知的LLM数学推理事后替换

Haizhou Xia

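AI summary: Proposes GuardedRepair, a framework whose selective replacement mechanism repairs LLM mathematical reasoning errors while avoiding damage to correct results; on GSM8K it raises accuracy from 95.60% to 96.89% with no measured broken-correct cases.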

Comments 15 pages, including appendices. Code and artifacts available at https://github.com/Haizhoux0517/guarded-repair

AI中文摘要

LLM数学推理的事后修复引入了一种不对称风险:修复错误的推理轨迹是有用的,但替换原本正确的轨迹可能有害。我们在选择性替换设置下研究该问题,系统必须决定修复后的候选是否比保留原始缓存轨迹更安全。我们提出GuardedRepair,一种有保护的best-of-N修复框架,它诊断缓存推理轨迹,选择性触发修复,并仅在确定性验证守卫支持替换时才接受改变答案的候选。该框架结合了轻量级符号检查、表面语义风险诊断、有界候选生成和保守接受策略。在完整GSM8K测试集上,初始推理器已达到95.60%准确率,GuardedRepair将最终准确率提升至96.89%,修复了58个剩余错误中的17个,且主运行中未测量到破坏正确案例。在弱推理器ASDiv设置中,准确率从78.40%提升至87.60%。直接重新生成基线表明,这一增益不能仅由更强模型重新求解解释:重新求解所有GSM8K示例将准确率降至93.03%,并破坏了47个初始正确答案。额外分析表明,有保护修复显著改善了修复/破坏权衡,同时也揭示了替换风险被降低而非消除。这些结果支持将事后修复视为危害感知的选择性替换而非无约束的重新求解。

英文摘要

Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.
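The reported GSM8K percentages can be cross-checked with a little arithmetic. The sketch below assumes the standard GSM8K test split of 1,319 problems, which the abstract does not state. The figure of 13 fixes for the re-solving baseline is derived here from the reported numbers and is not itself reported.

```python
# Assumption: the full GSM8K test set has 1,319 problems (not in the abstract).
N = 1319

def accuracy(correct, total=N):
    """Accuracy in percent, rounded to two decimals as in the abstract."""
    return round(100.0 * correct / total, 2)

initial_errors = 58
initial_correct = N - initial_errors      # 1261 correct before repair
guarded_correct = initial_correct + 17    # 17 fixed, none broken

# Re-solving baseline: 47 initially correct answers are broken; the number
# of fixes is inferred from the reported 93.03% final accuracy.
resolve_correct = round(0.9303 * N)
resolve_fixed = resolve_correct - (initial_correct - 47)

print(accuracy(initial_correct), accuracy(guarded_correct), resolve_fixed)
```

Under that assumption the numbers are mutually consistent: 1,261 correct gives 95.60%, 1,278 gives 96.89%, and the re-solving baseline would have to fix about 13 errors while breaking 47 to land on 93.03%, a net loss of 34 answers.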

2605.24611 2026-05-26 cs.LG

Beyond Fixed Points: Superpolynomial Capacity of Asymmetric Hopfield Networks

超越不动点:非对称Hopfield网络的超多项式容量

Aakash Kumar, Anatoly Khina, Frederik Mallmann-Trenn, Emanuele Natale

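AI summary: Combining combinatorics, number theory, and the analysis of opinion dynamics, the paper constructs superpolynomially many limit-cycle attractors in the classical synchronous asymmetric Hopfield network, each with a superpolynomially long period and robust to noise, giving the first demonstration of superpolynomial capacity for asymmetric Hopfield networks.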

AI中文摘要

经典Hopfield网络由于对称权重而局限于静态模式,而非对称网络可以通过极限环吸引子编码时间序列。然而,在经典的同步非对称网络中实现长序列的高容量存储仍然是一个挑战。我们在具有二进制神经元和同步更新的经典非对称Hopfield模型中提出了一种简单且鲁棒的构造,使得$n$个神经元能够支持$\exp\!\big(\Omega(n/(\log n)^2)\big)$个不同的极限环吸引子,每个吸引子的周期为$\exp\!\big(\Omega(\sqrt n/\log n)\big)$,并且对翻转概率高达$\frac12-o(1)$的随机噪声具有鲁棒性,从而在存储序列的数量和长度上实现了超多项式容量。这是首次展示非对称Hopfield网络的这种容量,我们通过结合组合数学、数论和观点动力学的分析得到了这一结果。我们的发现表明,同步非对称Hopfield网络具有比先前认识到的更大且更鲁棒的序列记忆容量,证明在生物和人工神经系统中,鲁棒的序列表示可以通过粗糙的结构模式而非复杂的非线性来实现。

英文摘要

Classical Hopfield networks are limited to static patterns due to symmetric weights, whereas asymmetric networks can encode temporal sequences via limit-cycle attractors. Achieving high-capacity storage of long sequences in classical synchronous asymmetric networks, however, has remained a challenge. We present a simple and robust construction within the classical asymmetric Hopfield model with binary neurons and synchronous updates, that allows $n$ neurons to support $\exp\!\big(\Omega(n/(\log n)^2)\big)$ distinct limit-cycle attractors, each with period $\exp\!\big(\Omega(\sqrt n/\log n)\big)$ and robust to random noise with flip probability up to $\frac12-o(1)$, yielding superpolynomial capacity in both the number and length of stored sequences. This is the first demonstration of such capacity for asymmetric Hopfield networks, which we obtain by combining results from combinatorics, number theory and the analysis of opinion dynamics. Our findings show that synchronous asymmetric Hopfield networks possess a sequence-memory capacity which is larger and more robust than previously recognized, demonstrating that, in both biological and artificial neural systems, robust sequence representation can be achieved through coarse architectural motifs rather than complex nonlinearities.
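As background for how asymmetric weights produce a limit cycle, here is a minimal sketch of the textbook sequence-storage rule, in which each pattern is wired to its successor. It is not the paper's construction and says nothing about its capacity or noise bounds; the three mutually orthogonal patterns and the function names are chosen purely for illustration.

```python
def sign(x):
    """Binary threshold neuron; ties are resolved to +1."""
    return 1 if x >= 0 else -1

def sequence_weights(patterns):
    """Asymmetric weights W = (1/n) * sum_mu p[mu+1] p[mu]^T, cyclically."""
    n, k = len(patterns[0]), len(patterns)
    W = [[0.0] * n for _ in range(n)]
    for mu in range(k):
        src, dst = patterns[mu], patterns[(mu + 1) % k]
        for i in range(n):
            for j in range(n):
                W[i][j] += dst[i] * src[j] / n
    return W

def step(W, state):
    """One synchronous update: every neuron is updated at the same time."""
    return [sign(sum(w * s for w, s in zip(row, state))) for row in W]

# Three mutually orthogonal +/-1 patterns (illustrative choice).
patterns = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
W = sequence_weights(patterns)

state = patterns[0]
trajectory = [state]
for _ in range(3):
    state = step(W, state)
    trajectory.append(state)
```

Because the patterns are orthogonal, each update maps a stored pattern exactly onto the next one, so the network cycles with period 3 instead of settling at a fixed point. The weight matrix is not symmetric, which is what permits this behaviour.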

2605.24608 2026-05-26 cs.AI cs.CV cs.LG

Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

基于数学形态学的深度卷积学习的格论与代数模型

Gustavo, Angulo

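AI summary: Builds a rigorous algebraic framework for deep convolutional architectures (CNN, ResNet, UNet) on lattice theory and mathematical morphology, shows that the standard CNN pipeline is a cross-lattice operator, and identifies three layer designs that are genuine idempotent openings.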

AI中文摘要

我们为深度卷积架构(包括CNN、ResNet和如UNet的编码器-解码器网络)建立了一个严格的代数框架,该框架基于格论和数学形态学。核心工具是Matheron-Maragos-Banon-Barrera (MMBB) 平移不变算子通用表示理论,我们将其系统地应用于标准深度网络的每一层。主要发现是:标准CNN流水线(线性卷积 + ReLU + 平坦最大池化)是一个交叉格算子:卷积是傅里叶下半格中的腐蚀,ReLU是格并闭包,最大池化是逐点最大加格中的膨胀,它们的组合既不是形态学开运算也不是闭运算。第二个发现是:ReLU在逐点格中的上伴随是一个全局(非局部)算子,在全局非负函数上为恒等映射,否则为负无穷,因此没有局部形态学腐蚀能与ReLU构成伴随对。这两个结果共同提供了深度在标准CNN中引入真正表示能力的精确代数原因:组合层不是幂等的。我们识别并完全刻画了三种真正的幂等开运算层设计:纯最大加形态学层(逐点格)、谱维纳层(傅里叶格)和自对偶形态学层。我们建立了完整的不动点和收敛理论。该框架还将最大池化、步长卷积和拉普拉斯金字塔统一在Goutsias-Heijmans伴随金字塔理论下,并给出了激活-池化膨胀(APD)分解及其正确的伴随算子。

英文摘要

We develop a rigorous algebraic framework for deep convolutional architectures (CNNs, ResNets, and encoder-decoder networks such as UNet), grounded in lattice theory and mathematical morphology. The central tool is the Matheron-Maragos-Banon-Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution $+$ ReLU $+$ flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and $-\infty$ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias-Heijmans adjoint pyramid theory, and gives the Activation-Pooling Dilation (APD) factorisation with its correct adjoint.
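The contrast between an idempotent opening and a non-idempotent CNN-style layer can be seen on a one-dimensional toy signal. The sketch below uses flat erosion and dilation with a window of radius 1; the `toy_layer` function, with a first-difference filter standing in for a learned linear convolution, is a hypothetical example and not one of the paper's layer designs.

```python
def erode(f, r=1):
    """Flat erosion: sliding minimum over a window of radius r."""
    return [min(f[max(0, i - r): i + r + 1]) for i in range(len(f))]

def dilate(f, r=1):
    """Flat dilation: sliding maximum (this is stride-1 max-pooling)."""
    return [max(f[max(0, i - r): i + r + 1]) for i in range(len(f))]

def opening(f, r=1):
    """Morphological opening: erosion followed by its adjoint dilation."""
    return dilate(erode(f, r), r)

def relu(f):
    return [max(0, x) for x in f]

def toy_layer(f):
    """Hypothetical CNN-style layer: linear filter, then ReLU, then max-pool."""
    diff = [f[i] - (f[i - 1] if i > 0 else 0) for i in range(len(f))]
    return dilate(relu(diff))

signal = [0, 3, 1, 4, 4, 2, 5, 0]
opened = opening(signal)
layer_once = toy_layer(signal)
layer_twice = toy_layer(layer_once)
```

Applying `opening` a second time to `opened` changes nothing, whereas `layer_twice` differs from `layer_once`. That is the property the abstract points to: stacking an idempotent layer adds nothing, while stacking a non-idempotent one keeps transforming the signal.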

2605.24604 2026-05-26 cs.CV

LC-Flow: Learning Local Continuous Optical Flow and Confidence from events

LC-Flow: 从事件中学习局部连续光流与置信度

Gunwoo Jeon, Chaesong Park, Jongwoo Lim

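AI summary: Proposes LC-Flow, the first learning-based method for estimating temporally continuous optical flow from local events; a continuous local recurrent network and a jointly learned confidence score address event sparsity and the aperture problem. It is state of the art among local methods on MVSEC and DSEC, and its confidence-guided aggregation surpasses frame-based methods on MVSEC.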

AI中文摘要

事件相机以微秒分辨率异步捕捉亮度变化,但现有光流方法未能充分利用这种时间连续性。基于帧的方法引入人工累积延迟并遭受领域过拟合,而基于模型的局部方法无状态运行,丢弃预测间的时间历史,产生不准确的光流。我们提出LC-Flow,首个时间连续的、基于学习的光流估计器,完全从局部事件操作。其核心是一个连续局部循环网络,为每个空间网格维护持久隐藏状态,随着事件到达逐步累积时间上下文。与受限于固定累积窗口的基于帧的方法不同,也与每一步从头重新计算运动的无状态基于模型的方法不同,LC-Flow在任意时间戳上生成具有完整运动历史的稀疏局部光流估计。为了解决局部观测固有的歧义性,我们联合学习一个置信度分数,量化每个预测的可靠性,明确处理事件稀疏性和孔径问题。该置信度具有双重作用:为下游任务(如视觉里程计)过滤不可靠估计,并为多尺度置信度引导的聚合提供有原则的权重,从稀疏局部输出重建全局一致的光流。LC-Flow在MVSEC和DSEC上均达到局部方法的最优性能,而置信度引导的聚合在MVSEC基准上建立了新的总体最优,超越了依赖全局空间先验的重型基于帧的网络。

英文摘要

Event cameras capture brightness changes asynchronously with microsecond resolution, yet existing optical flow methods fail to fully exploit this temporal continuity. Frame-based approaches impose artificial accumulation latency and suffer from domain overfitting, while model-based local methods operate statelessly, discarding temporal history between predictions and yielding inaccurate flows. We propose LC-Flow, the first temporally continuous, learning-based optical flow estimator that operates purely from local events. At its core, a Continuous Local Recurrent Network maintains persistent hidden states per spatial grid, incrementally accumulating temporal context as events arrive. Unlike frame-based methods constrained to fixed accumulation windows, and unlike stateless model-based methods that recompute motion from scratch at each step, LC-Flow produces sparse local flow estimates at arbitrary timestamps with full motion history. To address the inherent ambiguity of local observations, we jointly learn a confidence score that quantifies the reliability of each prediction, explicitly handling event sparsity and the aperture problem. This confidence serves a dual role: filtering unreliable estimates for downstream tasks such as visual odometry, and providing principled weights for a multi-scale confidence-guided aggregation that reconstructs globally consistent flow from the sparse local outputs. LC-Flow achieves state-of-the-art performance among local methods on both MVSEC and DSEC, while the confidence-guided aggregation establishes a new overall state-of-the-art on the MVSEC benchmark, surpassing heavy frame-based networks that rely on global spatial priors.
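The dual role of the confidence score, filtering and weighting, can be illustrated with a single-scale weighted mean. This is only a sketch of the general idea: the paper's aggregation is multi-scale and learned, and the function name, the `(u, v, confidence)` tuple layout, and the threshold parameter are assumptions made for this example.

```python
def aggregate_flow(estimates, min_conf=0.0):
    """Confidence-weighted mean of local flow estimates.

    estimates: iterable of (u, v, confidence) tuples (hypothetical layout).
    Estimates with confidence below min_conf are filtered out first.
    Returns None when no usable estimate remains.
    """
    kept = [(u, v, c) for u, v, c in estimates if c >= min_conf]
    total = sum(c for _, _, c in kept)
    if total == 0:
        return None
    u_mean = sum(u * c for u, _, c in kept) / total
    v_mean = sum(v * c for _, v, c in kept) / total
    return (u_mean, v_mean)

# Two plausible local estimates and one outlier with zero confidence.
local = [(1.0, 0.0, 0.75), (3.0, 2.0, 0.25), (100.0, 100.0, 0.0)]
fused = aggregate_flow(local)
strict = aggregate_flow(local, min_conf=0.5)
```

Here `fused` is `(1.5, 0.5)`: the zero-confidence outlier contributes nothing. With `min_conf=0.5` only the first estimate survives, so `strict` is `(1.0, 0.0)`.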

2605.24603 2026-05-26 cs.CL cs.LG

CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer

CSP-Atlas: 稀疏Python Transformer中的概念特异性神经回路

Piotr Wilam

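AI summary: By extracting concept-specific neural circuits for 106 Python concepts, the paper finds that the model's internal organization follows computational structure rather than semantic category, and identifies an atomicity super-cluster.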

Comments Code: https://github.com/piotrwilam/AtlasCSP

AI中文摘要

一个稀疏的8层代码Transformer为每个测试的Python构造开发了专用的神经回路,并且这些回路按照清晰的计算原则而非语义类别进行组织。我们通过边缘化63,800个受控提示,提取了106个概念(43个AST节点类型,63个内置对象)的神经回路,并使用对比检查提示(呈现一个关键字标记而不带其关联的句法结构)将每个回路分解为概念特异性和标记驱动组件。出现了三个发现。首先,所有106个概念在九个参数设置中的每一个都产生非空的通用回路,并且跨构造的概念特异性排名在扫描中保持稳定——存活不是宽松阈值的伪影。其次,AST回路包含一个与标记激活不同的真正概念组件:在中间到后期层,仅概念神经元占最强烈激活神经元的比例高达62.5%,而内置回路几乎完全由标记驱动。第三,六个计算上原子的构造——Import、ImportFrom、Break、Continue、Pass、Assert——尽管在语义上不相关,却聚集在一起,仅共享作为不需要嵌套体的单语句构造的属性;这个原子性超簇,以及由标记歧义性和结构独特性组织的四层层次结构,表明模型的内部组织追踪计算结构而非含义。方法、完整分解数据和分析代码已发布。

英文摘要

A sparse 8-layer code transformer develops dedicated neural circuitry for every Python construct tested, and that circuitry is organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types, 63 builtin objects) by marginalising across 63,800 controlled prompts, and decompose each circuit into concept-specific and token-driven components using contrastive checker prompts that present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts produce non-empty universal circuits at every one of nine parameter settings, and the ranking of concept-specificity across constructs is stable across the sweep - survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: concept-only neurons constitute up to 62.5% of the loudest-firing neurons at mid-to-late layers, while builtin circuits are almost entirely token-driven. Third, six computationally atomic constructs - Import, ImportFrom, Break, Continue, Pass, Assert - cluster together despite being semantically unrelated, sharing only the property of being single-statement constructs requiring no nested body; this atomicity super-cluster, together with a four-tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model's internal organisation tracks computational structure rather than meaning. The methodology, full decomposition data, and analysis code are released.

2605.24600 2026-05-26 cs.AI

Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Agent-as-Peer-Debriefer: 一种基于视角精炼的多智能体定性分析框架

Zhimin Lin, Kun Cheng, Fan Bai, Jie Gao

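AI summary: Proposes a multi-agent framework that simulates peer debriefing with three analytical perspectives (Theory-Driven, Data-Driven, and Applied) to improve the coding quality of LLMs in qualitative data analysis.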

AI中文摘要

大型语言模型(LLMs)越来越多地被用于定性数据分析(QDA),但其输出往往缺乏人类分析的深度和细微差别。我们认为这一差距反映了人类QDA中缺失的一种可信度实践:同行汇报(peer debriefing),即分析师向无偏见的同行寻求反馈并据此完善其编码。为了将这一实践引入LLM辅助的QDA,我们提出了Agent-as-Peer-Debriefer,一个将同行汇报构建到关键编码步骤中的多智能体QDA框架。在我们的框架中,层次编码代理遵循标准QDA流程生成代码、子主题和主题,以及自我解释和反思备忘录。然后,它将输出共享给三个同行汇报代理,每个代理应用不同的分析视角(理论驱动、数据驱动或应用),并通过保留、重命名、重新分配、合并或拆分代码来完善代码。这些视角来源于跨领域和数据集通用的已建立的人类QDA实践。为了评估该框架,我们在三个LLM上对两个领域的三个数据集进行了测试,测量与人工标注代码的语义相似度。在所有设置中,基于视角的同行汇报精炼比单一LLM基线更接近人类代码,消融实验进一步表明,这种提升不仅仅来自额外的精炼。三种视角也产生了不同的权衡,表明视角的选择是一个有意义且可控的设计决策。更广泛地说,这些发现表明,用明确的视角模拟同行汇报是实现更可信的LLM辅助QDA的一条有前景的途径。

英文摘要

Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.

2605.24598 2026-05-26 cs.AI cs.MA

Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

Hera: 面向设备-云协作LLM智能体的长时程协调学习

Yuxin Zhang, Mengxue Hu, Zheng Lin, Xiaoyi Fan, Fan Xie, Zihan Fang, Jing Yang, Wenjun Zhu, Zhiwen Chen, Chengfei Lv, Zhe Chen

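AI summary: Proposes Hera, a step-level device-cloud LLM agent coordinator that uses two-stage training (imitation learning followed by reinforcement learning) to improve the performance-cost Pareto frontier on long-horizon tasks.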

AI中文摘要

大型语言模型(LLM)智能体通过自主与环境交互,擅长解决复杂的长期任务。然而,它们的实际部署面临根本性的设备-云困境:设备端模型高效但脆弱,而云端模型强大但计算成本高。最先进的LLM设备-云路由器通常做出粗粒度的任务级决策,无法适应多步智能体交互中变化的难度。为解决此问题,我们提出Hera,一种用于长期任务的步骤级设备-云LLM智能体协调器,实现了强大的性能-成本帕累托前沿。Hera采用新颖的两阶段训练范式:(1)冷启动的模仿学习,随后(2)联合优化任务成功率和云端使用效率的强化学习。第一阶段将步骤级路由视为监督分类问题:设备智能体在云端轨迹上重放,每个状态根据设备与云端动作的一致性进行标记。第二阶段,我们通过跨轨迹分组相同状态并使用偏好更高期望回报和更少未来云端调用的标签更新Hera,进行成本感知的强化学习。我们在ALFWorld、WebShop和AppWorld上评估Hera,它始终优于先前方法,在仅46.3%的步骤中使用云端的情况下,达到了云端单独成功率的92.5%。

英文摘要

Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device-cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device-cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device-cloud LLM agent coordinator for long-horizon tasks achieving a strong performance-cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
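The first-stage labeling described above (replay the device agent on a cloud trajectory and label each state by whether the two agents agree) can be sketched as follows. This is a minimal illustration, not the authors' code: the label strings `"device"` and `"cloud"`, the function names, and the toy states and actions are all hypothetical.

```python
def label_routing_states(cloud_trajectory, device_policy):
    """Label each state of a cloud trajectory for step-level routing.

    cloud_trajectory: list of (state, cloud_action) pairs.
    device_policy: callable mapping a state to the device agent's action.
    A state is labeled "device" when the device agent reproduces the cloud
    action, and "cloud" otherwise (label names are assumptions).
    """
    labels = []
    for state, cloud_action in cloud_trajectory:
        agrees = device_policy(state) == cloud_action
        labels.append((state, "device" if agrees else "cloud"))
    return labels

def cloud_usage(labels):
    """Fraction of steps routed to the cloud."""
    return sum(1 for _, route in labels if route == "cloud") / len(labels)

# Toy shopping episode: the device agent disagrees on one of four steps.
trajectory = [("s0", "open"), ("s1", "search"), ("s2", "click"), ("s3", "buy")]
device_actions = {"s0": "open", "s1": "scroll", "s2": "click", "s3": "buy"}
labels = label_routing_states(trajectory, device_actions.get)
usage = cloud_usage(labels)
```

In this toy episode only step `s1` needs the cloud, so `usage` is 0.25. The resulting `(state, label)` pairs are the kind of supervised data a routing classifier could be trained on before the reinforcement learning stage.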

2605.24597 2026-05-26 cs.AI cs.CL cs.LG

Learning to Reason Efficiently with A* Post-Training

Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell

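AI summary: Uses the A* search algorithm to guide LLMs toward generating correct and efficient reasoning steps, proposing two training methods (supervised fine-tuning and reinforcement learning) that substantially improve reasoning accuracy and efficiency on 1B-3B parameter models.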

Comments: Preprint

English abstract

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search, an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B-3B range benefit substantially from A* post-training, going from near-zero accuracy to outperforming DeepSeek-V3.2, a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
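For readers unfamiliar with the search algorithm the abstract builds on, the following is a minimal Python sketch of textbook A* on a toy graph. It is not the paper's proof-search setup: the graph, node names (`premise`, `lemma_a`, and so on), unit step costs, and the `steps_to_goal` heuristic are all invented for illustration. The optimality guarantee mentioned in the abstract applies to A* when the heuristic is admissible, that is, when it never overestimates the remaining cost.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def a_star(
    graph: Graph,
    start: str,
    goal: str,
    heuristic: Callable[[str], float],
) -> Optional[Tuple[List[str], float]]:
    """Return (path, cost) for a cheapest path from start to goal, or None."""
    # Frontier entries are (f = g + h, g, node, path so far).
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_g = {start: 0.0}  # cheapest known cost to reach each node
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for neighbor, step_cost in graph.get(node, []):
            new_g = g + step_cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + heuristic(neighbor), new_g, neighbor, path + [neighbor]),
                )
    return None


# Toy "derivation" graph: each edge is one inference step of cost 1.
toy_graph: Graph = {
    "premise": [("lemma_a", 1.0), ("lemma_b", 1.0)],
    "lemma_a": [("goal", 1.0)],
    "lemma_b": [("lemma_c", 1.0)],
    "lemma_c": [("goal", 1.0)],
}

# Admissible heuristic for this toy graph: exact number of steps remaining.
remaining = {"premise": 2, "lemma_a": 1, "lemma_b": 2, "lemma_c": 1, "goal": 0}


def steps_to_goal(node: str) -> float:
    return remaining[node]


result = a_star(toy_graph, "premise", "goal", steps_to_goal)
print(result)  # (['premise', 'lemma_a', 'goal'], 2.0)
```

In this toy run the search never expands `lemma_b`, since its estimated total cost is higher than the path through `lemma_a`. The order in which nodes are expanded is the kind of execution trace that supervised fine-tuning on A* traces would presumably draw on, though how the paper serializes such traces is not stated in the abstract.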