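arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.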

2605.24680 2026-05-26 cs.LG

Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data

Tomer Lavi, Bracha Shapira, Nadav Rappoport

AI summary: Proposes the Trajectory-based Difficulty Score (TDS), which estimates per-instance difficulty by analyzing the per-tree cumulative prediction trajectories of gradient-boosted trees. It outperforms existing baselines on classification, remains competitive on regression, and supports applications such as active learning, selective prediction, and conformal prediction.

Abstract

Gradient-boosted trees achieve strong performance on tabular data, yet often leave a long tail of poorly predicted instances. We introduce a Trajectory-based Difficulty Score (TDS), an instance-level difficulty estimator for boosted ensembles derived from per-tree cumulative prediction trajectories. For each instance, we compute interpretable trajectory descriptors (e.g., variance, oscillation peaks, sign switches, and tail stability) and train a lightweight regression model to predict held-out loss. An empirical CDF calibrates the resulting signal into a score in $[0,1]$ that supports ranking hard cases. Across diverse tabular benchmarks and ensemble sizes, TDS exhibits strong rank correlation with error and outperforms established instance-hardness and uncertainty baselines on classification, while remaining competitive on regression. We then show how a single difficulty signal improves multiple data mining workflows: difficulty-driven active learning for label-efficient training, difficulty-thresholded selective prediction for improved risk-coverage trade-offs, and TDS-stratified (Mondrian) conformal prediction for more uniform conditional coverage. Finally, clustering high-TDS instances using SHAP attributions reveals coherent failure modes characterized by compact feature-value ranges, supporting error analysis and targeted data acquisition.
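The trajectory descriptors and the empirical-CDF calibration described in the abstract can be illustrated with a minimal sketch. The function names, the exact descriptor definitions, and the sample increments below are assumptions made for illustration, not the authors' implementation; the lightweight regression model that the paper trains on these descriptors to predict held-out loss is omitted.

```python
from bisect import bisect_right

def trajectory_descriptors(increments, tail=3):
    """Describe one instance's cumulative prediction path across boosting rounds."""
    # Cumulative prediction after each tree (running sum of per-tree outputs).
    trajectory = []
    total = 0.0
    for step in increments:
        total += step
        trajectory.append(total)
    mean = sum(trajectory) / len(trajectory)
    variance = sum((v - mean) ** 2 for v in trajectory) / len(trajectory)
    # Sign switches: how often the cumulative margin crosses zero.
    sign_switches = sum(
        1 for a, b in zip(trajectory, trajectory[1:]) if (a > 0) != (b > 0)
    )
    # Tail stability: spread of the last few cumulative predictions (smaller is steadier).
    last = trajectory[-tail:]
    tail_range = max(last) - min(last)
    return {"variance": variance, "sign_switches": sign_switches, "tail_range": tail_range}

def ecdf_score(value, reference):
    """Map a raw difficulty signal into [0, 1] via the empirical CDF of reference values."""
    ordered = sorted(reference)
    return bisect_right(ordered, value) / len(ordered)

# Hypothetical per-tree outputs: one instance converges, the other keeps oscillating.
easy = trajectory_descriptors([0.5, 0.3, 0.1, 0.05, 0.02])
hard = trajectory_descriptors([0.5, -0.8, 0.6, -0.5, 0.4])
```

In this toy setting the oscillating instance shows more sign switches and a wider tail range than the converging one, which is the kind of signal a downstream difficulty regressor could pick up.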

2605.24679 2026-05-26 cs.CV

MindAdapter: Few-Shot Parameter-Efficient Residual Calibration of Cross-Subject Brain-to-Visual Decoding Models

Jiaxiang Liu, Jiawei Du, Xupeng Chen, Guoqi Li, Jiang Cai, Simon Fong, Mingkun Xu

AI summary: Proposes MindAdapter, a framework that achieves few-shot, parameter-efficient calibration for cross-subject brain-to-visual decoding through decoupled linear-residual cascade alignment and a topology-anchored dual-stream manifold constraint.

Comments Accepted to KDD 2026 (AI4Sciences Track). 15 pages, 7 figures

Abstract

Cross-subject brain-to-visual decoding remains a core challenge in brain-computer interfaces due to severe inter-individual variability that induces systematic subject-specific functional misalignment. To address this issue, we propose MindAdapter, a parameter-efficient few-shot calibration framework for pretrained brain-to-visual decoding models. MindAdapter adopts a decoupled linear-residual cascade alignment paradigm by freezing a pretrained explicit brain functional alignment backbone (coarse) and introducing a lightweight nonlinear residual adapter (fine), thereby disentangling global cross-subject correspondence from subject-specific residual corrections for fine-grained spatial and semantic calibration. To further preserve global representational stability, we design a topology-anchored dual-stream manifold constraint, where a small set of shared stimuli serves as topological pins with voxel-level paired supervision, while a semantic stream enforces consistency through a frozen vision-language decoder on unpaired brain data. Together, MindAdapter efficiently injects subject-specific corrections while maintaining the global representational geometry learned during pretraining. Experiments on the Natural Scenes Dataset (NSD) demonstrate that MindAdapter substantially improves cross-subject visual reconstruction and retrieval accuracy using only a few shared stimuli, offering a practical and data-efficient solution for personalized brain-to-visual decoding.

2605.24675 2026-05-26 cs.CV cs.AI

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

AI summary: To address the visual representation gap in Web image translation, proposes VaaWIT, which uses a Dual-Stream Attention Module and a Visual-Aware Adapter to dynamically fuse fine-grained visual features into a large language model. It outperforms open-source models on multiple benchmarks and approaches the performance of proprietary models.

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3817631

Abstract

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

2605.24674 2026-05-26 cs.CV

Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

Yan Li, Lin Liu, Xiaopeng Zhang, Qi Tian

AI summary: To address undifferentiated conditioning signals and insufficiently supervised cross-attention in instruction-based video editing, proposes RVEDiT, which achieves coarse-to-fine editing and regularizes internal reasoning through Granularity-Routed Token Conditioning and Reference-Anchored Attention Alignment.

Abstract

Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

2605.24667 2026-05-26 cs.AI cs.LG

When Mean CE Fails: Median CE Can Better Track Language Model Quality

Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang

AI summary: Finds that median cross-entropy reflects a language model's task performance during training better than mean cross-entropy, and recommends reporting several percentile cross-entropy values in evaluation.

Comments 20 pages

Abstract

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
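The recommendation to report percentile summaries alongside the mean can be sketched as follows. The per-token loss values are hypothetical and chosen only to show how a heavier tail can raise the mean while the median falls; the helper name `ce_summary` is likewise an assumption, not something taken from the paper.

```python
import statistics

def ce_summary(per_token_ce):
    """Summarize a per-token cross-entropy sample by mean and selected percentiles."""
    # quantiles(n=100) returns the 99 percentile cut points p1..p99.
    cuts = statistics.quantiles(per_token_ce, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(per_token_ce),
        "p50": statistics.median(per_token_ce),
        "p90": cuts[89],
        "p99": cuts[98],
    }

# Hypothetical per-token losses: checkpoint B has a sharper bulk but a heavier tail.
checkpoint_a = [1.0] * 90 + [3.0] * 10
checkpoint_b = [0.5] * 90 + [12.0] * 10
a, b = ce_summary(checkpoint_a), ce_summary(checkpoint_b)

# Mean and median disagree on which checkpoint is better.
mean_prefers_a = a["mean"] < b["mean"]
median_prefers_b = b["p50"] < a["p50"]
```

When the two flags disagree like this, the percentile rows show where the disagreement comes from: the bulk improved while the tail grew.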

2605.24661 2026-05-26 cs.AI cs.CL

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

Ali Şenol, Garima Agrawal, Huan Liu

AI summary: Proposes a behavior-based multi-dimensional framework that evaluates LLM reasoning quality along six dimensions (correctness, consistency, robustness, logical coherence, efficiency, and stability), revealing behaviors that accuracy alone cannot expose and supporting deployment decisions.

Abstract

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS-CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.

2605.24659 2026-05-26 cs.LG

IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization

Zixuan Chen, Jiaxiang Chen, Li Luo, Ke Xu, Xiaoxiang Huang, Tanfeng Sun, Xinghao Jiang

AI summary: Proposes IterInject, a framework that iteratively optimizes adversarial payloads with a rule-based diagnoser and an LLM-based optimizer to mount indirect prompt injection attacks on LLM agents. It substantially outperforms existing methods on multiple benchmarks and a production system, and reveals an attention-mediated threshold mechanism.

Comments Submitted to EMNLP 2026

Abstract

LLM-based agents are increasingly deployed for complex tasks requiring planning, tool use, and interaction with external services. Their reliance on untrusted external content exposes them to indirect prompt injection (IPI), in which adversarial instructions embedded in retrieved data hijack agent behavior. Existing attacks rely on static payloads that cannot adapt to agent-specific defenses; even recent adaptive methods lack structured feedback to guide optimization. We introduce IterInject, a feedback-guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule-based diagnoser produces structured outcome labels with behavioral descriptions, and an LLM-based optimizer refines payloads conditioned on the full optimization history. A synthesis step generates new disguise seeds from failure patterns, enabling the strategy space to self-evolve. On AgentDojo and InjectAgent, IterInject substantially outperforms static baselines and existing adaptive methods across four victim models. Extension experiments on Claude Code, a production-grade coding agent with layered defenses, show that optimized payloads achieve full success on 5 of 9 targets; even those that resist full exploitation exhibit measurable improvement from iterative refinement. We further present a mechanistic analysis of IPI, identifying an attention-mediated threshold mechanism in mid-to-late layers; three causal interventions validate this finding and point to concrete defense directions.

2605.24658 2026-05-26 cs.LG

WLNO: Wavelet-Laplace Neural Operator for Solving Partial Differential Equations

Muhammad Abid, Arth Sojitra, Omer San

AI summary: Proposes WLNO, which fuses Haar wavelet multi-scale spatial decomposition with the pole-residue formulation of the Laplace Neural Operator. It outperforms LNO on five benchmark PDE problems and is strongest on problems with pronounced spatial multi-scale structure.

Abstract

This work introduces the Wavelet-Laplace Neural Operator (WLNO), a novel neural operator that fuses Haar wavelet multi-scale spatial decomposition with the Laplace-domain pole-residue formulation of the Laplace Neural Operator (LNO). While LNO captures transient and steady-state dynamics through learnable system poles and residues, it lacks an explicit mechanism for extracting spatially localized multi-scale features inherent in complex PDE solutions. WLNO addresses this by augmenting the LNO core with a parallel single-level Haar discrete wavelet transform (DWT) branch that decomposes the lifted feature map into four frequency subbands: approximation (LL), horizontal detail (LH), vertical detail (HL), and diagonal detail (HH), and applies independent learned $1\times1$ convolutions to each subband before reconstruction via the inverse DWT. The two branches are fused through a learnable sigmoid-gated weight $\alpha_\mathrm{wav}$, initialized to give a small initial contribution to the wavelet branch, allowing the model to adaptively balance Laplace-domain dynamics against spatial multi-scale features throughout training. WLNO is evaluated against LNO on five benchmark PDE problems using identical hyperparameters, training data, and evaluation protocols: the diffusion equation, the Burgers equation, the reaction-diffusion system, Darcy flow, and the two-dimensional Navier-Stokes equation. WLNO consistently outperforms LNO on all five problems, with the most pronounced improvement on problems with strong spatial multi-scale structure, such as the Burgers equation with sharp shock fronts and the Navier-Stokes equation with coherent vortical structures, while remaining consistent across smoother and elliptic problems. These results demonstrate that wavelet-based multi-scale spatial decomposition is a principled and effective complement to Laplace-domain operator learning.
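The wavelet branch described above (single-level Haar DWT, a per-subband learned map, inverse DWT, then a sigmoid-gated fusion) can be sketched on a single-channel grid. Several choices here are assumptions rather than details from the abstract: the orthonormal normalization, which subband is labelled LH versus HL, the use of one scalar per subband in place of a learned 1x1 convolution, and the additive form of the fusion.

```python
import math

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of a grid with even height and width."""
    rows, cols = len(x) // 2, len(x[0]) // 2
    ll, lh, hl, hh = ([[0.0] * cols for _ in range(rows)] for _ in range(4))
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # approximation
            lh[i][j] = (a + b - c - d) / 2  # top-vs-bottom difference (assumed LH)
            hl[i][j] = (a - b + c - d) / 2  # left-vs-right difference (assumed HL)
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild each 2x2 block from its four coefficients."""
    rows, cols = len(ll), len(ll[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, t, u, v = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + t + u + v) / 2
            x[2 * i][2 * j + 1] = (s + t - u - v) / 2
            x[2 * i + 1][2 * j] = (s - t + u - v) / 2
            x[2 * i + 1][2 * j + 1] = (s - t - u + v) / 2
    return x

def wavelet_branch(x, scales):
    """Scale each subband independently, then reconstruct (scalar stand-in for 1x1 convs)."""
    bands = haar_dwt2(x)
    scaled = [[[w * v for v in row] for row in band] for band, w in zip(bands, scales)]
    return haar_idwt2(*scaled)

def fuse(laplace_out, wavelet_out, alpha_wav):
    """Add the wavelet branch with weight sigmoid(alpha_wav); additive form is an assumption."""
    gate = 1.0 / (1.0 + math.exp(-alpha_wav))
    return [[p + gate * q for p, q in zip(pr, qr)] for pr, qr in zip(laplace_out, wavelet_out)]
```

A negative initial `alpha_wav` (for example -4.0) gives a gate near zero, which matches the stated idea of starting with a small wavelet contribution and letting training adjust it.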

2605.24657 2026-05-26 cs.AI cs.SE

Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

Simon Dennis, Kevin Shabahang, Hao Guo, Rivaan Patil

AI summary: To address the fact that user knowledge does not persist under inference-only deployment of large language models, proposes consolidating interaction knowledge into model weights through nightly reflection, synthesis, and LoRA fine-tuning. Experiments show knowledge retention 43.6 percentage points higher than under cascading compaction.

Comments 15 pages

Abstract

Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corrections, and project context, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into model weights via reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 ± 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), while consolidation retains 80.4 ± 1.3%, a 43.6 pp gain (paired t(9) = 14.8, p < 0.001) that more than doubles what compaction preserves, with the largest gains on procedural corrections (36.3% -> 74.6%) and episodic project facts (31.5% -> 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51) while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone

AI summary: Uses linear probing to quantify the "geometric gap" between vision-language-action models (VLAs) and geometric foundation models (GFMs), compares three architectures for injecting geometric information, and studies how non-architectural factors affect geometric VLA performance.

Abstract

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

2605.24639 2026-05-26 cs.CV cs.AI

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

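Ruihao Xu, Yong Liu, Yansong Tang, Sule Bai, Xubing Ye, Bingyao Yu, Yutao Guo, Jiwen Lu, Jie Zhou

AI summary: Proposes DisDop, a framework that systematically distills multi-level domain priors from foundation models (RemoteCLIP and DINOv3) into a lightweight detector, achieving state-of-the-art performance in open-vocabulary aerial object detection.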

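Phase-Aware Wavelet Scattering-Based Encoder-Decoder for Dense Prediction

Ghassen Marrakchi, Basarab Matei

AI summary: Proposes a phase-aware scattering encoder-decoder that recovers spatial structure by explicitly preserving phase information in the skip connections, and validates the value of phase for dense prediction on image denoising and skin-lesion segmentation.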

Comments 21 pages, 16 figures, 10 tables

2605.24614 2026-05-26 cs.CL cs.AI cs.LG

Measuring the Depth of LLM Unlearning via Activation Patching

Jaeung Lee, Dohyun Kim, Jaemin Jo

AI summary: Proposes the Unlearning Depth Score (UDS), which quantifies the mechanistic depth of unlearning via activation patching and achieves the highest faithfulness and robustness in a meta-evaluation on 150 unlearned models.

Comments 18 pages

Abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score

2605.24613 2026-05-26 cs.CL cs.AI cs.SE

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

Haizhou Xia

AI summary: Proposes GuardedRepair, a framework whose selective replacement mechanism avoids breaking correct results while repairing errors in LLM mathematical reasoning. On GSM8K, accuracy rises from 95.60% to 96.89% with no broken-correct cases measured in the main run.

Comments 15 pages, including appendices. Code and artifacts available at https://github.com/Haizhoux0517/guarded-repair

Abstract

Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.
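The fixed/broken bookkeeping behind these GSM8K figures can be checked with a short calculation. It assumes the standard GSM8K test split of 1,319 problems, a size the abstract does not state; under that assumption the reported percentages are mutually consistent, and the number of errors fixed by full re-solving follows as a derived quantity rather than a reported one.

```python
TEST_SIZE = 1319  # assumption: standard GSM8K test split size (not given in the abstract)

def pct(correct):
    """Accuracy in percent, rounded to two decimals as in the abstract."""
    return round(100.0 * correct / TEST_SIZE, 2)

# Reported by the abstract: 58 remaining errors, 17 fixed, no measured breaks.
initial_errors, fixed, broken = 58, 17, 0
initial_correct = TEST_SIZE - initial_errors
guarded_correct = initial_correct + fixed - broken

# Full re-solving: reported 93.03% accuracy with 47 initially correct answers broken.
resolve_correct = round(0.9303 * TEST_SIZE)
resolve_broken = 47
# Derived under the assumption above: errors that re-solving must have fixed.
resolve_fixed = resolve_correct - (initial_correct - resolve_broken)

print(pct(initial_correct), pct(guarded_correct), pct(resolve_correct), resolve_fixed)
```

Seen this way, the asymmetry the abstract describes is concrete: unconstrained re-solving repairs some errors but breaks more correct answers than it fixes, while the guarded policy accepts fewer changes and keeps the net effect positive.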

2605.24611 2026-05-26 cs.LG

Beyond Fixed Points: Superpolynomial Capacity of Asymmetric Hopfield Networks

Aakash Kumar, Anatoly Khina, Frederik Mallmann-Trenn, Emanuele Natale

AI summary: By combining results from combinatorics, number theory, and the analysis of opinion dynamics, constructs superpolynomially many limit-cycle attractors in the classical synchronous asymmetric Hopfield network, each with superpolynomial period and robust to noise, giving the first demonstration of superpolynomial capacity for asymmetric Hopfield networks.

Abstract

Classical Hopfield networks are limited to static patterns due to symmetric weights, whereas asymmetric networks can encode temporal sequences via limit-cycle attractors. Achieving high-capacity storage of long sequences in classical synchronous asymmetric networks, however, has remained a challenge. We present a simple and robust construction within the classical asymmetric Hopfield model with binary neurons and synchronous updates, that allows $n$ neurons to support $\exp\!\big(\Omega(n/(\log n)^2)\big)$ distinct limit-cycle attractors, each with period $\exp\!\big(\Omega(\sqrt n/\log n)\big)$ and robust to random noise with flip probability up to $\frac12-o(1)$, yielding superpolynomial capacity in both the number and length of stored sequences. This is the first demonstration of such capacity for asymmetric Hopfield networks, which we obtain by combining results from combinatorics, number theory and the analysis of opinion dynamics. Our findings show that synchronous asymmetric Hopfield networks possess a sequence-memory capacity which is larger and more robust than previously recognized, demonstrating that, in both biological and artificial neural systems, robust sequence representation can be achieved through coarse architectural motifs rather than complex nonlinearities.
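A toy example shows how asymmetric weights with synchronous updates produce a limit cycle rather than a fixed point. The cyclic-shift weight matrix below is a textbook-style illustration chosen for brevity; it is not the construction from the paper and only reaches period $n$, far below the superpolynomial periods the abstract reports.

```python
def sign(v):
    """Binary activation with ties broken toward +1."""
    return 1 if v >= 0 else -1

def step(weights, state):
    """One synchronous update: every neuron reads the previous state at once."""
    return [sign(sum(w * s for w, s in zip(row, state))) for row in weights]

def cyclic_shift_weights(n):
    """Asymmetric weights where neuron i listens only to neuron i-1 (mod n)."""
    return [[1 if j == (i - 1) % n else 0 for j in range(n)] for i in range(n)]

def cycle_period(weights, state):
    """Iterate until a state repeats and return the length of the cycle reached."""
    seen = {}
    t = 0
    while tuple(state) not in seen:
        seen[tuple(state)] = t
        state = step(weights, state)
        t += 1
    return t - seen[tuple(state)]

W = cyclic_shift_weights(5)
start = [1, -1, -1, -1, -1]
period = cycle_period(W, start)  # the single +1 travels around the ring
```

Here `W[1][0]` is 1 while `W[0][1]` is 0, so the matrix is not symmetric, and the network keeps rotating the pattern instead of settling.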

2605.24608 2026-05-26 cs.AI cs.CV cs.LG

Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

Gustavo, Angulo

AI summary: Building on lattice theory and mathematical morphology, establishes a rigorous algebraic framework for deep convolutional architectures (CNN, ResNet, UNet), shows that the standard CNN pipeline is a cross-lattice operator, and identifies three layer designs that are genuine idempotent openings.

Abstract

We develop a rigorous algebraic framework for deep convolutional architectures (CNNs, ResNets, and encoder-decoder networks such as UNet), grounded in lattice theory and mathematical morphology. The central tool is the Matheron-Maragos-Banon-Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution + ReLU + flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and $-\infty$ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias-Heijmans adjoint pyramid theory, and gives the Activation-Pooling Dilation (APD) factorisation with its correct adjoint.
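The contrast between an idempotent morphological opening and a non-idempotent CNN-style layer can be seen on a 1D signal. This is a toy sketch in the pointwise lattice only: the flat structuring element of radius 1, the averaging kernel, and the stride-1 pooling are assumptions made for illustration, and the Fourier-lattice part of the paper's analysis is not reproduced.

```python
def erode(x, r=1):
    """Flat erosion: minimum over a window of radius r, clipped at the borders."""
    n = len(x)
    return [min(x[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]

def dilate(x, r=1):
    """Flat dilation: maximum over the same window (also a stride-1 max-pool)."""
    n = len(x)
    return [max(x[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]

def opening(x, r=1):
    """Erosion followed by its adjoint dilation."""
    return dilate(erode(x, r), r)

def cnn_layer(x, kernel=(0.5, 0.5)):
    """Linear convolution, then ReLU, then stride-1 max-pooling (same length)."""
    n = len(x)
    # The last sample is repeated at the right border so the output keeps length n.
    conv = [sum(k * x[min(n - 1, i + d)] for d, k in enumerate(kernel)) for i in range(n)]
    relu = [max(0.0, v) for v in conv]
    return dilate(relu, r=1)

signal = [0, 3, 0, 2, 2, 2, 0, 5, 5, 0]
opened = opening(signal)            # narrow peaks removed, width-3 plateau kept
once = cnn_layer(signal)
twice = cnn_layer(once)             # keeps changing: the layer is not idempotent
```

Applying `opening` a second time leaves `opened` unchanged, whereas stacking `cnn_layer` keeps transforming the signal, which is the property the abstract links to depth adding representational power.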

2605.24604 2026-05-26 cs.CV

LC-Flow: Learning Local Continuous Optical Flow and Confidence from events

Gunwoo Jeon, Chaesong Park, Jongwoo Lim

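AI summary: Proposes LC-Flow, the first learning-based method for estimating temporally continuous optical flow from local events. A continuous local recurrent network and a jointly learned confidence score address event sparsity and the aperture problem. The method is state of the art among local methods on MVSEC and DSEC, and its confidence-guided aggregation surpasses frame-based methods on MVSEC.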

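AI-translated abstract

Event cameras capture brightness changes asynchronously at microsecond resolution, yet existing optical flow methods do not fully exploit this temporal continuity. Frame-based methods introduce artificial accumulation latency and suffer from domain overfitting, while model-based local methods run statelessly, discarding the temporal history between predictions and producing inaccurate flow. We propose LC-Flow, the first temporally continuous, learning-based optical flow estimator that operates purely on local events. At its core is a Continuous Local Recurrent Network that maintains a persistent hidden state for each spatial grid and accumulates temporal context incrementally as events arrive. Unlike frame-based methods, which are tied to fixed accumulation windows, and unlike stateless model-based methods, which recompute motion from scratch at every step, LC-Flow produces sparse local flow estimates with full motion history at arbitrary timestamps. To resolve the ambiguity inherent in local observations, we jointly learn a confidence score that quantifies the reliability of each prediction and explicitly handles event sparsity and the aperture problem. The confidence plays two roles: it filters out unreliable estimates for downstream tasks such as visual odometry, and it supplies principled weights for a multi-scale, confidence-guided aggregation that reconstructs globally consistent flow from the sparse local outputs. LC-Flow achieves state-of-the-art performance among local methods on both MVSEC and DSEC, and the confidence-guided aggregation sets a new overall state of the art on the MVSEC benchmark, surpassing heavy frame-based networks that rely on global spatial priors.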

English abstract

Event cameras capture brightness changes asynchronously with microsecond resolution, yet existing optical flow methods fail to fully exploit this temporal continuity. Frame-based approaches impose artificial accumulation latency and suffer from domain overfitting, while model-based local methods operate statelessly, discarding temporal history between predictions and yielding inaccurate flows. We propose LC-Flow, the first temporally continuous, learning-based optical flow estimator that operates purely from local events. At its core, a Continuous Local Recurrent Network maintains persistent hidden states per spatial grid, incrementally accumulating temporal context as events arrive. Unlike frame-based methods constrained to fixed accumulation windows, and unlike stateless model-based methods that recompute motion from scratch at each step, LC-Flow produces sparse local flow estimates at arbitrary timestamps with full motion history. To address the inherent ambiguity of local observations, we jointly learn a confidence score that quantifies the reliability of each prediction, explicitly handling event sparsity and the aperture problem. This confidence serves a dual role: filtering unreliable estimates for downstream tasks such as visual odometry, and providing principled weights for a multi-scale confidence-guided aggregation that reconstructs globally consistent flow from the sparse local outputs. LC-Flow achieves state-of-the-art performance among local methods on both MVSEC and DSEC, while the confidence-guided aggregation establishes a new overall state-of-the-art on the MVSEC benchmark, surpassing heavy frame-based networks that rely on global spatial priors.
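
The dual role of the confidence score (filtering and weighting) can be sketched in a few lines. This is a deliberately simplified, single-scale stand-in for the multi-scale aggregation the abstract describes. The tuple layout `(u, v, confidence)`, the function names, and the threshold are assumptions made for illustration and are not taken from the paper.

```python
def filter_by_confidence(estimates, threshold):
    """Keep only (u, v, confidence) estimates whose confidence reaches the threshold."""
    return [e for e in estimates if e[2] >= threshold]

def aggregate_flow(estimates):
    """Confidence-weighted mean of the (u, v) vectors; None if the total weight is zero."""
    total = sum(conf for _, _, conf in estimates)
    if total == 0:
        return None
    u = sum(fu * conf for fu, _, conf in estimates) / total
    v = sum(fv * conf for _, fv, conf in estimates) / total
    return (u, v)

# Hypothetical local estimates: two plausible vectors and one zero-confidence outlier.
local = [(1.0, 0.0, 0.75), (3.0, 2.0, 0.25), (9.0, 9.0, 0.0)]
print(filter_by_confidence(local, 0.5))
print(aggregate_flow(local))
```

The zero-confidence outlier contributes nothing to the weighted mean, which is the behaviour one would want from estimates affected by event sparsity or the aperture problem.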

2605.24603 2026-05-26 cs.CL cs.LG

CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer

Piotr Wilam

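AI summary: By extracting concept-specific neural circuits for 106 Python concepts, the paper finds that the model's internal organisation follows computational structure rather than semantic category, and identifies an atomicity super-cluster.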

Comments: Code: https://github.com/piotrwilam/AtlasCSP

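AI-translated abstract

A sparse 8-layer code Transformer develops dedicated neural circuits for every Python construct tested, and these circuits are organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types and 63 builtin objects) by marginalising over 63,800 controlled prompts, and we decompose each circuit into concept-specific and token-driven components using contrastive checker prompts, which present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts yield non-empty universal circuits at each of nine parameter settings, and the ranking of concept-specificity across constructs stays stable over the sweep, so survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: at mid-to-late layers, concept-only neurons make up as much as 62.5% of the loudest-firing neurons, whereas builtin circuits are almost entirely token-driven. Third, six computationally atomic constructs (Import, ImportFrom, Break, Continue, Pass, and Assert) cluster together even though they are semantically unrelated, sharing only the property of being single-statement constructs that need no nested body. This atomicity super-cluster, together with a four-tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model's internal organisation tracks computational structure rather than meaning. The methodology, the full decomposition data, and the analysis code have been released.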

English abstract

A sparse 8-layer code transformer develops dedicated neural circuitry for every Python construct tested, and that circuitry is organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types, 63 builtin objects) by marginalising across 63,800 controlled prompts, and decompose each circuit into concept-specific and token-driven components using contrastive checker prompts that present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts produce non-empty universal circuits at every one of nine parameter settings, and the ranking of concept-specificity across constructs is stable across the sweep; survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: concept-only neurons constitute up to 62.5% of the loudest-firing neurons at mid-to-late layers, while builtin circuits are almost entirely token-driven. Third, six computationally atomic constructs (Import, ImportFrom, Break, Continue, Pass, Assert) cluster together despite being semantically unrelated, sharing only the property of being single-statement constructs requiring no nested body; this atomicity super-cluster, together with a four-tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model's internal organisation tracks computational structure rather than meaning. The methodology, full decomposition data, and analysis code are released.
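
One way to picture the contrastive decomposition is as set arithmetic over neuron identifiers. The sketch below assumes, purely for illustration, that a circuit and the checker-prompt activations are each available as a set of neuron ids. The function names and toy ids are hypothetical, and the released code linked above may define these components differently.

```python
def decompose_circuit(circuit, checker_active):
    """Split a circuit (set of neuron ids) using neurons active on checker prompts."""
    token_driven = circuit & checker_active   # also fire on the bare keyword token
    concept_only = circuit - checker_active   # fire only with the full construct
    return concept_only, token_driven

def concept_only_fraction(circuit, checker_active):
    """Share of the circuit that is concept-only (0.0 for an empty circuit)."""
    if not circuit:
        return 0.0
    concept_only, _ = decompose_circuit(circuit, checker_active)
    return len(concept_only) / len(circuit)

# Toy neuron ids, not data from the paper.
circuit = {1, 2, 3, 4}
checker_active = {3, 4, 5}
print(decompose_circuit(circuit, checker_active))
print(concept_only_fraction(circuit, checker_active))
```

Under this reading, a circuit that is "almost entirely token-driven" would have a concept-only fraction near zero.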

2605.24600 2026-05-26 cs.AI

Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Zhimin Lin, Kun Cheng, Fan Bai, Jie Gao

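AI summary: Proposes a multi-agent framework that simulates peer debriefing and introduces three analytical perspectives (theory-driven, data-driven, and applied) to improve the coding quality of large language models in qualitative data analysis.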

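AI-translated abstract

Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), but their outputs often lack the depth and nuance of human analysis. We argue that this gap reflects a credibility practice from human QDA that is missing: peer debriefing, in which an analyst asks a disinterested peer for feedback and refines their coding accordingly. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into the key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, together with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each of which applies a different analytical perspective (Theory-Driven, Data-Driven, or Applied) and refines the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives come from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it with three LLMs on three datasets from two domains, measuring semantic similarity to human-annotated codes. In every setting, perspective-based peer-debriefing refinement comes closer to the human codes than a single-LLM baseline, and an ablation further shows that the gain does not come merely from additional refinement. The three perspectives also yield distinct trade-offs, indicating that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.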

English abstract

Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.

2605.24598 2026-05-26 cs.AI cs.MA

Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

Yuxin Zhang, Mengxue Hu, Zheng Lin, Xiaoyi Fan, Fan Xie, Zihan Fang, Jing Yang, Wenjun Zhu, Zhiwen Chen, Chengfei Lv, Zhe Chen

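AI summary: Proposes Hera, a step-level device-cloud coordinator for LLM agents that uses two-stage training (imitation learning followed by reinforcement learning) to optimize the performance-cost Pareto frontier on long-horizon tasks.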

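AI-translated abstract

Large language model (LLM) agents excel at solving complex long-horizon tasks by interacting autonomously with their environments. Their real-world deployment, however, faces a fundamental device-cloud dilemma: on-device models are efficient but brittle, while cloud models are stronger but computationally costly. State-of-the-art LLM device-cloud routers usually make coarse task-level decisions and cannot adapt to the changing difficulty of multi-step agent interactions. To address this, we propose Hera, a step-level device-cloud LLM agent coordinator for long-horizon tasks that achieves a strong performance-cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage treats step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, and each state is labeled according to whether the device and cloud actions agree. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels that favor higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, reaching 92.5% of the cloud-only success rate while using the cloud in only 46.3% of steps.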

English abstract

Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device-cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device-cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device-cloud LLM agent coordinator for long-horizon tasks that achieves a strong performance-cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
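
The first-stage labeling rule (replay the device agent on cloud trajectories and label each state by action agreement) can be sketched as follows. The data layout, the function names, and the `"device"` / `"cloud"` label strings are assumptions for illustration; the abstract does not specify them, and a real device agent would be an LLM call rather than a lookup table.

```python
def label_routing_states(cloud_trajectory, device_policy):
    """Replay the device policy on a cloud trajectory and label every state.

    cloud_trajectory: list of (state, cloud_action) pairs.
    Returns (state, label) pairs: "device" when both actions agree, else "cloud".
    """
    labeled = []
    for state, cloud_action in cloud_trajectory:
        agrees = device_policy(state) == cloud_action
        labeled.append((state, "device" if agrees else "cloud"))
    return labeled

def cloud_fraction(labeled):
    """Share of steps that would be routed to the cloud."""
    if not labeled:
        return 0.0
    return sum(1 for _, label in labeled if label == "cloud") / len(labeled)

# Hypothetical toy data: states and actions are plain strings.
trajectory = [("at_door", "open"), ("in_room", "pick_key"), ("at_chest", "unlock")]
device_policy = {"at_door": "open", "in_room": "look", "at_chest": "unlock"}.get
labels = label_routing_states(trajectory, device_policy)
print(labels)
print(cloud_fraction(labels))
```

These labels would then serve as targets for a supervised step-level router. The second, reinforcement-learning stage described in the abstract is not covered by this sketch.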

2605.24597 2026-05-26 cs.AI cs.CL cs.LG

Learning to Reason Efficiently with A* Post-Training

Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell

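AI summary: Uses the A* search algorithm to guide LLMs toward correct and efficient reasoning steps, proposing two training methods (supervised fine-tuning and reinforcement learning) that significantly improve reasoning accuracy and efficiency on models with 1B-3B parameters.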

Comments: Preprint

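AI-translated abstract

Many applications of large language models (LLMs) require deductive reasoning, yet models often produce incorrect or redundant inference steps. We frame natural language inference as a search problem in which the final answer is the valid proof itself, which requires every intermediate inference in the reasoning procedure to be correct. Specifically, we study whether LLMs can learn to generate correct and efficient proofs under the guidance of A* search, an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on A* execution traces, and reinforcement learning with A*-informed process reward models. Empirically, Llama-3.2 models in the 1B-3B range benefit substantially from A* post-training, going from near-zero accuracy to outperforming the much larger DeepSeek-V3.2. Our analysis reveals a trade-off: simple correctness rewards maximize accuracy, whereas A*-informed signals balance accuracy and efficiency. We also find that, on larger search spaces, models trained with imperfect heuristics achieve higher accuracy. Our results point to a promising direction: reasoning guided by principles derived from classical search algorithms.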

English abstract

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search, an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B-3B range benefit substantially from A* post-training, going from near-zero accuracy to outperforming DeepSeek-V3.2, a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
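
For readers unfamiliar with the search procedure that supplies the training signal, here is a minimal textbook A* sketch. It is not the paper's implementation. The toy graph, the heuristic table, and the returned `expanded` counter are assumptions for illustration, and in the paper's setting the nodes would correspond to proof states rather than letters.

```python
import heapq

def a_star(graph, start, goal, h):
    """A* over graph: dict node -> list of (neighbor, cost); h is the heuristic.

    Returns (path, cost, expanded). With an admissible h the path is optimal.
    """
    frontier = [(h(start), 0, start, [start])]   # entries are (f, g, node, path)
    best_g = {start: 0}
    expanded = 0
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue                              # stale entry, a cheaper route exists
        expanded += 1
        if node == goal:
            return path, g, expanded
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(
                    frontier, (new_g + h(neighbor), new_g, neighbor, path + [neighbor])
                )
    return None, float("inf"), expanded

# Toy graph and an admissible heuristic (exact remaining cost); both hypothetical.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
heuristic = {"A": 3, "B": 2, "C": 1, "D": 0}
path, cost, expanded = a_star(graph, "A", "D", heuristic.get)
print(path, cost, expanded)
```

The sequence of expanded nodes is the kind of execution trace that could be serialised for supervised fine-tuning, and the `f = g + h` values are one plausible source for a process-level reward. Both uses are readings of the abstract rather than details confirmed by it.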
