arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.03804 2026-05-26 cs.AI

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

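Jiale Chang, Yuxiang Ren

AI summary: Proposes the ScrapMem framework, which compresses older memories through an Optical Forgetting mechanism and builds an episodic memory graph to provide efficient multimodal long-term memory on resource-constrained devices. On ATM-Bench it sets a new state of the art of 51.0% Joint@10, cuts storage by up to 93%, and raises Recall@10 to 70.3%.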

Comments 10 pages, 4 figures

Abstract

Long-term personalized memory for LLM agents is challenging on resource-limited edge devices due to high storage costs and multimodal complexity. To address this, we propose ScrapMem, a framework that integrates multimodal data into "Scrapbook Pages." ScrapMem introduces Optical Forgetting, an optical compression mechanism that progressively reduces the resolution of older memories, lowering storage cost while suppressing low-value details. To maintain semantic consistency, we construct an Episodic Memory Graph (EM-Graph) that organizes key events into a causal-temporal structure. Extensive experiments on the multimodal ATM-Bench show that ScrapMem provides three main benefits: (1) strong performance, achieving a new state of the art with a 51.0% Joint@10 score; (2) high storage efficiency, reducing memory usage by up to 93% via Optical Forgetting; and (3) improved recall, increasing Recall@10 to 70.3% through structured aggregation. ScrapMem offers an effective and storage-efficient solution for on-device long-term memory in multimodal LLM agents.
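As a rough illustration of lowering resolution as a memory ages, the Python sketch below maps a page's age to a target image size. The halving schedule, the `half_life_days` and `min_side` parameters, and both function names are assumptions made for this example; the abstract does not specify the actual schedule used by Optical Forgetting.

```python
def target_resolution(width, height, age_days, half_life_days=30.0, min_side=32):
    """Return a (width, height) pair that shrinks as a memory page ages.

    Hypothetical schedule: each side is scaled by 0.5 ** (age / half_life),
    clamped so it never drops below min_side or grows past the original.
    """
    scale = 0.5 ** (age_days / half_life_days)
    new_w = min(width, max(min_side, round(width * scale)))
    new_h = min(height, max(min_side, round(height * scale)))
    return new_w, new_h


def storage_saving(width, height, age_days, **kwargs):
    """Fraction of pixels removed relative to the original page."""
    new_w, new_h = target_resolution(width, height, age_days, **kwargs)
    return 1.0 - (new_w * new_h) / (width * height)
```

Under this assumed schedule, a 1024x768 page that is two half-lives old is stored at 256x192, which removes 93.75% of its pixels. Pixel count is only a proxy for storage; real savings depend on the image codec.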

2605.03675 2026-05-26 cs.AI

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

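Bronislav Sidik, Lior Rokach

AI summary: Proposes MEMTIER, a three-tier memory architecture that combines a structured episodic store, five-signal weighted retrieval, attention-attributed weight updates, asynchronous consolidation, and a PPO-based policy. On the LongMemEval-S benchmark it reaches 38% accuracy, up from 5% for the full-context baseline, and runs locally on a 6GB GPU.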

Comments 11 pages, 1 figure, 5 tables. Under review

Abstract

Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing flat-file memory systems. We present MEMTIER, a tripartite memory architecture for the OpenClaw agent runtime that introduces a structured episodic JSONL store, a five-signal weighted retrieval engine, an attention-attributed cognitive weight update loop, an asynchronous consolidation daemon promoting episodic facts to a semantic tier, and a PPO-based policy framework for adapting retrieval weights (infrastructure validated; performance gains pending camera-ready). On the full 500-question LongMemEval-S benchmark (Wu et al., 2025), MEMTIER achieves Acc=0.382, F1=0.412 with Qwen2.5-7B on a consumer 6GB GPU - a +33 percentage point improvement over the full-context baseline (0.050 -> 0.382, i.e., 5% -> 38%). With DeepSeek-V4-Flash fact pre-population, single-session recall reaches 0.686-0.714, exceeding the paper's RAG BM25 GPT-4o baseline (0.560) on those categories. Temporal reasoning rises to 0.323 and multi-session synthesis to 0.173, demonstrating that structured semantic pre-population qualitatively changes what lightweight retrieval can achieve. All phases run locally on a consumer laptop with a 6GB GPU.
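A weighted multi-signal retrieval step of the kind the abstract describes could look roughly like the sketch below. The abstract says only that five signals are combined, so the signal names, the `Episode` class, and the plain weighted sum are all hypothetical choices for illustration, not MEMTIER's actual design.

```python
from dataclasses import dataclass

# Hypothetical signal names; the abstract does not list the five signals.
SIGNALS = ("relevance", "recency", "importance", "frequency", "confidence")


@dataclass
class Episode:
    text: str
    signals: dict  # signal name -> value in [0, 1]; missing signals count as 0


def score(episode, weights):
    """Weighted sum of the five signal values for one stored episode."""
    return sum(weights[name] * episode.signals.get(name, 0.0) for name in SIGNALS)


def retrieve(episodes, weights, k=3):
    """Return the k highest-scoring episodes, best first."""
    return sorted(episodes, key=lambda e: score(e, weights), reverse=True)[:k]
```

In a setup like this, an adaptive policy such as the PPO component mentioned in the abstract would adjust the entries of `weights` rather than the retrieval code itself.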

2605.03472 2026-05-26 cs.CL cs.AI

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

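Tianze Han, Beining Xu, Hanbo Zhang, Yongming Lu

AI summary: Targets implicit sycophancy in mental-health dialogue models, where a response appears empathetic but reinforces negative cognition. Proposes a structured offline audit framework based on Dynamic Emotional Signature Graphs (DESG) that evaluates response direction through clinical-state transitions and achieves the best harmful-risk detection on a clean matched benchmark.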

Abstract

Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supportiveness, or fluency as evidence of safety. In this paper, we study a hidden failure mode that we call implicit sycophancy: a response may appear empathetic while implicitly reinforcing catastrophizing, avoidance, hopeless prediction, or CBT-style labeling. To examine this problem, we introduce a diagnostic benchmark for implicit-sycophancy detection, built from three representative mental-health dialogue sources covering everyday peer support, counseling-style emotional support, and crisis-oriented interaction, and further construct a leakage-audited clean single-response matched benchmark with 500 contexts and 1,500 matched response windows. We then propose Dynamic Emotional Signature Graphs (DESG), a structured offline audit framework that separates LLM-based state extraction from final scoring and evaluates clinical direction through semantic, affective, and cognitive-distortion state transitions rather than free-form LLM judgment. Unlike metadata, surface-style, lexical, embedding, and rubric-LLM baselines, DESG scores the direction of clinical-state change induced by a response; on the leakage-audited clean matched benchmark, DESG-StateRisk improves over the strongest non-DESG baseline by 0.0488 macro-F1 and achieves the best harmful-risk detection result. These results suggest that evaluating implicit sycophancy requires explicit clinical-state modeling together with leakage checks, shortcut controls, and competitive baselines.

2605.02764 2026-05-26 cs.CV

FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation

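Sheng-Wei Chan, Hsin-Jui Pan, Chun-Po Shen, Yung-Che Wang, Meng-Qian Li, Chia-Min Lin, Jen-Shiun Chiang

AI summary: Proposes FoR-Net, which uses a learned importance map and a Top-K activation mechanism to focus on hard regions such as thin structures and object boundaries, enabling efficient semantic segmentation under limited computational resources.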

Comments 9 pages, 2 figures, 2 tables. Efficient semantic segmentation under resource-constrained settings. Code will be released

Abstract

We present FoR-Net, an efficient semantic segmentation framework that focuses on identifying and enhancing hard regions. Instead of relying on heavy global modeling, FoR-Net adopts an efficient strategy that selectively emphasizes informative regions through a learned importance map and a Top-K activation mechanism. Specifically, a selector module predicts region-wise importance, enabling the model to focus on challenging areas such as thin structures and object boundaries. Multi-scale reasoning is achieved using convolutional branches with different receptive fields, allowing diverse spatial context aggregation. We evaluate FoR-Net on the Cityscapes benchmark under limited computational resources. Despite its efficient design and standard training configuration, FoR-Net achieves competitive performance and exhibits improved attention to difficult regions. These results suggest that selective region-focused reasoning can serve as a practical and efficient alternative for semantic segmentation. This work explores region-focused reasoning under resource-constrained settings and provides insights for developing efficient and region-aware segmentation models.
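To make the Top-K selection idea concrete, the sketch below turns an importance map into a binary mask that marks the K highest-scoring locations. It is a plain-Python illustration under assumed inputs (a 2D list standing in for the selector output); the function name and tie-breaking rule are hypothetical, and FoR-Net's actual activation mechanism may differ.

```python
def top_k_mask(importance, k):
    """Return a 0/1 mask marking the k highest-scoring locations.

    importance: 2D list of floats (a stand-in for a selector's output).
    Ties are broken by row-major position so the result is deterministic.
    """
    flat = [(value, r, c)
            for r, row in enumerate(importance)
            for c, value in enumerate(row)]
    # Highest score first, then earliest position.
    flat.sort(key=lambda item: (-item[0], item[1], item[2]))
    mask = [[0] * len(row) for row in importance]
    for _, r, c in flat[:k]:
        mask[r][c] = 1
    return mask
```

A model could then spend extra computation only where the mask is 1, which is the general motivation for region-focused designs.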

2605.02495 2026-05-26 cs.LG cs.AI stat.ML

Efficient Preference Poisoning Attack on Offline RLHF

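Chenye Yang, Weiyu Xu, Lifeng Lai

AI summary: Studies preference poisoning attacks on offline RLHF and proposes two efficient label-flip attacks, BAL-A and BMP-A, based on binary sparse approximation over a gradient dictionary.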

Comments Accepted to ICML 2026

Abstract

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label-flip attacks against log-linear DPO. We first show that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for $K$-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
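The parameter-independent shift can be checked numerically under one common reading of "log-linear DPO", which is an assumption here: the policy satisfies $\pi_\theta(y|x) \propto \exp(\theta^\top \phi(x,y))$, so the per-sample loss is $-\log\sigma(\beta(\theta^\top d - c))$ with $d$ the feature difference between the preferred and rejected responses and $c$ the reference log-ratio. Flipping the label negates both $d$ and $c$, and because $\sigma(z) + \sigma(-z) = 1$ the gradient changes by exactly $\beta d$. The names `dpo_grad` and `flip_shift` are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dpo_grad(theta, d, c, beta):
    """Gradient of -log(sigmoid(beta * (theta . d - c))) with respect to theta."""
    z = beta * (sum(t * x for t, x in zip(theta, d)) - c)
    coeff = -beta * sigmoid(-z)
    return [coeff * x for x in d]


def flip_shift(theta, d, c, beta):
    """Gradient after flipping the preference label minus gradient before."""
    flipped = dpo_grad(theta, [-x for x in d], -c, beta)  # swap winner and loser
    original = dpo_grad(theta, d, c, beta)
    return [f - o for f, o in zip(flipped, original)]
```

Evaluating `flip_shift` at any two values of `theta` returns the same vector, `beta * d`, which is what makes it possible to treat each candidate flip as a fixed dictionary atom.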

2605.01017 2026-05-26 cs.CL

Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect

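Hua Zhao, Jiapei Gu, Michelle Mingyue Gu

AI summary: Builds the Xiaohongshu Social Comparison Reader Elicitation benchmark (XHS-SCoRE) and finds a generation-detection mismatch in LLMs: the social-comparison signal is learnable in-domain but not robustly accessible to prompt-based classification.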

Comments 19 pages, preprint Title change: Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect

Abstract

We introduce Xiaohongshu Social Comparison Reader Elicitation (XHS-SCoRE), a reader-grounded benchmark for detecting whether text-only Xiaohongshu (RedNote) posts elicit Upward, Downward, or Neutral/no clear social comparison from a first-person reader perspective. The task targets a socially meaningful relational, behaviorally real signal not reducible to sentiment. Across prompted LLM classifiers and supervised Chinese encoders, we find a consistent generation--detection mismatch: the signal is textually learnable in-domain, but not robustly accessible to prompt-based classification. Prompted LLM classifiers show stable failures, especially neutralization of comparison-eliciting posts and model-specific directional skew. A controlled pilot shows that LLM-generated Xiaohongshu-style posts can shift perceived standing and comparison-related affect even when prompt-based detection of the same construct remains fragile. XHS-SCoRE contributes a benchmark for reader-grounded comparison detection and a diagnostic framework for studying when socially meaningful relational cues remain only partially visible to prompt-based inference.

2605.00419 2026-05-26 cs.LG cs.CL

Rethinking LLM Ensembling from the Perspective of Mixture Models

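Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang

AI summary: Proposes the Mixture-model-like Ensemble (ME), which reinterprets ensembling as a mixture model and stochastically selects a single model to generate each next token, avoiding explicit computation of the full ensemble distribution. It achieves a 1.78x-2.68x speedup and reveals a connection between ensembling and token-level routing.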

Comments ICML 2026 Spotlight

Abstract

Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementations to LLMs, which requires a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x-2.68x faster than conventional ensembling. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://github.com/Kamichanw/Mixture-model-like-Ensemble.
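The equivalence claimed above follows from the law of total probability: choosing model $i$ with probability $w_i$ and then sampling a token from $p_i$ yields the token with probability $\sum_i w_i p_i(\text{token})$, which is the ensemble average. The toy sketch below illustrates this with precomputed distributions; the function names are hypothetical and this is not the authors' implementation. In a real system only the chosen model would run a forward pass.

```python
import random


def ensemble_distribution(model_dists, weights):
    """Explicit ensemble: weighted average of every model's next-token distribution."""
    vocab = model_dists[0].keys()
    return {tok: sum(w * d[tok] for w, d in zip(weights, model_dists)) for tok in vocab}


def mixture_sample(model_dists, weights, rng):
    """Pick one model in proportion to its weight, then sample a token from it alone."""
    dist = rng.choices(model_dists, weights=weights, k=1)[0]
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

Drawing many tokens with `mixture_sample` and counting frequencies should approach the values returned by `ensemble_distribution`, up to sampling noise.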

2604.23853 2026-05-26 cs.AI

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

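Boqin Yuan, Yue Su, Renchu Song, Sen Yang, Jing Qin

AI summary: Addresses the lack of a per-step cost signal in skill-distillation pipelines. ClawTrace records cost-attributed traces and produces TraceCards, and CostCraft writes three kinds of skill patches: preserve, prune, and repair. The study finds that prune patches act as quality guardrails while preserve patches cause regressions, and argues for evaluating reusable skills by rule type.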

Comments Accepted at Agent Skills '26 Workshop, ACM Conference on AI and Agentic Systems (CAIS 2026), San José, CA, May 26, 2026

Abstract

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We use the cost-attribution gap to ask whether the rule types inside a distilled skill transfer the same way to new tasks. ClawTrace records cost-attributed agent traces and compiles each session into a TraceCard; CostCraft reads TraceCards and writes three kinds of skill patches: preserve, prune, and repair. We find a pattern aggregate metrics hide. On 30 held-out SpreadsheetBench tasks across two seeds, removing prune patches roughly tripled the quality-regression count without lowering median cost. Across the full 84-task SkillsBench transfer, CostCraft saves no aggregate cost. All three quality regressions trace to the preserve lane, and both quality wins trace to the prune lane: prune patches act as quality guardrails while preserve patches drive regressions. We argue that reusable agent skills should be evaluated at the rule-type level, not as monolithic instruction packages. To support this, we release ClawTrace, the TraceCard schema, and the full set of typed skills.

2604.23728 2026-05-26 cs.CV cs.AI

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

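Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei

AI summary: Proposes the ESIA framework, which models spatiotemporal interactions with a conditional random field and an energy function, and predicts pedestrian intention using structural consistency constraints and a simulated annealing algorithm. It achieves state-of-the-art performance on standard benchmarks with improved interpretability.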

Comments 13 pages, 6 figures, 3 tables

Abstract

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
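The general pattern of minimizing a CRF-style energy with annealing that starts from the best unary labels can be sketched as follows. This is generic simulated annealing written for illustration; the energy layout, the cooling schedule, and the function names are assumptions, and the paper's U-SSA algorithm and structural consistency terms are not reproduced here.

```python
import math
import random


def energy(labels, unary, pairwise):
    """Total energy: one unary cost per node plus one pairwise cost per edge."""
    total = sum(unary[i][labels[i]] for i in range(len(labels)))
    total += sum(cost[labels[i]][labels[j]] for (i, j), cost in pairwise.items())
    return total


def unary_seeded_annealing(unary, pairwise, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Minimize the energy, starting each node at its lowest-unary-cost label."""
    rng = random.Random(seed)
    labels = [min(range(len(u)), key=u.__getitem__) for u in unary]
    current = energy(labels, unary, pairwise)
    best, best_energy = list(labels), current
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(labels))
        old = labels[i]
        labels[i] = rng.randrange(len(unary[i]))
        candidate = energy(labels, unary, pairwise)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if candidate <= current or rng.random() < math.exp((current - candidate) / temp):
            current = candidate
            if current < best_energy:
                best, best_energy = list(labels), current
        else:
            labels[i] = old
        temp = max(temp * cooling, 1e-6)
    return best, best_energy
```

Here `unary[i][l]` is the cost of giving node `i` label `l`, and `pairwise[(i, j)][a][b]` is the cost of nodes `i` and `j` taking labels `a` and `b`. Seeding from the unary minimum means the search starts from each node's individually preferred label and only has to resolve disagreements introduced by the pairwise terms.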

2604.23295 2026-05-26 cs.CL cs.AI

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

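Bhaskar Singh, Shobhit Banga, Mahima Manik, Pranav Sharma

AI summary: Adapts the Moshi architecture with a custom Hindi tokeniser and 26,000 hours of real conversational data to present the first open, reproducible full-duplex spoken dialogue system for Hindi, which exhibits natural interruptions, overlaps, and backchannels.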

Abstract

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

2604.23017 2026-05-26 cs.LG cs.NA math.CV math.NA

Complex Stochastic Gradient Descent and Directional Bias in Reproducing Kernel Hilbert Spaces

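Natanael Alpay, Emeric Battaglia

AI summary: Proposes a complex stochastic gradient descent (complex SGD) method, proves its convergence without analyticity constraints, and confirms that directional bias results extend from the real to the complex setting. Kernel regression in complex reproducing kernel Hilbert spaces recovers superoscillation functions and Blaschke products.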

Abstract

Stochastic Gradient Descent (SGD) is a well-known stochastic iterative method, popular for large-scale convex optimization problems due to its simple implementation and scalability. Some objectives, such as those found in complex-valued neural networks, benefit from SGD- and Gradient Descent (GD)-style updates with a newly defined "gradient" that allows for complex parameters. This complex variant of the SGD/GD methods has already been proposed, but convergence guarantees without analyticity constraints have not yet been provided. We propose a variant of SGD (complex SGD) that allows for complex parameters, and we provide convergence guarantees under assumptions that parallel those from the real setting. Notably, these results extend to GD as well, and with the same set of assumptions, we confirm that some directional bias results extend from the real to the complex setting for kernel regression problems. We provide empirical results demonstrating the efficacy of complex SGD in kernel regression problems utilizing complex reproducing kernel Hilbert spaces. In particular, we demonstrate that superoscillation functions and Blaschke products can be recovered from the Fock space and Hardy space, respectively, as the optimal functions for a particular choice of loss function.
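One standard way to define a gradient for a real-valued loss of complex parameters is the Wirtinger derivative with respect to the conjugate parameter. The sketch below uses that convention on a one-parameter least-squares problem; it is an assumption that this matches the "gradient" the abstract refers to, and the function name and toy problem are hypothetical.

```python
import random


def complex_sgd(samples, lr=0.1, epochs=200, seed=0):
    """Fit a complex scalar w so that w * z is close to y for each (z, y) sample.

    The per-sample loss is |w*z - y|**2. Its derivative with respect to
    conj(w) is (w*z - y) * conj(z), which is the descent direction used here.
    """
    rng = random.Random(seed)
    w = 0j
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit samples in a fresh random order each epoch
        for z, y in data:
            residual = w * z - y
            w -= lr * residual * z.conjugate()
    return w
```

For noiseless data generated as `y = w_true * z`, each step multiplies the error `w - w_true` by `1 - lr * abs(z)**2`, so the iterate contracts toward `w_true` whenever that factor has magnitude below one.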

2604.18970 2026-05-26 cs.LG cs.CR

Mechanistic Anomaly Detection via Functional Attribution

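Hugo Lyons Keenan, Christopher Leckie, Sarah Erfani

AI summary: Reframes mechanistic anomaly detection as a functional attribution problem and uses influence functions to measure the functional coupling between test samples and a reference set. The method achieves state-of-the-art or significantly improved results on backdoor detection in vision models and LLMs, and also detects adversarial and out-of-distribution samples.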

Comments ICML '26 Camera Ready

Abstract

We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.

2604.18396 2026-05-26 cs.CL

River-LLM: Large Language Model Seamless Exit Based on KV Share

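Yingtao Shen, An Zou

AI summary: Addresses the latency and precision problems that KV cache absence causes for early exit in decoder-only architectures. Proposes River-LLM, a training-free framework that achieves seamless token-level exit through a lightweight KV-shared exit river and uses state transition similarity to predict KV errors and guide exit decisions. It obtains a practical speedup of 1.53x to 2.16x on mathematical reasoning and code generation tasks while maintaining high quality.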

Comments Accepted to ACL 2026, 13pages, with appendix. Corrected some typos

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a practical speedup of 1.53x to 2.16x while maintaining high generation quality.

2604.14054 2026-05-26 cs.LG cs.CL

$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

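Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao

AI summary: Proposes the $\pi$-Play framework, which uses the question construction paths generated during self-play as privileged information and combines them with self-distillation to achieve dense-feedback multi-agent co-evolution. Without external data it surpasses fully supervised search agents.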

Comments 23 pages, 11 figures

Abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information: self-play can provide high-quality privileged information for self-distillation at low cost and at scale, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a novel multi-agent self-evolution framework combining self-play and self-distillation. In $π$-Play, an examiner generates tasks together with QCPs, and a teacher uses the QCP as privileged context to densely supervise a student via self-distillation. This design transforms sparse-reward self-play into dense-feedback co-evolution. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play. Code is available at https://github.com/zhyaoch/pi-play.

2604.12383 2026-05-26 cs.SD

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

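Changhao Cheng, Wei Wang, Wangyou Zhang, Dongya Jia, Jian Wu, Zhuo Chen, Yanmin Qian

AI summary: Systematically explores how different alignment methods in speech VAEs affect reconstruction, understanding, and generation, and proposes joint-marginal alignment with adaptive weighting to achieve the best overall performance.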

Comments Submitted to Interspeech 2026

Abstract

Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete-token-based features for speech generation and reconstruction. Recent research has tried to enrich the structural information in VAE latent representations by aligning them with self-supervised learning (SSL) features, aiming for better generation performance. However, it remains unclear whether the widely used alignment approach based on time-axis distillation is optimal when more tasks are considered. To address this problem, this paper systematically explores different alignment approaches and analyzes their impact on performance along three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance while allowing for a controllable balance.

2604.11632 2026-05-26 cs.CL

CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

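Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

AI summary: Proposes the CARTBENCH benchmark, whose four subtasks evaluate vision-language models on recognition, reasoning, appreciation, and authenticity discrimination for Chinese artworks. Existing models fall short on hard evidence linking, style-to-period inference, and authenticity diagnosis.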

Comments under review

详情
AI中文摘要

我们介绍了CARTBENCH,一个基于博物馆的基准,用于评估视觉语言模型(VLM)在中国艺术作品上的表现,超越了短形式识别和问答。CARTBENCH包含四个子任务:CURATORQA用于基于证据的识别和推理,CATALOGCAPTION用于结构化的四部分专家风格鉴赏,REINTERPRET用于带有专家评分的可辩护的重新解读,以及CONNOISSEURPAIRS用于在视觉相似混淆下进行诊断性真伪鉴别。CARTBENCH通过将来自Wikidata的带有图像的故宫博物院物品与权威目录页面进行对齐构建,涵盖多个朝代的五个艺术类别。在九个代表性的VLM上,我们发现高整体CURATORQA准确率可能掩盖了在硬证据链接和风格到时期推断上的急剧下降;长形式鉴赏仍远未达到专家参考水平;而面向真伪的诊断性鉴别接近随机水平,这突显了当前模型在鉴赏级推理上的困难。

英文摘要

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

2604.11557 2026-05-26 cs.AI

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall: 统一LLM智能体的工具使用表示、数据与评估

Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen

AI总结 提出UniToolCall框架,通过标准化工具集构建、数据集生成和评估流程,结合22k+工具和390k+训练实例,引入锚点链接机制,在混合设置下使Qwen3-8B单轮严格精度达93.0%,超越GPT、Gemini和Claude。

Comments 21 pages, 10 figures, 9 tables. Code and datasets are publicly available at: https://github.com/EIT-NLP/UniToolCall

详情
AI中文摘要

工具使用能力是LLM智能体的基本组成部分,使其能够通过结构化函数调用与外部系统交互。然而,现有研究存在不一致的交互表示,很大程度上忽略了工具使用轨迹的结构分布,并依赖于不兼容的评估基准。我们提出了UniToolCall,一个统一的工具学习框架,标准化了从工具集构建、数据集生成到评估的整个流程。该框架整理了包含22k+工具的大型工具池,并通过结合10个标准化公共数据集与结构受控的合成轨迹,构建了包含390k+实例的混合训练语料库。它显式建模了多种交互模式,包括单跳与多跳、单轮与多轮,同时捕获了串行和并行执行结构。为了支持连贯的多轮推理,我们进一步引入了锚点链接机制,强制跨轮依赖关系。此外,我们将7个公共基准转换为统一的查询-动作-观察-答案(QAOA)表示,并在函数调用、轮次和对话级别进行细粒度评估。实验表明,在我们的数据集上微调Qwen3-8B显著提升了工具使用性能。在干扰项密集的Hybrid-20设置下,单轮严格精度达到93.0%,优于包括GPT、Gemini和Claude在内的商业模型。

英文摘要

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query-Action-Observation-Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

2604.10783 2026-05-26 cs.AI cs.LG

Learning Preference-Based Objectives from Clinical Narratives for Dynamic Sepsis Treatment

从临床叙述中学习基于偏好的目标用于动态脓毒症治疗

Daniel J. Tan, Jayne Hui Zhen Chan, Kai Wen Hwang, Arturo Yong Yao Neo, Kay Choong See, Mengling Feng

AI总结 提出CN-PR框架,利用大语言模型从出院小结中提取轨迹级偏好,通过偏好优化学习奖励函数,在离线强化学习中改善脓毒症治疗结果。

详情
AI中文摘要

在医疗保健中为强化学习设计奖励函数仍然具有挑战性,因为临床有意义的结果稀疏、延迟且难以明确指定。尽管结构化临床数据捕获了生理状态,但它们往往无法反映患者轨迹的更广泛方面,如治疗反应、恢复动态和干预负担。相比之下,临床叙述编码了临床医生对疾病进展、治疗效果和恢复的纵向评估,提供了超越预定义结果指标的轨迹级监督的潜在来源。我们提出了临床叙述知情偏好奖励(CN-PR)框架,该框架通过将临床叙述视为轨迹级偏好的可扩展监督,直接从出院小结中学习奖励函数。使用大语言模型,我们推导出轨迹质量分数,并在患者轨迹之间构建成对偏好,通过基于偏好的优化来学习奖励。为了考虑叙述信息量的变异性,我们引入了一个任务相关性信号,根据监督与下游决策任务的相关性对其进行加权。我们在离线强化学习中评估了CN-PR在动态脓毒症治疗中的应用。学习到的奖励与轨迹质量分数表现出强烈的单调对齐,并产生了与改善恢复相关结果相关的策略,包括无器官支持天数增加和更快的休克缓解,同时保持与基于结果的奖励基线相当的死亡率表现。这些发现在外部验证下得以保留。我们的结果表明,临床叙述为动态治疗方案中的奖励学习提供了可扩展且富有表现力的监督来源。

英文摘要

Designing reward functions for reinforcement learning (RL) in healthcare remains challenging because clinically meaningful outcomes are sparse, delayed, and difficult to explicitly specify. Although structured clinical data capture physiologic states, they often fail to reflect broader aspects of patient trajectories such as treatment response, recovery dynamics, and intervention burden. Clinical narratives, by contrast, encode longitudinal clinician assessments of disease progression, treatment effectiveness, and recovery, providing a potential source of trajectory-level supervision beyond predefined outcome metrics. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework that learns reward functions directly from discharge summaries by treating clinical narratives as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores and construct pairwise preferences between patient trajectories to learn rewards through preference-based optimization. To account for variability in narrative informativeness, we incorporate a task relevance signal that weights supervision according to its relevance to the downstream decision-making task. We evaluate CN-PR in dynamic sepsis treatment using offline RL. The learned reward demonstrated strong monotonic alignment with trajectory quality scores and produced policies associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining mortality performance comparable to outcome-based reward baselines. These findings were preserved under external validation. Our results suggest that clinical narratives provide a scalable and expressive source of supervision for reward learning in dynamic treatment regimes.
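The preference-learning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the CN-PR implementation: it assumes a Bradley-Terry model for pairwise preferences, and the names `build_preferences`, `bradley_terry_loss`, the `min_gap` threshold, and the rule of weighting a pair by the smaller of its two relevance scores are all hypothetical choices made for the example.

```python
import math
from itertools import combinations


def build_preferences(quality, relevance, min_gap=0.5):
    """Turn per-trajectory quality scores into weighted pairwise preferences.

    quality, relevance: dicts keyed by trajectory id (hypothetical inputs,
    e.g. scores an LLM assigned after reading discharge summaries).
    Returns a list of (preferred_id, other_id, weight) tuples.
    """
    pairs = []
    for a, b in combinations(sorted(quality), 2):
        gap = quality[a] - quality[b]
        if abs(gap) < min_gap:
            continue  # skip near-ties: the scores do not clearly rank them
        winner, loser = (a, b) if gap > 0 else (b, a)
        weight = min(relevance[a], relevance[b])  # assumed weighting rule
        pairs.append((winner, loser, weight))
    return pairs


def bradley_terry_loss(rewards, pairs):
    """Weighted negative log-likelihood of P(w beats l) = sigmoid(R_w - R_l)."""
    total, norm = 0.0, 0.0
    for winner, loser, weight in pairs:
        margin = rewards[winner] - rewards[loser]
        total += weight * math.log1p(math.exp(-margin))  # -log sigmoid(margin)
        norm += weight
    return total / norm if norm else 0.0
```

A reward model that ranks trajectories the same way as the quality scores drives this loss toward zero, while an uninformative reward (all trajectories equal) gives a loss of ln 2.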

2604.08870 2026-05-26 cs.LG cs.AI

Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

学习分析中的时间辍学风险:跨动态与早期窗口表示的协调生存基准

Rafael da Silva, Jeff Eicher, Gregory Longo

AI总结 本研究使用OULAD数据集,通过协调的生存分析基准(包括动态周表示和连续时间表示)评估辍学风险模型,发现时间行为特征比静态背景属性更具预测力。

Comments 34 pages, 14 figures, 18 tables. Includes appendix with reliability diagrams, sensitivity analyses, and dataset audit tables

详情
AI中文摘要

学生辍学是学习分析中持续关注的问题,然而比较研究经常在异质协议下评估预测模型,优先考虑区分度而非时间可解释性和校准。本研究引入了一个面向生存的基准,用于使用开放大学学习分析数据集(OULAD)进行时间辍学风险建模。比较了两个协调分支:一个动态周分支,采用人-时期表示的模型;以及一个可比较的连续时间分支,扩展了模型家族——基于树的生存模型、参数模型和神经网络模型。评估协议整合了四个分析层面:预测性能、消融、可解释性和校准。结果在每个分支内分别报告,因为跨分支单一排名在方法论上不合理。在可比较分支中,随机生存森林在区分度和特定时间点的Brier分数上领先;在动态分支中,泊松分段指数在紧密的五家族聚类中在综合Brier分数上略微领先。无重新拟合的自举抽样变异性表明,这些排名应视为方向性信号而非绝对优势。消融和可解释性分析在所有家族中收敛于一个共同发现:主导预测信号主要不是人口统计学或结构性的,而是时间和行为性的。校准在更好区分的模型中证实了这一模式,但XGBoost AFT除外,它表现出系统性偏差。这些结果支持在学习分析中采用协调的多维基准的价值,并将辍学风险定位为一个时间行为过程,而非静态背景属性的函数。

英文摘要

Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families: tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
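The person-period representation used by the dynamic weekly arm can be illustrated with a short Python sketch. It shows the generic discrete-time survival expansion (one row per student per week at risk), not the benchmark's own preprocessing; the record layout `(student_id, last_week, dropped_out)` and the function names are assumptions made for the example.

```python
def to_person_period(records, horizon):
    """Expand one row per student into one row per student-week at risk.

    records: iterable of (student_id, last_week, dropped_out) tuples, where
             last_week is the 1-based week of dropout or of last observation.
    horizon: final week of follow-up; later weeks are not generated.
    """
    rows = []
    for student_id, last_week, dropped_out in records:
        for week in range(1, min(last_week, horizon) + 1):
            # The event flag is 1 only in the week the dropout occurs.
            event = int(dropped_out and week == last_week)
            rows.append({"student": student_id, "week": week, "event": event})
    return rows


def weekly_hazard(rows):
    """Empirical discrete-time hazard: dropouts / students at risk, per week."""
    at_risk, events = {}, {}
    for row in rows:
        at_risk[row["week"]] = at_risk.get(row["week"], 0) + 1
        events[row["week"]] = events.get(row["week"], 0) + row["event"]
    return {week: events[week] / at_risk[week] for week in sorted(at_risk)}
```

Once the data are in this shape, any binary classifier fitted on the `event` column estimates a weekly hazard, which is what makes time-varying behavioral features easy to include.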

2604.08213 2026-05-26 cs.CV cs.AI

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

EditCaption: 用于图像编辑指令合成的人工精炼SFT与HAE-DPO

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

AI总结 提出EditCaption两阶段后训练流程,通过人工精炼SFT和基于难度自适应错误感知DPO(HAE-DPO)提升图像编辑指令合成质量,显著降低关键错误率并超越现有模型。

详情
AI中文摘要

高质量的源-目标图像对及精确的编辑指令对于指令引导的图像编辑至关重要,但大规模构建此类训练三元组成本高昂。最近的流程通常依赖视觉语言模型自动合成编辑指令,但我们发现强大的VLM仍难以描述图像对之间的视觉变换。具体而言,它们表现出三种反复出现的失败模式:方向不一致、视角模糊和缺少细粒度属性。在400个图像对的人工评估中,多个开源VLM基线产生超过47%的关键错误率,使得许多合成指令不适合下游训练。为解决此问题,我们提出EditCaption,一种用于图像编辑指令合成的两阶段后训练流程。首先,通过基于GLM的自动字幕生成、EditScore过滤和人工精炼构建100K监督微调数据集。其次,收集10K人工标注的偏好对,其中每个被拒绝的指令都标注了其主要错误类型和严重程度。基于此数据集,我们提出难度自适应错误感知DPO(HAE-DPO),一种任务适配的DPO目标,它引入了基于人工标注的严重程度、失败模式类型和参考模型难度的自适应边界。在三个基准上的实验表明,我们的235B模型经过SFT+HAE-DPO后在开源和闭源模型中达到最先进性能,在Eval-400、HQ-Edit和ByteMorph-Bench上分别获得4.720、4.672和4.651分——在所有三个基准上均超越Gemini-3-Pro。人工评估证实关键错误率从47.75%降至17.50%,正确率从41.75%提升至70.25%,超越Gemini-3-Pro(66.00%)。

英文摘要

High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on vision-language models to synthesize editing instructions automatically, but we find that strong VLMs still struggle to describe visual transformations between image pairs. In particular, they exhibit three recurring failure modes: orientation inconsistency, viewpoint ambiguity, and missing fine-grained attributes. In a human evaluation on 400 image pairs, several open-source VLM baselines produce critical-error rates above 47%, making many synthesized instructions unsuitable for downstream training. To address this, we propose EditCaption, a two-stage post-training pipeline for image editing instruction synthesis. First, we construct a 100K supervised fine-tuning dataset through GLM-based auto-captioning, EditScore filtering, and human refinement. Second, we collect 10K human-annotated preference pairs, where each rejected instruction is labeled with its primary error type and severity. Based on this dataset, we propose Hardness-Adaptive Error-Aware DPO (HAE-DPO), a task-adapted DPO objective that introduces an adaptive margin based on human-labeled severity, failure-mode type, and reference-model hardness. Experiments across three benchmarks demonstrate that our 235B model with SFT+HAE-DPO achieves state-of-the-art performance among open-source and closed models, scoring 4.720 on Eval-400, 4.672 on HQ-Edit, and 4.651 on ByteMorph-Bench, surpassing Gemini-3-Pro on all three. Human evaluation confirms critical error rates drop from 47.75% to 17.50%, with correct rates improving from 41.75% to 70.25%, surpassing Gemini-3-Pro (66.00%).
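To make the idea of a DPO objective with an adaptive margin concrete, here is a minimal Python sketch. The loss is the standard DPO logit with a margin subtracted before the log-sigmoid; the abstract does not give the margin formula, so `adaptive_margin`, its multiplicative form, the `type_weights` values, and the `scale` parameter are hypothetical placeholders rather than the HAE-DPO definition.

```python
import math


def adaptive_margin(severity, error_type, hardness,
                    type_weights=None, scale=0.5):
    """Hypothetical margin that grows with severity, error type, and hardness."""
    type_weights = type_weights or {
        "orientation": 1.0, "viewpoint": 0.8, "missing_attribute": 0.6,
    }
    return scale * severity * type_weights[error_type] * (1.0 + hardness)


def dpo_loss_with_margin(logp_chosen, logp_rejected,
                         ref_logp_chosen, ref_logp_rejected,
                         margin=0.0, beta=0.1):
    """Standard DPO logit minus a margin: -log sigmoid(beta * delta - margin)."""
    delta = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    logit = beta * delta - margin
    # Numerically stable softplus(-logit), which equals -log(sigmoid(logit)).
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

With `margin=0.0` this reduces to the usual DPO loss; a larger margin asks the policy to separate the chosen and rejected instructions more strongly before the loss flattens out, which is the intuition behind penalizing severe errors harder.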

2604.07039 2026-05-26 cs.RO cs.AI

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

AEROS:一种具有具身能力模块的单智能体操作架构

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

AI总结 提出AEROS架构,将机器人建模为单一持久智能主体,通过可安装的具身能力模块扩展能力,实现模块化可扩展性、可组合能力执行和一致的系统级安全。

Comments Submitted to Engineering Applications of Artificial Intelligence (EAAI). 48 pages, 5 figures, 9 tables

详情
AI中文摘要

机器人系统缺乏一种原则性的抽象来统一组织智能、能力和执行。现有方法要么在单体架构中耦合技能,要么将功能分解为松散协调的模块或多个智能体,通常缺乏一致的标识和控制权限模型。我们认为,机器人应被建模为一个单一的持久智能主体,其能力通过可安装的包来扩展。我们将这一观点形式化为AEROS(智能体执行运行时操作系统),其中每个机器人对应一个持久智能体,能力通过具身能力模块(ECM)提供。每个ECM封装了可执行技能、模型和工具,而执行约束和安全保证由策略分离的运行时强制执行。这种分离实现了模块化可扩展性、可组合能力执行和一致的系统级安全。我们在PyBullet仿真中使用Franka Panda 7自由度机械臂评估了一个参考实现,进行了八项实验,涵盖重新规划、故障恢复、策略执行、基线比较、跨任务通用性、ECM热插拔、消融和故障边界分析。每个条件下超过100次随机试验,AEROS在三个任务上实现了100%的任务成功率,而基线(BehaviorTree.CPP风格和ProgPrompt风格为92-93%,扁平流水线为67-73%);策略层阻止了所有无效动作,零误接受;运行时优势跨任务泛化,无需特定任务调整;ECM在运行时加载,交换后成功率为100%。

英文摘要

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks versus baselines (BehaviorTree.CPP-style and ProgPrompt-style at 92-93%, flat pipeline at 67-73%); the policy layer blocks all invalid actions with zero false acceptances; runtime benefits generalize across tasks without task-specific tuning; and ECMs load at runtime with 100% post-swap success.

2604.05550 2026-05-26 cs.CL cs.CE

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

AutoSOTA:面向最先进AI模型发现的端到端自动化研究系统

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

AI总结 提出AutoSOTA系统,采用多智能体架构实现从论文复现到模型优化的全自动化,成功发现105个超越原始方法的新SOTA模型。

详情
AI中文摘要

人工智能研究越来越依赖于长时间的复现、调试和迭代优化以达到最先进(SOTA)性能,这催生了对能够加速经验模型优化全流程的系统的需求。在这项工作中,我们介绍了AutoSOTA,一个端到端的自动化研究系统,它将顶级AI论文中发布的最新SOTA模型推进为可复现且经验上改进的新SOTA模型。我们通过三个紧密耦合的阶段来形式化这个问题:资源准备与目标设定;实验评估;以及反思与构思。为了解决这个问题,AutoSOTA采用了一种多智能体架构,包含八个专门化的智能体,它们协同工作,将论文与代码和依赖项对应起来,初始化和修复执行环境,跟踪长期实验,生成并调度优化想法,并监督有效性以避免虚假收益。我们在从八个顶级AI会议收集的最新研究论文上评估AutoSOTA,并根据代码可用性和执行成本进行过滤。在这些论文中,AutoSOTA在自动复现和后续优化方面均取得了强大的端到端性能。具体来说,它成功发现了105个超越原始报告方法的新SOTA模型,平均每篇论文耗时约五小时。涵盖LLM、NLP、计算机视觉、时间序列和优化的案例研究进一步表明,该系统可以超越常规的超参数调优,识别架构创新、算法重新设计和工作流级别的改进。这些结果表明,端到端的研究自动化不仅可以作为性能优化器,还可以作为一种新型的研究基础设施,减少重复性实验负担,帮助将人类注意力重新引导到更高层次的科学创造力上。

英文摘要

Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

2604.04707 2026-05-26 cs.CV

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

OpenWorldLib: 高级世界模型的统一代码库与定义

DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Juanxi Tian, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang

AI总结 本文提出OpenWorldLib框架,基于对世界模型演化的分析给出清晰定义,并系统分类其核心能力,实现多任务模型的统一集成与高效推理。

Comments 28 pages, 6 figures

详情
AI中文摘要

世界模型作为人工智能中一个前景广阔的研究方向已引起广泛关注,但仍缺乏清晰统一的定义。本文中,我们介绍了OpenWorldLib,一个针对高级世界模型的全面且标准化的推理框架。借鉴世界模型的演化,我们提出一个明确的定义:世界模型是以感知为中心、具备交互和长期记忆能力、用于理解和预测复杂世界的模型或框架。我们进一步系统性地分类了世界模型的基本能力。基于这一定义,OpenWorldLib将不同任务的模型集成在统一框架内,实现高效复用和协同推理。最后,我们对世界模型研究的潜在未来方向提出了额外的思考和分析。代码链接:https://github.com/OpenDCAI/OpenWorldLib

英文摘要

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

2604.03318 2026-05-26 cs.CV

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

EgoMind: 通过多模态大语言模型中的语言推理激活空间认知

Zhenghao Chen, Huiqun Wang, Di Huang

AI总结 提出EgoMind框架,通过角色扮演描述和渐进空间分析的无几何空间推理方法,仅用少量数据即可在多基准测试中提升MLLMs的空间推理能力。

Comments Accepted by CVPR 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地应用于空间认知任务,期望它们能够理解并与复杂环境交互。现有工作大多通过引入3D先验或几何监督来改进空间推理,这虽然提升了性能,但带来了大量的数据准备和对齐成本。相比之下,纯2D方法由于捕获跨帧空间关系的能力有限,往往在多帧空间推理中表现不佳。为了解决这些限制,我们提出了EgoMind,一个思维链框架,通过角色扮演描述(联合构建跨帧的一致语言场景图)和渐进空间分析(逐步推理任务特定问题)实现无几何空间推理。仅使用5K自动生成的SFT样本和20K RL样本,EgoMind在VSI-Bench、SPAR-Bench、SITE-Bench和SPBench上取得了有竞争力的结果,证明了其在增强MLLMs空间推理能力方面的有效性,并突出了语言推理在空间认知中的潜力。代码和数据已发布在https://github.com/Hyggge/EgoMind。

英文摘要

Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.

2604.00499 2026-05-26 cs.LG

Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

具有不确定性感知输出长度预测的LLM推理调度

Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang

AI总结 针对LLM推理调度,提出基于对数t分布的输出长度分布模型,并设计Tail Inflated Expectation (TIE)指标替代点估计,以降低在线推理延迟并提高离线吞吐量。

Comments Accepted at ICML 2026

详情
AI中文摘要

为了调度LLM推理,最短作业优先(SJF)原则通过优先处理输出长度短的请求来避免队头(HOL)阻塞。现有方法通常为每个请求预测单个输出长度以促进调度。我们认为这种点估计与LLM推理的随机解码过程不匹配,其中输出长度本质上是不确定的,由何时采样到序列结束(EOS)标记决定。因此,每个请求的输出长度应拟合为分布而非单个值。通过对经验数据和随机解码过程的深入分析,我们观察到输出长度服从重尾分布,并可以用对数t分布拟合。在此基础上,我们提出一个简单的指标,称为Tail Inflated Expectation (TIE),用于替换SJF调度中的输出长度,该指标通过对数t分布的期望进行尾部概率调整,以考虑请求生成长输出的风险。为了评估我们的TIE调度器,我们将其与三个强基线进行比较,结果表明,TIE将在线推理的每token延迟降低了2.31×,并将离线数据生成的吞吐量提高了1.42×。

英文摘要

To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces the per-token latency by 2.31× for online inference and improves throughput by 1.42× for offline data generation.
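The difference between scheduling on a point estimate and scheduling on a risk-adjusted score can be shown with a small Python sketch. The abstract does not state the TIE formula, so `tail_inflated_score` below is an illustrative stand-in (capped mean plus a tail-probability penalty), and `cap`, `tail_threshold`, and `risk_weight` are hypothetical parameters. The sampler draws lengths whose logarithm is Student-t distributed; the cap matters because an untruncated log-t variable has no finite mean.

```python
import math
import random


def sample_log_t_lengths(mu, sigma, dof, n, cap, rng):
    """Draw n output lengths whose logarithm follows a scaled Student-t law."""
    lengths = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with `dof` d.o.f.
        t = z / math.sqrt(chi2 / dof)
        # Clamp the exponent so exp() cannot overflow, i.e. cap at max tokens.
        length = math.exp(min(mu + sigma * t, math.log(cap)))
        lengths.append(max(1, int(length)))
    return lengths


def tail_inflated_score(lengths, cap, tail_threshold, risk_weight=1.0):
    """Illustrative score: capped mean plus a penalty for long-output risk."""
    capped = [min(length, cap) for length in lengths]
    mean = sum(capped) / len(capped)
    tail_prob = sum(length > tail_threshold for length in capped) / len(capped)
    return mean + risk_weight * tail_prob * cap


def sjf_order(scores):
    """Shortest-job-first: serve requests in ascending score order."""
    return sorted(scores, key=scores.get)
```

A request whose typical output is short but which occasionally runs to the token limit ends up behind a request with a steady medium length, which is the kind of reordering a median-based point estimate would miss.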

2603.28716 2026-05-26 cs.AI

Dynamic Dual-Granularity Skill Bank for Agentic RL

动态双粒度技能库用于智能体强化学习

Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, Dong Li, Dongbin Zhao

AI总结 提出D2Skill,一种动态双粒度技能库,通过任务技能和步骤技能分别提供高层指导和细粒度决策支持,利用配对基线回放和技能注入回放的性能差距更新技能和优化策略,在ALFWorld等任务上显著提升性能。

Comments 19 pages

详情
AI中文摘要

智能体强化学习可以从可重复使用的经验中显著受益,但现有的基于技能的方法主要提取轨迹级指导,并且通常缺乏维护不断演化的技能记忆的原则性机制。我们提出D2Skill,一种用于智能体强化学习的动态双粒度技能库,将可重复使用的经验组织成任务技能(用于高层指导)和步骤技能(用于细粒度决策支持和错误纠正)。D2Skill通过在同一策略下进行配对的基线回放和技能注入回放,利用它们的性能差距推导出事后效用信号,用于技能更新和策略优化。技能库完全由训练时的经验构建,通过反思不断扩展,并通过效用感知的检索和修剪进行维护。在ALFWorld、WebShop和Search-Augmented QA任务上的实验表明,D2Skill在不同规模的模型上显著优于无技能的基线。进一步的消融和分析表明,双粒度技能建模和动态技能维护对这些增益至关重要,而学习到的技能表现出更高的效用,能够跨评估设置迁移,并且仅带来适度的训练开销。

英文摘要

Agentic RL can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld, WebShop, and Search-Augmented QA tasks show that D2Skill substantially improves performance over skill-free baselines across models of different scales. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
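The utility-driven maintenance loop described above can be sketched as a toy Python class. This is only an illustration of the bookkeeping: the exponential moving average update, the `decay` parameter, and the `SkillBank` interface are assumptions, and the sketch does not cover how D2Skill generates skills through reflection or feeds the signal into policy optimization.

```python
class SkillBank:
    """Toy skill store with utility tracking, retrieval, and pruning."""

    def __init__(self, decay=0.9):
        self.decay = decay   # EMA smoothing factor (assumed, not from the paper)
        self.utility = {}    # skill text -> running utility estimate

    def add(self, skill, initial_utility=0.0):
        self.utility.setdefault(skill, initial_utility)

    def update(self, skill, baseline_return, injected_return):
        """Hindsight utility: gain of the skill-injected rollout over baseline."""
        gap = injected_return - baseline_return
        old = self.utility[skill]
        self.utility[skill] = self.decay * old + (1 - self.decay) * gap

    def retrieve(self, top_k):
        """Return the top_k skills ranked by current utility."""
        ranked = sorted(self.utility, key=self.utility.get, reverse=True)
        return ranked[:top_k]

    def prune(self, min_utility):
        """Drop skills whose utility estimate fell below the threshold."""
        self.utility = {s: u for s, u in self.utility.items() if u >= min_utility}
```

Because both rollouts in a pair run under the same policy, the return gap isolates the effect of the injected skill, so skills that hurt performance drift toward negative utility and are eventually pruned.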

2603.18908 2026-05-26 cs.AI

Characterizing Linear Alignment Across Language Models

表征语言模型间的线性对齐

Matt Gorbett, Suman Jana

AI总结 研究独立训练的大语言模型间是否存在线性对齐,并探索其在文本生成、嵌入分类、分布外检测及隐私保护跨孤岛推理中的应用。

详情
AI中文摘要

语言模型似乎越来越多地学习到相似的表示,尽管训练目标、架构和数据模态存在差异。这种独立训练模型之间新兴的兼容性为跨模型对齐下游目标带来了新的机会。此外,这种能力解锁了新的潜在应用领域,例如在安全、隐私或竞争约束禁止直接数据或模型共享的场景中。在这项工作中,我们研究了表示收敛在多大程度上实现了大语言模型之间的实用线性对齐。具体来说,我们学习独立模型最终隐藏状态之间的仿射变换,并在文本生成、嵌入分类和分布外检测中经验性地评估这些映射。我们发现,模型对之间的性能基本保持不变,并首次证明线性对齐有时能够实现跨独立训练模型的文本生成。我们进一步强调了线性对齐在隐私保护跨孤岛推理中的潜在应用。该框架在共享公共数据集上学习仿射变换,并使用同态加密来保护客户端查询。通过仅加密线性分类操作,该方法实现了亚秒级推理延迟。

英文摘要

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, this capability unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we investigate the extent to which representational convergence enables practical linear alignment between large language models. Specifically, we learn affine transformations between the final hidden states of independent models and empirically evaluate these mappings across text generation, embedding classification, and out-of-distribution detection. We find that performance is largely preserved across model pairs, and show for the first time that linear alignment sometimes enables text generation across independently trained models. We further highlight a potential application of linear alignment for privacy-preserving cross-silo inference. The framework learns an affine transformation over a shared public dataset and uses homomorphic encryption to protect client queries. By encrypting only the linear classification operation, the method achieves sub-second inference latency.

2603.16105 2026-05-26 cs.CL cs.AI

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

频率至关重要:用于剪枝和量化的快速模型无关数据筛选

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

AI总结 提出一种基于Zipf幂律的模型无关数据筛选策略ZipCal,通过最大化词汇多样性来选择校准数据,在剪枝和量化中实现与依赖模型困惑度的最先进方法相当的性能,且速度快约240倍。

Comments Added statistical analysis, mechanistic analysis and a comparison with a generative baseline. 22 pages

详情
AI中文摘要

训练后模型压缩对于增强大型语言模型(LLMs)的可移植性同时保持其性能至关重要。虽然已经提出了几种压缩方法,但较少关注选择最合适的数据集(所谓的校准数据)来寻找压缩模型配置。校准数据的选择是保留模型在任务内和任务间能力的关键步骤。在这项工作中,我们通过分析内在数据属性而非模型特定信号,解决了为剪枝和量化识别高性能校准集的挑战。我们引入了ZipCal,一种基于Zipf幂律最大化词汇多样性的模型无关数据筛选策略。实验表明,我们的方法在各种剪枝基准测试中始终优于标准的均匀随机采样。值得注意的是,在下游性能方面,它与依赖模型困惑度的最先进方法表现相当。后者在大规模模型和数据集上变得极其昂贵,而ZipCal由于其可处理的线性复杂度,平均快约240倍。

英文摘要

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large-scale models and datasets, while ZipCal is on average about 240× faster due to its tractable linear complexity. The code and the experiments are available at https://github.com/FrancescoMonaco/ZipCal.
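As a rough illustration of selecting calibration data by lexical diversity alone, the Python sketch below greedily picks documents that add the most not-yet-covered words, weighting rare words more heavily. It is not the ZipCal algorithm: the abstract does not specify the selection rule, the inverse-document-frequency weighting is an assumption, and this greedy loop costs O(k·n) gain evaluations rather than the linear complexity the authors report.

```python
from collections import Counter


def select_calibration_set(documents, k):
    """Greedily pick k documents that add the most not-yet-covered words.

    Rare words get larger weights (1 / document frequency), a simple stand-in
    for exploiting the Zipfian long tail. Returns indices into `documents`.
    """
    token_sets = [set(doc.lower().split()) for doc in documents]
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    covered, chosen = set(), []
    remaining = set(range(len(documents)))
    for _ in range(min(k, len(documents))):
        def gain(i):
            return sum(1.0 / doc_freq[tok] for tok in token_sets[i] - covered)
        best = max(sorted(remaining), key=gain)  # ties go to the lowest index
        chosen.append(best)
        covered |= token_sets[best]
        remaining.remove(best)
    return chosen
```

No model forward pass is involved at any point, which is the property that makes data-only criteria cheap compared with perplexity-based selection.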

2603.12983 2026-05-26 cs.CL cs.AI

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

人工标注是否必要?用于机器翻译错误跨度检测的迭代MBR蒸馏

Boxuan Lyu, Haiyue Song, Zhi Qu

AI总结 提出一种基于最小贝叶斯风险解码的迭代MBR蒸馏自演化框架,利用现成大语言模型生成伪标签,无需人工标注即可在错误跨度检测任务上超越监督基线。

详情
AI中文摘要

错误跨度检测(ESD)是机器翻译(MT)评估中的一个关键子任务,旨在识别翻译错误的位置和严重程度。虽然对人工标注数据微调模型能提升ESD性能,但获取此类数据成本高昂且标注者之间容易不一致。为解决这一问题,我们提出一种基于最小贝叶斯风险(MBR)解码的新型自演化框架,命名为用于ESD的迭代MBR蒸馏,该框架通过利用现成的大语言模型(LLM)生成伪标签,消除了对人工标注的依赖。在WMT Metrics Shared Task数据集上的大量实验表明,仅在这些自生成伪标签上训练的模型在系统和跨度层面上均优于未适应的基础模型和基于人工标注的有监督基线,同时保持有竞争力的句子级性能。

英文摘要

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
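MBR decoding over sampled annotations can be sketched in Python as follows. The sketch shows the generic MBR selection rule (pick the candidate with the highest average utility against the other candidates); the character-level span F1 used as the utility, and the function names, are assumptions for illustration and may differ from the utility the authors use.

```python
def span_f1(hyp, ref):
    """Character-level F1 between two lists of (start, end) error spans."""
    hyp_chars = {i for s, e in hyp for i in range(s, e)}
    ref_chars = {i for s, e in ref for i in range(s, e)}
    if not hyp_chars and not ref_chars:
        return 1.0  # both say "no errors": perfect agreement
    overlap = len(hyp_chars & ref_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=span_f1):
    """Return the candidate with the highest mean utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]
```

In a distillation loop of the kind the abstract describes, the selected annotation would serve as the pseudo-label for the next round of training, so outlier annotations that disagree with most other samples are filtered out without any human reference.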

2603.09943 2026-05-26 cs.AI

PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs

PathMem: 面向病理学多模态大模型的认知对齐记忆转换

Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu

AI总结 提出PathMem框架,通过长期记忆与工作记忆的动态转换机制,实现结构化病理知识整合与可解释记忆控制,在WSI报告生成和开放诊断任务上达到SOTA。

详情
AI中文摘要

计算病理学需要视觉模式识别与结构化领域知识(包括分类学、分级标准和临床证据)的动态整合。在实践中,诊断推理需要将形态学证据与正式诊断和分级标准联系起来。尽管多模态大语言模型(MLLMs)展现出强大的视觉语言推理能力,但它们缺乏结构化知识整合和可解释记忆控制的显式机制。因此,现有模型在推理过程中难以一致地融入病理学特定的诊断标准。受人类病理学家层级记忆过程的启发,我们提出了PathMem,一种面向病理学MLLMs的以记忆为中心的多模态框架。PathMem将结构化病理知识组织为长期记忆(LTM),并引入记忆变换器(Memory Transformer),通过多模态记忆激活和上下文感知知识接地建模从LTM到工作记忆(WM)的动态转换,从而实现用于下游推理的上下文感知记忆细化。PathMem在多个基准测试中达到SOTA性能,在WSI-Bench报告生成(WSI-Precision提升12.8%,WSI-Relevance提升10.1%)和开放式诊断任务上分别比先前的基于WSI的模型提升9.7%和8.9%。

英文摘要

Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision-language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.