arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17795 2026-05-19 cs.LG cs.CV

When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection

Ningkang Peng, Jingyang Mao, Runhan Zhou, Peirong Ma, Yanhui Gu

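AI summary: This paper studies uncertainty collapse between noisy-label learning and out-of-distribution (OOD) detection. It proposes a learner-agnostic ACC-OOD benchmark, shows that high accuracy does not guarantee OOD reliability, and studies Virtual Margin Regularization as a way to mitigate the problem.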

Abstract

Learning with noisy labels (LNL) is typically benchmarked by closed-set classification accuracy, yet deployment often requires classifiers to reject out-of-distribution (OOD) inputs. We present a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with standardized near-/far-OOD routing and post-hoc scores across synthetic and real label noise. The benchmark reveals a recurring failure mode: high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training. We term this pathology uncertainty collapse. This structural overlap can make high-accuracy LNL methods lose separability at the ID-error/OOD interface under standard OOD scores. As an intervention, we study Virtual Margin Regularization (VMR), a lightweight repair probe demonstrated mainly with PSSCL that synthesizes boundary virtual outliers on trusted ID batches and widens the energy margin. VMR partially reduces the collapse-induced far-OOD failure without replacing the host objective or sacrificing closed-set accuracy in the tested settings. These results support LNL benchmarks that co-report closed-set generalization, open-world reliability, and structural overlap diagnostics.
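
The abstract refers to post-hoc OOD scores and an energy margin without giving formulas. As a minimal sketch, assuming the commonly used energy score (temperature times the log-sum-exp of the logits), the following shows how a low-confidence in-distribution sample and an OOD input can receive the same score. The logits are invented for illustration and are not taken from the paper.

```python
import math

def energy_score(logits, temperature=1.0):
    """Return T * logsumexp(logits / T); higher values suggest in-distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    lse = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return temperature * lse

# Hypothetical logits (assumptions, not data from the paper).
confident_id = [9.0, 0.5, 0.2]      # correctly classified, high confidence
misclassified_id = [1.1, 1.0, 0.9]  # in-distribution but nearly flat
far_ood = [1.0, 0.9, 1.1]           # OOD input with equally flat logits

for name, logits in [("confident ID", confident_id),
                     ("misclassified ID", misclassified_id),
                     ("far OOD", far_ood)]:
    print(name, round(energy_score(logits), 3))
```

Because the score depends only on the multiset of logits, the flat in-distribution sample and the OOD input are indistinguishable here, which is the kind of overlap the abstract describes.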

2605.17792 2026-05-19 cs.LG physics.geo-ph

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Zhi Li, Songkun Yan, Jie Cao, Mofan Zhang, Anjiang Wei, Jinwoong Yoo, Yang Hong

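AI summary: This paper examines whether frontier large language model (LLM) agents can replace human hydrologic modelers in calibrating hydrologic models. It proposes HydroAgent, which is fine-tuned with reinforcement learning with simulation feedback (RLSF) to improve adaptability and accuracy across basins.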

Abstract

Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management: streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents (Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash) on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329 to 40,792 km² ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HYDROAGENT, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations: reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data (remote sensing, in-situ time series, and forecaster narrative) makes domain agents a leveraged direction for AI in physical science.
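
The reward signal named in the abstract, Nash-Sutcliffe Efficiency, has a standard definition: one minus the ratio of squared simulation error to the variance of the observations around their mean. The sketch below computes it for a toy series. The function name and the flow values are illustrative assumptions, not part of the paper.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if not observed or len(observed) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series has zero variance")
    return 1.0 - residual / variance

# Hypothetical streamflow values (assumptions for illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
print(nash_sutcliffe_efficiency(observed, observed))         # perfect fit
print(nash_sutcliffe_efficiency(observed, [2.5] * 4))        # predicting the mean
print(nash_sutcliffe_efficiency(observed, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

An NSE of 1 is a perfect fit, 0 matches a forecast of the observed mean, and negative values are worse than that mean forecast, which puts the reported range of -0.16 to 0.75 in context.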

2605.17790 2026-05-19 cs.AI

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

Jiarui Su, Songjun Tu, Bei Sun, Xiaojun Liang

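AI summary: This paper proposes the STRIDE framework, which improves the reliability of automatic equation discovery by coordinating data-aware generation, mixed-fitting evaluation, critic-executor repair, and diversity-preserving semantic memory. Experiments show gains in accuracy, OOD robustness, and structural recovery across multiple LLM backbones.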

Comments 23 pages, 15 figures

Abstract

LLM-based equation discovery offers a promising route to recovering symbolic laws from data, but many systems still rely on generation-centered loops that propose candidates, fit parameters, score results, and reuse selected examples. Such loops can misjudge useful skeletons under unreliable fitting, discard near-correct equations that require repair, and accumulate redundant memories that provide limited guidance. We propose STRIDE, a self-reflective agent framework that improves reliability by coordinating data-aware generation, mixed-fitting evaluation, critic-executor repair, and diversity-preserving semantic memory. By turning fitted scores and candidate behavior into shared feedback, STRIDE enables equations to be proposed, assessed, refined, and reused within a closed-loop discovery process. Experiments on representative symbolic-regression benchmarks and LSR-Synth suites show that STRIDE improves accuracy, OOD robustness, and structural recovery across multiple LLM backbones, with ablations and analyses confirming the contribution of its core components.

2605.17789 2026-05-19 cs.CL cs.AI

SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?

Olukunle Owolabi

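AI summary: This paper introduces SocialMemBench, a benchmark for evaluating AI memory systems in multi-party social groups. Using human-verified synthetic social networks, it tests how well memory systems handle complex social scenarios such as shared history, group norms, and member departure.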

Abstract

Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and correctly attribute even after member departure. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes, Subject-Mem and SMG, provide evidence on two, and three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.

2605.17787 2026-05-19 cs.LG

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Mingyi Hong

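AI summary: Through empirical and theoretical analysis, this paper finds that SGD underperforms in LLM pre-training because it cannot sustain effective learning rates comparable to Adam's. The need for large effective learning rates stems from small gradient norms and large weight-to-gradient ratios, and is more pronounced at large batch sizes. With simple clipping mechanisms, SGD at large learning rates recovers most of Adam's performance, and the validation-loss gap shrinks from more than 50% to about 3.5%.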

Abstract

It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD's inability to sustain learning rates comparable to Adam's much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects severely restrict the admissible learning rate of SGD. Guided by this understanding, we show that simple clipping mechanisms that stabilize SGD at large learning rates enable it to recover most of Adam's performance. In our large-scale experiments, the validation loss gap between large-learning-rate SGD and Adam shrinks from more than 50% to only about 3.5% when pre-training a 1B-parameter LLaMA model with a 1M-token batch size.
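
The abstract does not say which clipping mechanisms are used. As one plausible illustration, the sketch below applies global-norm gradient clipping before a plain SGD update, so that a gradient spike cannot produce a step larger than the learning rate times `max_norm`. The function name, parameters, and numbers are assumptions made for this example and are not the authors' method.

```python
import math

def clipped_sgd_step(weights, grads, lr, max_norm):
    """One SGD step with the gradient rescaled so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    # Leave small gradients untouched; shrink large ones onto the norm ball.
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [w - lr * scale * g for w, g in zip(weights, grads)]

weights = [0.0, 0.0]
typical_grad = [0.3, 0.4]    # norm 0.5, below the threshold
spike_grad = [300.0, 400.0]  # norm 500, a hypothetical gradient spike

print(clipped_sgd_step(weights, typical_grad, lr=1.0, max_norm=1.0))
print(clipped_sgd_step(weights, spike_grad, lr=1.0, max_norm=1.0))
```

With clipping in place the step length is bounded regardless of the spike size, which is what lets a large learning rate remain usable in this toy setting.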

2605.17780 2026-05-19 cs.CV

Network Knowledge Prior Guided Learning for Data-Efficient Surface Defect Detection

Hang-Cheng Dong, Guodong Liu, Dong Ye, Bingguo Liu

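AI summary: This paper proposes a knowledge-guided loss function based on network knowledge priors that integrates model interpretability into training, improving the performance and trustworthiness of data-efficient surface defect detection.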

Abstract

Deep learning-based methods have become the de facto standard for industrial defect detection. However, their data-hungry nature and inherent "black-box" characteristics often lead to performance bottlenecks and limited trustworthiness in real-world applications. To address these challenges, this paper proposes a novel knowledge-guided loss function that seamlessly integrates model interpretability into the training process without incurring any additional inference cost. Our method operates in two phases: first, a primary classification network is trained, and its explanations, in the form of saliency maps, are generated as prior knowledge. Second, a multi-task learning framework is established, where the main task performs classification, and an auxiliary task imposes consistency between the saliency maps of the final model and the primary model. This consistency is enforced by a dedicated knowledge-guided loss term, effectively acting as a powerful regularizer to steer the model towards robust feature representations. Extensive experiments on multiple public defect datasets demonstrate that our approach consistently enhances the performance of baseline models in terms of accuracy and AP. Moreover, visual analysis reveals that the proposed method yields more concentrated and human-intelligible saliency maps. This work presents a simple yet effective paradigm for bridging the gap between model performance and interpretability, paving the way for more reliable and high-performing vision systems in industrial quality inspection.

2605.17777 2026-05-19 cs.CV

Efficient Sparse-to-Dense Visual Localization via Compact Gaussian Scene Representation and Accelerated Dense Pose Estimation

Zizhuo Li, Songchu Deng, Linfeng Tang, Jiayi Ma

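AI summary: This paper proposes LiteLoc, an efficient visual localization method that removes the redundant color field and accelerates dense pose estimation, substantially improving memory and computational efficiency while maintaining localization performance.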

Comments IEEE/CAA JAS 2026

Abstract

This letter presents LiteLoc, a novel and efficient localizer built on 3D Gaussian Splatting (3DGS). The previous state-of-the-art (SoTA) sparse-to-dense localizer, STDLoc, has shown remarkable localization capability but suffers from severe storage redundancy and computational latency. By revisiting its design decisions, we derive two simple yet highly effective improvements that cumulatively make LiteLoc much more efficient in both memory and computation, while also being easier to train. One key observation is that the color field, inherited directly from Feature 3DGS, is functionally useless for localization. Yet, its reconstruction of high-frequency photometric details necessitates excessive Gaussian primitives, resulting in a tightly coupled color-feature representation with significant memory overhead and sub-optimal feature field optimization. To resolve this, we propose a color-free decoupled feature field that constructs a compact Gaussian scene representation by retaining only task-essential feature attributes, thereby eliminating approximately 94% of redundant storage with no loss of localization-relevant information. We further find that the primary computational bottleneck lies in the dense Perspective-n-Point (PnP) solver, where most matches contribute saturated geometric constraints with diminishing accuracy gains. Accordingly, we propose a condensing strategy that distills dense matches into a subset of 5% representative matches, enabling a nearly 19-fold speedup in robust estimation with negligible performance drop. Extensive experiments show that LiteLoc surpasses STDLoc in multiple scenes with considerable efficiency benefits, opening up exciting prospects for latency-sensitive visual localization.

2605.17775 2026-05-19 cs.CL cs.AI

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

Jinghui Liu, Sarvesh Soni, Anthony Nguyen

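AI summary: This study systematically evaluates the quality of LLM-generated synthetic clinical notes through intrinsic, extrinsic, and factuality evaluations. Synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks but lose detail in fine-grained tasks such as ICD coding; rephrasing by chunks mitigates this loss at the cost of factual precision. Synthesis errors stem mainly from misinterpreted clinical context, temporal confusion, measurement errors, and fabricated claims, and the synthetic notes can effectively augment task-specific training for rare ICD codes.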

Abstract

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect, such as similarity or utility comparisons, even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes, despite their task-agnostic nature, can effectively augment task-specific training for rare ICD codes.

2605.17772 2026-05-19 cs.CV

Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework

Ziyang Liu, Hongyuan Wang, Zijian Wang, Yinxi Lu, Yunzhao Zang, Zhiqiang Yan, Qianhao Ning

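AI summary: This paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that uses quantitative similarity analysis to select the optimal surrogate model ensemble, addressing the tendency of physical adversarial attacks to overfit a single surrogate model and optimization objective. A dual-level mechanism balances attack efficiency with deep generalization, and an Orthogonal Gradient Alignment strategy resolves cross-model gradient conflicts, improving attack effectiveness and cross-task generalization.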

Comments Under review

Abstract

Physical adversarial attacks often overfit single surrogate models and optimization objectives. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, significantly degrading cross-model transferability. To bridge this gap, this paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual-level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross-model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real-world experiments demonstrate that JMOF outperforms state-of-the-art baselines against diverse black-box detectors. Crucially, JMOF exhibits substantial cross-vision-task generalization, generating attacks capable of simultaneously deceiving object detection and semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real-world deployments.

2605.17766 2026-05-19 cs.CV

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang

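AI summary: This paper proposes LatentUMM, which builds an enhanced shared latent space and explicitly aligns the transformations that map into and out of it to improve cross-modal consistency. Experiments show consistent gains in multimodal consistency across diverse architectures.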

Abstract

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.

2605.17765 2026-05-19 cs.LG

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models

Yuanyun Zhang, Shi Li

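AI summary: This paper proposes the AURORA framework, which orthogonalizes representations according to contextual latent geometry to address the semantic opacity and instability under contextual shift of latent representations in healthcare foundation models, improving robustness under institutional distribution shift and predictive performance.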

Abstract

Recent healthcare foundation models have achieved strong predictive performance through large-scale self-supervised learning, yet their latent representations frequently entangle physiologic severity, intervention intensity, observational structure, and institutional workflow into shared embedding directions. While effective for downstream prediction, such representations remain semantically opaque and unstable under contextual shift. We introduce AURORA, Adaptive Uncertainty aware Representations through Orthogonalized Relational Alignment, a new framework for healthcare representation learning based on contextual latent geometry. Rather than optimizing a single unified embedding manifold, AURORA decomposes representations into orthogonal semantic subspaces corresponding to distinct contextual factors and learns relational consistency objectives within each subspace. This induces latent spaces that are both semantically disentangled and geometrically interpretable. Across multiple clinical prediction and retrieval tasks, AURORA consistently outperforms reconstruction, contrastive, and self-distillation baselines while substantially improving contextual disentanglement, neighborhood purity, and robustness under institutional distribution shift. Our results suggest that latent geometry itself constitutes an important axis of healthcare foundation model design and that explicitly structuring representation space according to contextual semantics provides a complementary direction beyond conventional predictive compression objectives.

2605.17762 2026-05-19 cs.AI

Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

Paul Greyson, Zhichao Geng, Wei Zhang, Yang Yang

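AI summary: This paper presents a robust neural sparse retrieval system that combines an adapted sparse retrieval architecture with a domain-specific subword tokenization strategy, improving robustness to misspellings, transpositions, and phonetic variations in industrial music search and achieving higher recall with low latency.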

Comments accepted at SIGIR 2026 industry track

Abstract

Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual "exploration" to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system's ability to learn from long-tail queries. In this work, we present a robust neural sparse retrieval system designed to maximize exploration efficiency. We adapt a state-of-the-art inference-free sparse retrieval architecture to the music domain, combining it with an effective domain-specific granular subword tokenization strategy. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate 91.4% recall@10 (vs. 57.7% for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with +0.8% higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.
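
To make the trigram baseline mentioned in the abstract concrete, the sketch below splits strings into character trigrams and scores a misspelled query against a catalog title by Jaccard overlap. This illustrates plain n-gram matching only, not the neural sparse retriever. The helper names and the example query are assumptions for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

query = "beyonse"  # hypothetical misspelled query
title = "beyonce"  # hypothetical indexed title
overlap = jaccard(char_ngrams(query), char_ngrams(title))
print(sorted(char_ngrams(query) & char_ngrams(title)), overlap)
```

An exact string match would score this pair as a miss, while the trigram sets still share the prefix fragments, which is why short surface-form tokens tolerate single-character errors.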

2605.17759 2026-05-19 cs.CV

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

Lichen Ma, Zipeng Guo, Yu He, Xiaolong Fu, Luohang Liu, Jingling Fu, Junshi Huang, Yan Li

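AI summary: This paper proposes FrequencyBooster, a framework that strengthens full-frequency modeling in pixel diffusion models. A high-capacity decoder extracts high-frequency details and low-frequency semantics, yielding more precise pixel generation while preserving global structure.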

Abstract

To circumvent the inherent fidelity bottlenecks and optimization misalignment of VAE-based latent diffusion, pixel-space diffusion models have emerged as a compelling end-to-end paradigm. However, existing pixel diffusion models often struggle to balance computational efficiency with the preservation of high-frequency details. They frequently resort to patch-based compression or restricted local decoding, leading to a "spectral compromise" where high-frequency and fine-grained pixel information are suppressed. To address these challenges, we propose FrequencyBooster, a novel framework designed to empower pixel diffusion with full-frequency modeling capabilities without prohibitive overhead. The core of our method is a high-capacity decoder that specializes in extracting exhaustive high-frequency details and low-frequency semantics, the latter of which is derived from a Diffusion Transformer (DiT) backbone. Unlike prior works that sacrifice global context for local refinement, FrequencyBooster leverages high-dimensional feature representations to maintain global structural integrity while achieving superior pixel-level precision. Extensive experiments on ImageNet demonstrate the effectiveness of our approach: our model achieves a state-of-the-art FID of 1.60 at 256×256 resolution within only 320 epochs. Furthermore, at 512×512 resolution, FrequencyBooster attains an FID of 1.69, significantly outperforming existing pixel-space and latent-space generative models.

2605.17758 2026-05-19 cs.LG

Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

Nitish Nagesh, Mahdi Bagheri, Arshia Harish Puthran, Pengbao Zhou, Muhjaazee Love, Aadi Sharma, Ian Harris, Amir M. Rahmani

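AI summary: This paper presents Memisis, a tool that orchestrates and evaluates synthetic data by combining existing synthetic data tools, large language models, and state-of-the-art evaluation metrics, in order to improve the quality of data for downstream prediction tasks and clinical decision making.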

Abstract

Synthetic data is widely used in healthcare to create datasets that are similar to original data but without the privacy concerns. Generating and evaluating synthetic data across privacy, utility and fairness is crucial for facilitating high quality data availability for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by leveraging existing synthetic data tools, the power of large language models and state-of-the-art evaluation metrics. Our tool creates a unified workflow for data generation, validation and evaluation. Users have control over the training size, training epochs and the number of synthetic rows to sample. Instead of knobs to tune synthetic data, the interactive agent allows users to specify their synthetic data generation goals and the tool will orchestrate the workflow by leveraging existing tools while performing the requisite evaluation. For the demo, we use an open source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE and GaussianCopula have comparable performance across fairness and utility metrics. The workflow allows users flexibility and control over the data generation and evaluation process.

2605.17757 2026-05-19 cs.LG cs.AI cs.DC cs.PF

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR: 2位KV缓存量化中的离线频谱协方差感知旋转

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu

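AI summary: This paper proposes OSCAR, which estimates attention-aware covariance structures offline to make 2-bit KV-cache quantization accurate and efficient, and develops a deployable system that improves the performance and efficiency of LLM serving frameworks.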

Comments 35 pages, 10 figures

详情
AI中文摘要

INT2 KV-cache量化对于长上下文LLM服务具有吸引力,但实现准确性和可部署性仍然具有挑战。简单的旋转如Hadamard变换可以减少异常值,但仍然在INT2层面失效,因为它们与下游注意力不对齐。我们提出了OSCAR,一种超低比特KV缓存量化方法,通过离线估计注意力感知的协方差结构,并利用这些结构推导出固定旋转和截断阈值用于量化。这样,KV量化就与注意力实际消耗的协方差结构对齐。更重要的是,我们不仅提供了理论依据,还开发了一个完全可部署的OSCAR系统,包含一个定制的INT2注意力内核,该内核与分页KV缓存服务和融合内核流水线保持兼容,从而无缝集成到现代LLM服务框架中,如SGLang和vLLM。我们评估了我们的方法在最近的推理模型上,使用最多32k token的推理轨迹进行跨5个任务的测试。在Qwen3-4B-Thinking-2507和Qwen3-8B上,OSCAR将BF16精度差距分别减少到3.78和1.42个点,而朴素旋转INT2几乎归零。我们进一步将OSCAR扩展到Qwen3-32B和GLM-4.7(358B参数),其中它仍然与BF16保持有效相当。在长上下文-RULER-NIAH(最多128K)上,OSCAR在Qwen3模型上保持稳健,而朴素旋转INT2崩溃。从系统层面来看,OSCAR将KV缓存内存减少约8倍,在相同内存预算下,大批次大小下吞吐量提高最多7倍,并且由于内存带宽开销减少,单批次解码速度比BF16快最多3倍。

英文摘要

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. Beyond the theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the accuracy gap to BF16 to 3.78 and 1.42 points, respectively, while naive-rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On the long-context RULER-NIAH benchmark (up to 128K), OSCAR remains robust on both Qwen3 models, while naive-rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
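The rotate-then-quantize idea mentioned in the abstract can be illustrated with a toy sketch. This is not OSCAR's covariance-aware rotation: it uses a plain Hadamard rotation (the baseline the abstract refers to), a max-abs clipping threshold, and a made-up four-channel vector, all of which are assumptions chosen only to show why spreading an outlier across channels can lower 2-bit quantization error.

```python
import math

def hadamard(n):
    """Normalized Sylvester-Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-x for x in r] for r in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in r] for r in h]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def fake_quant_int2(v):
    """Quantize to 4 evenly spaced levels in [-clip, clip], then dequantize."""
    clip = max(abs(x) for x in v)    # assumption: max-abs clipping threshold
    step = 2 * clip / 3              # 4 levels span 3 intervals
    out = []
    for x in v:
        code = round((x + clip) / step)   # integer code in {0, 1, 2, 3}
        out.append(code * step - clip)
    return out

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

key = [8.0, 0.5, -0.5, 0.25]   # hypothetical key vector with one outlier channel
H = hadamard(4)

plain = fake_quant_int2(key)
# The Sylvester-Hadamard matrix is symmetric and orthogonal, so H is its own inverse.
rotated = matvec(H, fake_quant_int2(matvec(H, key)))

err_plain = sq_error(key, plain)
err_rotated = sq_error(key, rotated)
print(err_plain, err_rotated)
```

On this toy vector the rotated path has the lower reconstruction error. The abstract's point is that at INT2 such a generic rotation is still not enough, which is what motivates rotations derived from attention-aware statistics.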

2605.17755 2026-05-19 cs.CL cs.AI

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

弥合版本差距:多版本训练提升ICD代码预测,尤其是罕见代码

Jinghui Liu, Anthony Nguyen

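AI summary: This paper studies whether training version-independent models on data from different ICD versions can address the long-tail problem and the rare-code bottleneck in ICD code prediction; experiments show that multi-version training substantially improves micro F1 on rare codes and macro metrics on frequent codes.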

详情
AI中文摘要

临床编码将临床文档映射到标准化的医疗代码,这是一个关键但耗时的行政任务,可以通过自动化来改进。当前ICD编码模型通常针对特定版本的代码进行优化。然而,实际上ICD系统持续演进,不同版本在不同时期和地区被采用。此外,ICD编码面临长尾问题,罕见代码性能可能成为开发可实施模型的瓶颈。我们探讨了通过结合不同ICD版本的数据训练版本无关模型的可行性,这可能有助于解决这些挑战。我们将在修改后的标签注意力模型中加入ICD-9数据进行ICD-10预测训练,并发现尽管存在版本不匹配,加入ICD-9数据使18K个罕见ICD代码的微F1指标相比仅使用ICD-10训练提高了27%。在8K个频繁ICD-10代码上,多版本训练也显著提升了宏指标,并且模型参数更少。

英文摘要

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current ICD coding models are typically optimized for codes from a specific ICD version. In reality, however, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and performance on rare codes can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, multi-version training also substantially improves macro metrics, with far fewer model parameters.
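Since the abstract reports both micro and macro metrics, a small sketch of how the two averages differ may help. The per-code counts below are invented for illustration and are not results from the paper; the point is only that micro F1 pools counts across codes, while macro F1 weights every code equally and so is pulled down by poorly predicted rare codes.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(counts):
    """counts maps a code to its (tp, fp, fn) tuple."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)                                  # pool counts, then score
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # score, then average
    return micro, macro

# Hypothetical counts: two frequent codes and one rare code that is mostly missed.
counts = {
    "I10": (90, 10, 10),
    "E11.9": (80, 20, 20),
    "Q87.1": (1, 0, 9),
}
micro, macro = micro_macro_f1(counts)
print(round(micro, 4), round(macro, 4))
```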

2605.17749 2026-05-19 cs.LG stat.ML

Testable and Actionable Calibration for Full Swap Regret

可检验且可操作的全面交换懊悔校准

Konstantina Bairaktari, Lunjia Hu, Huy L. Nguyen, Jonathan Ullman

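AI summary: This paper introduces SCDL, a new calibration measure that is both actionable and testable without weakening either requirement, satisfies desirable properties such as continuity and consistency, and is shown experimentally to perform better in practice.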

详情
AI中文摘要

人工智能生成的预测越来越多地影响关键任务中的决策制定,因此必须具有可信度。校准是衡量可信度的一种广泛使用的度量标准,要求预测与真实频率匹配,并可以像真实概率一样对待某一结果。然而,定义校准是微妙的,设计良好的校准误差度量标准一直是最近研究的活跃主题。第一个目标是找到可操作的校准度量标准,即能够向决策者说明当预测被视为真实概率时的效用损失,这被称为交换懊悔。第二个目标是找到可检验的校准度量标准,即校准误差可以从少量预测和结果中测量出来。尽管这些是基本要求,但目前没有现有的校准度量标准能够完全满足这两个属性,所有现有的度量标准都通过限制交换懊悔的弱化观念来放松可操作性,或通过具有次优估计误差来放松可检验性。我们介绍了一种新的校准度量标准,称为软分箱校准决策损失(SCDL),我们证明其在不削弱任何要求的前提下是完全可操作的,并且可检验性具有几乎最优的误差率。此外,SCDL还满足其他理想属性,如连续性和一致性。我们还提供了一组实验,证明了SCDL与其他度量标准的理论优势在实践中导致更好的性能。

英文摘要

AI-generated predictions increasingly inform decision-making in critical tasks, and therefore must be trustworthy. One widely used measure of trustworthiness is calibration, which requires that the predictions match the true frequencies and can be treated like real probabilities of a given outcome. However, defining calibration is subtle, and designing good measures of calibration error has been an active topic of recent research. The first goal is to find calibration measures that are actionable, meaning they can inform decision makers about their utility loss when predictions are treated as true probabilities, which is known as swap regret. The second goal is to find calibration measures that are testable, meaning that calibration error can be measured from a small sample of predictions and outcomes. Although these are very basic requirements, no existing calibration measure fully satisfies both properties: all existing measures either relax actionability by bounding a weaker notion of swap regret, or relax testability by having suboptimal estimation error. We introduce a new calibration measure, Soft-Binned Calibration Decision Loss (SCDL), which we prove is fully actionable without weakening either requirement, and testable with nearly optimal error rate. In addition, SCDL satisfies other desired properties such as continuity and consistency. We also provide a set of experiments confirming that the theoretical advantages of SCDL over other measures lead to better performance in practice.
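For readers unfamiliar with calibration error, the sketch below computes the classic hard-binned expected calibration error (ECE) for binary outcomes. It is not SCDL, whose definition the abstract does not give; it is only a familiar baseline measure, shown here under the assumption of equal-width bins, to make "predictions match the true frequencies" concrete.

```python
def binned_ece(preds, outcomes, n_bins=10):
    """Hard-binned expected calibration error for binary outcomes (0 or 1)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # equal-width bins on [0, 1]
        bins[idx].append((p, y))
    n = len(preds)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_pred = sum(p for p, _ in b) / len(b)
        frequency = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_pred - frequency)
    return ece

# Predicting 0.8 when 8 of 10 outcomes are positive is calibrated.
well = binned_ece([0.8] * 10, [1] * 8 + [0] * 2)
# Predicting 0.9 when only 5 of 10 outcomes are positive is overconfident.
over = binned_ece([0.9] * 10, [1] * 5 + [0] * 5)
print(well, over)
```

Because a prediction that crosses a bin edge can change the value abruptly, hard-binned measures like this one are not continuous in the predictions, which is one of the properties the abstract lists as desirable.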

2605.17748 2026-05-19 cs.CV

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

通过全局-局部自适应交互释放视觉Transformer在图像质量评估中的潜力

Yu Li, Puchao Zhou, Yachun Mi, Yanfeng Wu, Xiaoming Wang, Shaohui Liu

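AI summary: This paper proposes a global-local adaptive interaction framework that uses dual-stream feature extraction and interactive global-local fusion to improve the prediction accuracy and robustness of image quality assessment while reducing the number of trainable parameters.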

详情
Journal ref
Proceedings of the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. [10567]-[10571], 2026
AI中文摘要

在盲图像质量评估(BIQA)领域,准确预测自然环境中真实失真图像的感知质量仍然极具挑战性,因为存在多样的复杂失真。尽管现有方法已取得显著准确性,但其可扩展性常受限于主观注释的高成本和可用数据集的有限规模。近年来,大规模预训练视觉模型的进步引入了强大的语义和表征能力,但其在IQA任务中的应用受到显著的计算需求和次优微调效率的阻碍。为克服这些限制,我们引入了全局-局部交互适配器(GLIA),一种新的框架,通过双流特征提取机制与交互式全局-局部融合有效利用预训练的视觉Transformer。通过同时保留全局语义信息和细粒度局部细节,我们的方法在显著减少可训练参数的同时,实现了优越的预测精度和鲁棒性。在多个基准上的广泛实验验证了我们方法的有效性和优越性。

英文摘要

In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.

2605.17746 2026-05-19 cs.AI cs.HC

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

实验中的代理,代理中的实验:一种面向人工智能增强型实验科学的设计语法

Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun

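AI summary: This paper introduces SEED, a framework that represents experimental conditions as typed actor-flow graphs to support generating and evaluating experimental designs, reports a feasibility test on a medical-triage design task, and discusses governance issues such as novelty and replication.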

详情
AI中文摘要

人工智能系统正成为组织和知识工作中的积极参与者。它们越来越多地与人类互动,协调工作流程,并在多代理安排中运作。因此,理解其影响需要的不仅仅是测量输出准确性,还需要关于机制、委托、反馈和控制的证据。实验仍然是这一任务的核心,但它们也面临递归挑战:我们需要为代理设计实验来研究这些安排,我们可能需要为实验设计设计代理以帮助搜索可能设计的扩展空间。然而,人类-人工智能和代理工作流程的实验条件仍然大多以散文形式指定,这使得它们难以比较、重用或审计。我们将其框架为AI增强型知识生产的流程表示、可追溯性和治理问题。我们引入SEED(结构编码用于实验发现),一个将实验条件表示为类型化代理-流程图的框架。SEED支持三种设计功能:将条件描述为交互结构、评估结构新颖性相对于编码的先前设计、以及在可行性和治理约束下生成候选设计。我们报告了一项轻量级的实证可行性测试,比较了图盲和SEED引导生成在医疗分诊设计任务中的表现。在这一诊断对比中,SEED引导的候选设计显示出更清晰的代理-流程变化、假设和治理检查,支持了该语法作为设计辅助工具的可行性。评论最后指出围绕新颖性、可重复性、有效性、探究多样性以及问责制的治理张力。

英文摘要

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.

2605.17743 2026-05-19 cs.CV

MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

MoASE++: 基于领域自适应在线蒸馏的激活稀疏专家混合模型用于持续测试时间适应

Ronyu Zhang, Aosong Cheng, Gaole Dai, Yulin Luo, Jiaming Liu, Li Du, Huanrui Yang, Dan Wang, Leyuan Fang, Yuan Du, Shanghang Zhang

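AI summary: This paper proposes MoASE++, a mixture of activation sparsity experts combined with domain-adaptive on-policy distillation, which separates domain-agnostic structure from domain-specific texture in continual test-time adaptation and improves continual adaptation in dynamic visual environments.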

详情
AI中文摘要

持续测试时间适应旨在将源预训练模型适应非平稳、未标记的目标流,同时保持过去的能力,但纹理偏见的骨干网络可能导致误差累积和灾难性遗忘。受人类视觉系统分离形状和纹理过程的启发,我们引入MoASE,一种插件式混合专家模型,利用具有空间可微置零的激活稀疏专家,将领域无关的结构与领域特定的纹理分离,形成互补的高激活和低激活路径,同时高阶和低阶瓶颈多样化表示。激活稀疏门产生输入自适应的SDD阈值以精确选择令牌,领域感知路由器利用纹理敏感线索为每个样本分配专家权重。为遏制对未标记流的确认偏见并稳定监督,我们引入领域自适应在线蒸馏构成MoASE++,包括基于EMA锚定的在线反KL蒸馏和基于熵和置信度的增强策略,使同一视图的预测对齐并提高鲁棒性-可塑性平衡。在分类(CIFAR-10/100-C,ImageNet-C)和语义分割(Cityscapes->ACDC)上的广泛实验表明,MoASE++在动态视觉环境中持续适应方面表现出一致的最先进性能,提供了一种原理明确、可控的持续适应方法。

英文摘要

Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.
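Two ingredients named in the abstract, an EMA-anchored teacher and a reverse KL objective, have standard textbook forms that can be sketched briefly. How MoASE++ combines them, which parameters are averaged, and the decay value are not stated in the abstract; the function names and the decay of 0.999 below are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(s * (math.log(s + eps) - math.log(t + eps))
               for s, t in zip(student_probs, teacher_probs))

def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]

student = softmax([2.0, 0.5, -1.0])
teacher = softmax([1.5, 0.7, -0.5])
loss = reverse_kl(student, teacher)
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
print(loss, new_teacher)
```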

2605.17742 2026-05-19 cs.CV cs.HC

UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation

UST-Hand: 一种面向3D自监督手姿态估计的不确定性感知时空点云交互网络

Tianhao Han, Haoyang Zhang, Liang Xie, Haochen Chang, Kun Gao, Yuan Cheng, Pengfei Ren, Erwei Yin

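AI summary: This paper proposes UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of hand poses and constructs a probabilistic point cloud feature space to model complex spatiotemporal relationships more stably; it achieves state-of-the-art performance on three challenging datasets, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).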

Comments Accepted by CVPR 2026

详情
AI中文摘要

手动标注准确的3D手姿态非常耗时且劳动密集。现有的自监督手姿态估计方法利用输入图像与渲染输出之间的差异或多视角一致性约束作为驱动因素来优化网络并逐步提高姿态精度。然而,这些方法对噪声伪标签高度敏感,并忽略了充分利用细粒度空间相关性的重要性,这削弱了模型训练的稳定性。为了解决这些问题,我们提出了UST-Hand,一种自监督学习框架,该框架估计手姿态的不确定性分布,并构建一个概率点云特征空间,从而能够建模复杂的时空关系。UST-Hand采用条件归一化流模型来捕捉手姿态分布,并采样多样假设,从而在噪声伪标签监督下实现稳健学习,具有增强的稳定性。这些多假设被映射到统一的概率3D点云空间中进行多视角和时间特征交互,全面探索手运动模式和细粒度空间相关性。在三个具有挑战性的数据集上的广泛实验表明,UST-Hand实现了最先进的性能,比现有自监督方法在均位点误差(MPVPE)上高出37.8%。

英文摘要

Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multi-view consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of hand poses and constructs a probabilistic point cloud feature space, enabling the modeling of complex spatiotemporal relationships. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and sample diverse hypotheses, facilitating robust and more stable learning under noisy pseudo-label supervision. These multiple hypotheses are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).
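The MPVPE metric cited above is, in its usual form, the mean Euclidean distance between predicted and ground-truth mesh vertices. The sketch below shows that computation for a single mesh; the two-vertex input is a made-up example, and real evaluations typically average over many frames and report millimetres, details the abstract does not specify.

```python
import math

def mpvpe(pred_vertices, gt_vertices):
    """Mean Euclidean distance between corresponding 3D vertices."""
    distances = [math.dist(p, g) for p, g in zip(pred_vertices, gt_vertices)]
    return sum(distances) / len(distances)

# Hypothetical two-vertex mesh: per-vertex errors of 3 and 4 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpvpe(pred, gt)
print(error)
```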

2605.17737 2026-05-19 cs.SD

Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

声纹分析:面向语音深度伪造检测的说话者特定音素指纹

Jun Xue, Tong Zhang, Zhuolin Yi, Yihuan Huang, Yi Chai, Yiyang Zhang, Yanzhen Ren

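AI summary: This paper proposes PVP, a phoneme-based voice profiling framework that uses micro-phonetic modeling to capture speaker-specific articulatory patterns, enabling efficient detection of speech deepfakes with fine-grained, phoneme-level interpretability.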

Comments Accepted by IJCAI 2026

详情
AI中文摘要

生成式人工智能的快速发展使音频深度伪造越来越难以与真实人类语音区分,对公众人物等目标人物构成重大威胁。当前的检测系统主要依赖通用的黑盒模型,无法捕捉说话者特有的发音特征且缺乏可解释性。本文提出Phoneme-based Voice Profiling (PVP),一种新颖的个性化防御框架。通过将检测范式从宏观语音分析转向微观音学建模,PVP捕捉了目标人物习惯性发音模式下的独特声学分布。具体而言,我们的框架利用轻量级高斯混合模型(GMM)对说话者特定的发音实现进行建模,仅需从真实参考语音中估计。这种设计实现了数据高效的建模,并且能够稳健地泛化到之前未见过的伪造攻击,而无需进行重的伪造特定训练。此外,我们引入了首个大规模的中文目标人物深度伪造数据集以基准测试说话者特定的检测。实验结果表明,PVP在目标人物伪造场景中显著优于最先进的通用检测器,实现了显著的EER降低,同时提供细粒度的音素级可解释性用于法医分析。代码和数据可在:https://github.com/JunXue-tech/PVP 获取。

英文摘要

The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
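The per-phoneme GMM idea can be sketched with a one-dimensional toy. Everything concrete here is an assumption made for illustration: the scalar features, the mixture parameters, and the rule of averaging per-frame log-likelihoods are hypothetical, and the released PVP code should be consulted for the actual features and scoring.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a scalar x under a 1-D Gaussian mixture."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        log_pdf = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        comps.append(math.log(w) + log_pdf)
    m = max(comps)
    return m + math.log(sum(math.exp(c - m) for c in comps))  # log-sum-exp

# Hypothetical per-phoneme profiles, as if fitted on bona fide reference speech.
profile = {
    "a": ([0.6, 0.4], [700.0, 760.0], [400.0, 900.0]),
    "i": ([1.0], [300.0], [250.0]),
}

def utterance_score(frames, profile):
    """Average log-likelihood over (phoneme, feature) frames; lower suggests mismatch."""
    scores = [gmm_loglik(x, *profile[ph]) for ph, x in frames]
    return sum(scores) / len(scores)

genuine = [("a", 710.0), ("i", 305.0)]
suspect = [("a", 900.0), ("i", 420.0)]
print(utterance_score(genuine, profile), utterance_score(suspect, profile))
```

Keeping the scores per phoneme, rather than only the average, is what would give the phoneme-level view the abstract describes for forensic analysis.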

2605.17734 2026-05-19 cs.AI

Harnessing LLM Agents with Skill Programs

通过技能程序 harnessing LLM agents

Hongjun Liu, Yifei Ming, Shafiq Joty, Chen Zhao

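AI summary: This paper proposes HASP, a framework that improves LLM agents on complex tasks by turning skills into executable Program Functions (PFs); PFs intervene on failure-prone states to correct the agent's actions, and the modular design applies at inference time, during post-training, and for self-improvement.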

Comments 40 pages, 7 figures

详情
AI中文摘要

为复杂和长周期任务提供可重用技能已成为一种流行且成功的做法。然而,这些经验通常编码为文本指导,缺乏明确的机制来决定何时以及如何介入 agent 循环。为弥合这一差距,我们引入 HASP(通过技能程序 harnessing LLM agents),一种新的框架,将技能升级为可执行程序函数(PFs)。与被动建议不同,PFs 作为可执行的护栏,在易出错的状态下激活,并修改下一步行动或注入修正上下文。HASP 高度模块化:可以在推理时直接介入 agent 循环,训练后提供结构化监督,或通过进化验证的教师评审 PFs 实现自改进。实证上,HASP 在网页搜索、数学推理和编码任务中相比训练自由和训练方法取得了显著提升。例如,在网页搜索推理中,推理时的 PFs 使平均表现比(多循环)ReAct Agent 提高 25%,而训练后和受控进化则比 Search-R1 提高 30.4%。为了深入理解 HASP,我们的机制分析揭示了 PFs 如何触发和介入,技能如何内化,以及稳定技能库进化的必要性。

英文摘要

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge this gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains over both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to the (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and what stable skill library evolution requires.
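The "trigger on a failure-prone state, then modify the next action" pattern can be sketched as a pair of callables. The class name, state layout, and the repeated-query rule below are hypothetical and are not HASP's actual interface; the sketch only shows how an executable guardrail differs from advisory text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, List[str]]

@dataclass
class ProgramFunction:
    name: str
    trigger: Callable[[State], bool]        # is the agent in a failure-prone state?
    intervene: Callable[[State, str], str]  # returns the action to run instead

def repeated_query(state: State) -> bool:
    """Fires when the last two search queries are identical."""
    queries = state["queries"]
    return len(queries) >= 2 and queries[-1] == queries[-2]

def force_rephrase(state: State, proposed: str) -> str:
    return "rephrase_query"

def apply_pfs(state: State, proposed: str, pfs: List[ProgramFunction]) -> str:
    """Return the first triggered PF's replacement action, else the proposal."""
    for pf in pfs:
        if pf.trigger(state):
            return pf.intervene(state, proposed)
    return proposed

pfs = [ProgramFunction("no_repeat_search", repeated_query, force_rephrase)]
stuck = apply_pfs({"queries": ["capital of X", "capital of X"]}, "search", pfs)
fine = apply_pfs({"queries": ["capital of X", "population of X"]}, "search", pfs)
print(stuck, fine)
```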

2605.17733 2026-05-19 cs.AI cs.LG

Divergence-Suppressing Couplings for Rectified Flow

修正流的发散抑制耦合

Yimeng Min, Carla P. Gomes

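AI summary: This paper proposes divergence-suppressing couplings for Rectified Flow, which attenuate the divergent component of the learned velocity field during coupling generation, reducing trajectory distortion and improving generation quality.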

详情
AI中文摘要

修正流的潜力在于生成自我生成的耦合,其轨迹是直的或几乎如此。在实践中,基础流模型生成的轨迹可能会弯曲和交织,导致耦合继承这种扭曲。本文指出,这种轨迹交织通常与学习到的速度场中非零发散区域相关,其中局部扩张或收缩会扭曲轨迹并推动粒子远离理想终点。我们随后提出了一种修正流的发散抑制耦合,这是一种离线修正,可减小耦合生成过程中学习到的速度场的发散成分。该修正仅在每次耦合对生成时支付一次,且在训练过程中被摊销,因此部署运行的时钟时间成本与标准修正流相同。实验证明,这种离线修改在2D合成基准和图像生成任务上都带来了稳定改进。

英文摘要

The promise of Rectified Flow rests on producing self-generated couplings whose trajectories are straight, or nearly so. In practice, trajectories generated by the base flow model can bend and intertwine, and the resulting coupling inherits this distortion. In this paper, we observe that such trajectory entanglement is often associated with regions of nonzero divergence in the learned velocity field, where local expansion or contraction distorts trajectories and steers particles away from their ideal endpoints. We then propose divergence-suppressing couplings for Rectified Flow, an offline correction that attenuates the divergent component of the learned velocity during coupling generation. The correction is paid only once per coupling pair and amortized over training, so deployment runs plain Euler at identical wall-clock cost to standard Rectified Flow. Empirically, this offline modification yields consistent improvements on 2D synthetic benchmarks and on image generation.
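Two quantities in this abstract are easy to make concrete: plain Euler integration of a velocity field, and the divergence of that field. The sketch below estimates divergence with central finite differences on two hand-written 2D fields. It does not implement the paper's suppression step, whose form the abstract does not give; the fields and step sizes are assumptions for illustration.

```python
def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with plain Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def divergence(velocity, x, t, h=1e-4):
    """Central finite-difference estimate of sum_i d v_i / d x_i at (x, t)."""
    div = 0.0
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        div += (velocity(plus, t)[i] - velocity(minus, t)[i]) / (2 * h)
    return div

def rotation_field(x, t):
    return [-x[1], x[0]]      # pure rotation: divergence is 0

def expansion_field(x, t):
    return [x[0], x[1]]       # uniform expansion: divergence is 2

def constant_field(x, t):
    return [1.0, 0.0]         # straight-line transport

div_rot = divergence(rotation_field, [1.0, 2.0], 0.0)
div_exp = divergence(expansion_field, [1.0, 2.0], 0.0)
end = euler_sample(constant_field, [0.0, 0.0])
print(div_rot, div_exp, end)
```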

2605.17729 2026-05-19 cs.CV cs.AI cs.LG

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

领域增量学习用于疫情 resilient 胸部X光分析

Danu Kim

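AI summary: This paper proposes a replay-based domain-incremental continual learning method that keeps pneumonia detection robust and consistent under domain shift, using class-aware balanced replay for balanced class representation and a class-aware loss for dynamic reweighting; on a domain-shifted PneumoniaMNIST dataset it reaches 88.66% average accuracy, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines.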

Comments Published in Korea Software Congress (2025)

详情
AI中文摘要

深度学习模型在肺炎检测中实现了高准确性,但其在临床领域中的泛化能力受限于成像设备、获取协议和机构条件的差异。本研究引入了一种基于回放的领域增量持续学习方法,旨在使模型能够持续适应跨领域变化而不发生灾难性遗忘。所提出的方法结合了类感知平衡回放以在受限内存中保持平衡的类表示,以及类感知损失以在训练过程中动态重新加权类不平衡。在包含五个模拟领域的领域偏移PneumoniaMNIST数据集上进行的实验表明,所提出的方法实现了88.66%的平均准确率,优于经验回放、微调和联合训练基线。这些发现突显了所提出方法在跨临床环境变化中实现稳健和一致肺炎检测的有效性。

英文摘要

Deep learning models have achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning method designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across variations in clinical environments.
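A class-balanced replay memory of the kind described can be sketched as one fixed-size bucket per class. The equal per-class quota and the use of reservoir sampling inside each bucket are assumptions made here; the abstract does not state the paper's exact insertion or eviction policy.

```python
import random
from collections import defaultdict

class ClassBalancedReplay:
    """Fixed-capacity memory with an equal quota of slots per class."""

    def __init__(self, capacity, num_classes, seed=0):
        self.per_class = capacity // num_classes
        self.slots = defaultdict(list)   # label -> stored samples
        self.seen = defaultdict(int)     # label -> samples observed so far
        self.rng = random.Random(seed)

    def add(self, sample, label):
        self.seen[label] += 1
        bucket = self.slots[label]
        if len(bucket) < self.per_class:
            bucket.append(sample)
        else:
            # Reservoir sampling: each seen sample of this class is kept
            # with equal probability.
            j = self.rng.randrange(self.seen[label])
            if j < self.per_class:
                bucket[j] = sample

    def sample(self, k):
        pool = [(s, y) for y, bucket in self.slots.items() for s in bucket]
        return self.rng.sample(pool, min(k, len(pool)))

buf = ClassBalancedReplay(capacity=10, num_classes=2)
for i in range(100):
    buf.add(("normal", i), 0)       # majority class
for i in range(5):
    buf.add(("pneumonia", i), 1)    # minority class
batch = buf.sample(4)
print(len(buf.slots[0]), len(buf.slots[1]), len(batch))
```

Even though the stream is 20:1 imbalanced, the memory ends up holding the same number of samples per class, which is the property the abstract attributes to balanced replay.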

2605.17727 2026-05-19 cs.CV

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

GraSP-VL: 长度作为视觉-语言表示的语义粒度接口

Zesheng Li, Chengchang Pan, Honggang Qi

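AI summary: This paper studies how to turn embedding length into a controllable semantic access interface and proposes GraSP-VL, which learns a shared near-orthogonal prefix transform to give vision-language embeddings a coarse-to-fine semantic interface, validated on several datasets.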

Comments Preprint

详情
AI中文摘要

冻结的视觉-语言嵌入包含从物体身份到属性、关系和完整描述意义的多级语义信号,但这些信号通过固定长度的向量接口暴露。我们研究是否可以将嵌入长度转化为可控的语义访问接口。我们提出了GraSP-VL,它在冻结VLM嵌入上学习了一个共享的近正交前缀变换。GraSP-VL实现了语义马特罗什卡接口:短前缀被分配粗粒度的语义角色,而更长的前缀逐步暴露更细粒度的语言基础区分。由于变换在图像和文本嵌入之间共享,并且保持了全维度几何,前缀行为的变化不会改写原始VLM空间。在包含20,147个示例的COCO/Flickr30K注释池上,GraSP-VL达到了阶梯评分53.01和难负样本选择性89.76,同时保持全空间漂移低于10^-6。它还转移到SugarCrepe-clean数据集,达到86.03的对象准确率和11.96的平均外部涌现,并保持全维度零样本CIFAR-100准确率。这些结果表明,冻结的VLM嵌入可以重新组织为可截断的语义前缀接口,而不是仅仅压缩。

英文摘要

Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below $10^{-6}$. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.
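The claim that a shared orthogonal transform changes prefix behavior without changing full-dimensional geometry can be checked on a two-dimensional toy. The sketch uses an exact rotation and hand-picked vectors, both assumptions; GraSP-VL learns a near-orthogonal transform in a much higher dimension.

```python
import math

def rotate(v, theta):
    """Apply a 2-D rotation, the simplest orthogonal transform."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def dot(a, b, k=None):
    """Dot product over the first k coordinates (all of them if k is None)."""
    return sum(x * y for x, y in zip(a[:k], b[:k]))

img = [1.0, 0.0]    # hypothetical image embedding
txt = [0.6, 0.8]    # hypothetical text embedding
theta = 0.7         # the same transform is applied to both modalities

img_r, txt_r = rotate(img, theta), rotate(txt, theta)

full_before, full_after = dot(img, txt), dot(img_r, txt_r)
prefix_before, prefix_after = dot(img, txt, k=1), dot(img_r, txt_r, k=1)
print(full_before, full_after, prefix_before, prefix_after)
```

The full-length similarity is unchanged by the shared rotation, while the length-1 prefix similarity differs, so which information lands in the leading coordinates is a free design choice.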

2605.17721 2026-05-19 cs.AI

EXG: Self-Evolving Agents with Experience Graphs

EXG: 基于经验图的自演化代理

Yuxin Jin, Siyuan Zhang, Hanchen Wang, Lu Qin, Ying Zhang, Wenjie Zhang

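AI summary: This paper proposes EXG, an experience-graph framework for self-evolving agents that organizes accumulated successes and failures into a structured representation, improving solution quality and resource efficiency on complex tasks.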

详情
AI中文摘要

基于大型语言模型(LLM)的代理在复杂推理和问题解决中表现出强大的能力,但大多数部署的代理行为静态,执行过程中获得的知识难以随时间系统性改进。为此,越来越多的研究探索如何在部署过程中通过经验使代理改进,但现有方法要么依赖于单一任务的随意反思,要么采用无结构的记忆积累碎片化经验。为了解决这一限制,我们引入EXG,一种经验图框架,用于自演化代理,明确将积累的成功与失败组织成结构化、关系化的表示。EXG是首个为自演化代理设计的经验图,支持在执行过程中实时增长图以实现跨任务经验重用,以及离线重用整合的经验图作为外部记忆模块。这种设计也使EXG能够作为可插拔组件为现有自演化代理服务,将先前经验组织成统一的经验图,并在部署过程中提高解决方案质量和资源效率。在代码生成和推理基准上的广泛实验表明,EXG在在线和离线评估中均优于基于反思和记忆的基线,在性能-效率权衡上表现更优。我们的结果表明,将经验结构化为图提供了一个原理性基础,以实现可扩展且可迁移的自演化代理行为。

英文摘要

Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.

2605.17719 2026-05-19 cs.CV

Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation

Patch-MoE Mamba: 一种用于医学图像分割的基于补丁顺序的专家混合状态空间架构

Diego Adame, Fabian Vazquez, Jose A. Nunez, Huimin Li, Jinghao Yang, Erik Enriquez, DongChul Kim, Haoteng Tang, Bin Fu, Pengfei Gu

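AI summary: This paper proposes Patch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture that addresses two limitations of existing Mamba segmentation models: pixel-wise directional scanning disrupts local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt to diverse object sizes, shapes, and boundaries.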

详情
AI中文摘要

基于CNN和Transformer的架构在医学图像分割中已取得优异性能,但CNN在建模长距离依赖性方面存在限制,而Transformer则常面临二次计算和内存复杂度的问题。状态空间模型,尤其是基于Mamba的网络,提供了一种高效的替代方案,具有线性序列复杂度。然而,现有的Mamba分割模型仍面临两个限制:像素级方向扫描会破坏局部二维空间结构,而简单的求和融合方向无法适应多样化的物体大小、形状和边界。为了解决这些问题,我们提出了Patch-MoE Mamba,一种用于医学图像分割的基于补丁顺序的专家混合状态空间架构。它引入了一种分层的补丁顺序扫描机制,能够在保留局部空间邻域的同时捕捉多尺度上下文,并引入了基于MoE的方向融合模块,通过四个方向专家、一个可学习的连接专家和残差方向聚合,自适应地结合多个Mamba扫描器输出。在五个公开的息肉分割基准和ISIC 2017/2018皮肤病变分割数据集上的实验表明了Patch-MoE Mamba的有效性和通用性。

英文摘要

CNN- and Transformer-based architectures have achieved strong performance in medical image segmentation, but CNNs are limited in modeling long-range dependencies, while Transformers often suffer from quadratic computational and memory complexity. State space models, especially Mamba-based networks, offer an efficient alternative with linear sequence complexity. However, existing Mamba segmentation models still face two limitations: pixel-wise directional scanning can disrupt local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt well to diverse object sizes, shapes, and boundaries. To address these issues, we propose Patch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism that preserves local spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs using four directional experts, a learnable concatenation expert, and residual directional aggregation. Experiments on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of Patch-MoE Mamba.
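The difference between pixel-wise raster scanning and a patch-ordered scan can be shown with index orderings alone. The single-level scheme below, with raster order both across patches and within each patch, is an assumption for illustration; the paper's hierarchical mechanism is not specified in the abstract.

```python
def raster_order(height, width):
    """Row-major pixel order: neighbors in a column end up `width` steps apart."""
    return [(r, c) for r in range(height) for c in range(width)]

def patch_order(height, width, patch):
    """Visit patch x patch blocks in raster order, and pixels inside each block
    in raster order, so each local neighborhood stays contiguous in the sequence."""
    order = []
    for pr in range(0, height, patch):
        for pc in range(0, width, patch):
            for r in range(pr, min(pr + patch, height)):
                for c in range(pc, min(pc + patch, width)):
                    order.append((r, c))
    return order

pixels = raster_order(4, 4)
patched = patch_order(4, 4, 2)
print(pixels[:4])
print(patched[:4])
```

In the raster sequence the first four positions all come from row 0, whereas the patch-ordered sequence starts with the complete top-left 2x2 block.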

2605.17714 2026-05-19 cs.CL

From Documents to Segments: A Contextual Reformulation for Topic Assignment

从文档到段落:一种用于主题分配的上下文重述

Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi

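AI summary: This paper proposes segment-based topic allocation (SBTA), which assigns topics to short, coherent text segments rather than whole documents, addressing the topic contamination that multi-topic documents cause in traditional topic models and yielding cleaner, more interpretable topics.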

Comments Findings of ACL 2026

详情
AI中文摘要

传统的主题建模方法为每个文档分配一个单一主题。然而,在实践中,许多现实世界文档,如产品评论或开放式调查回答,包含多个不同的主题。这种不匹配常常导致主题污染,即不相关主题被合并到一个主题中,使得难以识别真正专注于特定主题的文档。我们通过引入基于段落的主题分配(SBTA),一种对主题建模的重述方法,将主题分配给段落:短小、连贯的文本片段,每个片段表达一个单一主题。通过在段落层面建模主题结构,我们的方法产生更清晰和可解释的主题,并更好地支持多主题文档的分析。为了支持系统评估,我们构建了一个SemEval-STM数据集,灵感来自基于方面的情感分析。文档首先通过大型语言模型(LLMs)分解为基于主题的段落,随后通过人工校验确保段落质量。我们还提出了一种基于段落的词入侵任务扩展,使人类能够在主题实际分配的粒度上评估主题连贯性。在多个模型和评估指标上,我们证明SBTA提高了聚类质量和可解释性。总体而言,这项工作提供了一个实用、可扩展的框架,用于异构文本语料库中细粒度的主题分析,其中文档自然涵盖多个主题。

英文摘要

Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct SemEval-STM, a new dataset inspired by aspect-based sentiment analysis. Documents are first decomposed into topical segments using large language models (LLMs), followed by human refinement to ensure segment quality. We also propose a segment-level extension of the word intrusion task, enabling human evaluation of topical coherence at the granularity where topics are actually assigned. Across multiple models and evaluation metrics, we show that SBTA improves clustering quality and interpretability. Overall, this work provides a practical, scalable framework for fine-grained topic analysis in heterogeneous text corpora where documents naturally span multiple topics. The dataset is available at https://huggingface.co/datasets/LG-AI-Research/SemEval-STM

2605.17710 2026-05-19 cs.CL eess.AS

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

Sometin Beta Pass Notin (SBPN): 通过知识蒸馏改进尼日利亚语言的多语言语音识别

Sewade Ogun

AI总结 本文提出SBPN模型,通过两阶段知识蒸馏方法提升尼日利亚多种语言的语音识别性能,显著降低词错误率并优于现有多语言模型。

Comments 25 pages

详情
AI中文摘要

尽管现代多语言自动语音识别(ASR)系统支持多种尼日利亚语言,但其性能始终落后于英语和法语等高资源语言。尼日利亚语言存在独特的建模挑战,包括严重的数据稀缺、不一致的正字法、声调符号、多样化的口音、频繁的语码转换和本地化专有名词。为解决这些挑战,我们开发了一个多语言ASR框架,采用两阶段蒸馏过程。首先,我们利用学生-教师知识蒸馏从现有单语言模型中学习,并以稳健的语言特定N-gram语言模型为条件。其次,我们使用伪标签数据进行迭代自我改进以进一步提高准确性。我们的方法显著缩小了性能差距,相对于单语言基线,平均实现了29%的相对词错误率(WER)降低。我们的模型在主要基准上也优于现有最先进的多语言模型,包括Common Voice和Fleurs。我们引入Sometin Beta Pass Notin(SBPN),一个基础性的多语言ASR模型,覆盖约鲁巴语、豪萨语、伊博语、尼日利亚皮钦语和尼日利亚英语。SBPN以两种规模发布:SBPN-Base(120M参数)和SBPN-Large(600M参数)。通过将这些模型作为开放基础模型发布,我们旨在为进一步研究该地区丰富的语音和文化景观提供ASR资源。

英文摘要

Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present unique modelling hurdles, including acute data scarcity, inconsistent orthography, tonal diacritics, diverse accents, frequent code-switching, and localized named entities. To address these challenges, we developed a multilingual ASR framework utilizing a two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self-improvement using pseudo-labelled data to further refine accuracy. Our method significantly bridges the performance gap, achieving on average a relative Word Error Rate (WER) reduction of 29% over monolingual baselines. Our models also outperform state-of-the-art multilingual models across major benchmarks, including Common Voice and Fleurs. We introduce Sometin Beta Pass Notin (SBPN), a foundational multilingual ASR model covering Yorùbá, Hausa, Igbo, Nigerian Pidgin, and Nigerian English. SBPN is released in two sizes: SBPN-Base (120M parameters) and SBPN-Large (600M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region.