arXivDaily: arXiv daily academic digest, updated Monday to Friday

2510.12773 2026-05-20 cs.CL cs.AI cs.LG

Dr.LLM: Dynamic Layer Routing in LLMs

Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh

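AI summary This paper proposes Dr.LLM, a framework that adds lightweight per-layer routers to a pretrained model for dynamic layer routing. The routers are trained with explicit supervision and the base weights are left unchanged, improving both the compute efficiency and the accuracy of inference.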

Comments Published at ICLR 2026

Abstract

Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr. LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr. LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr. LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights. Code is available at https://github.com/parameterlab/dr-llm.
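The skip, execute, or repeat decision described in the abstract can be illustrated with a minimal Python sketch. It assumes each layer is a plain callable and that the per-layer decisions have already been produced by some router; the names `routed_forward`, `focal_loss`, and the decision constants are hypothetical and are not taken from the Dr.LLM code base. The loss uses the standard focal-loss form, -alpha * (1 - p)^gamma * log(p), which down-weights examples the classifier already gets right.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical decision codes for one block (illustrative names only).
SKIP, EXECUTE, REPEAT = 0, 1, 2


def routed_forward(
    x: float,
    layers: Sequence[Callable[[float], float]],
    decisions: Sequence[int],
) -> Tuple[float, int]:
    """Apply each layer zero, one, or two times according to its decision.

    Returns the final activation and the number of layer executions,
    which is the quantity a compute budget would constrain.
    """
    if len(layers) != len(decisions):
        raise ValueError("need exactly one decision per layer")
    executed = 0
    for layer, decision in zip(layers, decisions):
        if decision == SKIP:
            continue  # leave the activation untouched
        x = layer(x)
        executed += 1
        if decision == REPEAT:
            x = layer(x)  # run the same block a second time
            executed += 1
    return x, executed


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Standard focal loss for the probability assigned to the true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# Toy "layers" standing in for transformer blocks.
toy_layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
output, n_executed = routed_forward(1, toy_layers, [EXECUTE, SKIP, REPEAT])
```

In this toy run the second layer is skipped and the third is applied twice, so three layer executions are spent instead of the three a plain forward pass would use on different layers; a real router would predict the decisions from hidden states rather than receive them as input.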

2510.11344 2026-05-20 cs.CV

MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression

Hai Dang Nguyen, Nguyen Dang Huy Pham, The Minh Duc Nguyen, Dac Thai Nguyen, Hang Thi Nguyen, Duong M. Nguyen

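AI summary This paper proposes the MMAP architecture, which uses multi-magnification and prototype-enhanced methods to address insufficient granularity of local features and inadequate coverage of global spatial context in spatial gene expression prediction. Experiments show that it outperforms existing state-of-the-art methods on multiple evaluation metrics.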

Comments Received Best Paper Award at the 2025 Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025)

Abstract

Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).
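The three evaluation metrics named in the abstract have standard definitions, sketched below in plain Python. The function names and the toy vectors are illustrative assumptions; they are not the MMAP evaluation code, and the abstract does not say whether metrics are averaged per gene or per spot.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average of |y - y_hat|."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Squared Error: average of (y - y_hat)^2."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson Correlation Coefficient between targets and predictions."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    norm_t = math.sqrt(sum((a - mean_t) ** 2 for a in y_true))
    norm_p = math.sqrt(sum((b - mean_p) ** 2 for b in y_pred))
    if norm_t == 0 or norm_p == 0:
        raise ValueError("PCC is undefined for a constant vector")
    return cov / (norm_t * norm_p)


# Made-up expression values for one gene across three spots.
measured = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
```

The toy prediction is perfectly correlated with the measurement yet has non-zero MAE and MSE, which is why correlation and error metrics are usually reported together.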

2510.09872 2026-05-20 cs.LG cs.AI

WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions

Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi

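AI summary This paper proposes WARC-Bench, a web-archive-based benchmark for GUI subtask execution that evaluates multimodal AI agents on 438 subtasks. Experiments show that SFT and RLVR both achieve notable gains in subtask execution.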

Abstract

Training web agents to navigate complex, real-world websites requires them to master subtasks: short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.

2510.09174 2026-05-20 cs.LG

Robustness and Regularization in Hierarchical Re-Basin

Benedikt Franke, Florian Heinrich, Markus Lange, Arne Raulf

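AI summary This paper studies robustness and regularization in model merging with Git Re-Basin. It proposes a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm and finds that Re-Basin induces adversarial and perturbation robustness in the merged models, although the experiments show a larger performance drop than the original authors reported.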

Comments Published in the 32nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2024

Abstract

This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.

2510.08986 2026-05-20 cs.CL cs.CE cs.CY

CAPC-CG: A Large-Scale, Expert-Directed LLM-Annotated Corpus of Adaptive Policy Communication in China

Bolun Sun, Charles Chang, Yuen Yuen Ang, Ruotong Mu, Yuchen Xu, Zhengxin Zhang, Pingxu Hao

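AI summary This paper introduces the CAPC-CG corpus, the first open annotated corpus of Chinese policy directives. Building on Ang's theory of adaptive policy communication, it labels clear and ambiguous language categories with a five-color taxonomy and aims to support downstream tasks and multilingual NLP research.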

Comments Accepted for publication in the Proceedings of ACL Main 2026

Abstract

We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
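The abstract reports inter-annotator agreement as Fleiss's kappa. As a reminder of what that statistic measures, here is a minimal Python sketch of the standard formula, kappa = (P_bar - P_e) / (1 - P_e), where P_bar is the mean per-item agreement and P_e is the agreement expected by chance. The function name and the toy rating tables are illustrative assumptions and are not data from the corpus.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss's kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy tables: 2 items, 2 categories, 3 raters per item.
unanimous = [[3, 0], [0, 3]]
split = [[2, 1], [1, 2]]
```

Unanimous ratings give a kappa of 1, while the split table scores below zero because its observed agreement is lower than what chance alone would produce.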

2510.07538 2026-05-20 cs.CV

Low-Compute Watermark Removal via Dual-Domain Natural Projection

Pragati Shuddhodhan Meshram, Varun Chandrasekaran

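AI summary This paper proposes DAWN, a lightweight, training-free attack that projects watermarked images in complementary frequency and semantic spaces to remove watermarks at low computational cost while preserving structural and semantic integrity.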

Abstract

Effective removal of semantic watermarks requires balancing three competing objectives: high removal success, low perceptual distortion, and low computational cost. However, existing single-image attacks typically optimize only for the first two, achieving strong watermark suppression but relying on expensive, multi-step optimization that limits practical deployment. In this work, we show that this trade-off is fundamental: no current approach achieves all three properties simultaneously. We introduce DAWN, a lightweight, training-free attack that explicitly targets the low-cost regime while maintaining competitive removal performance. DAWN works by projecting a watermarked image onto natural-image priors in complementary frequency and semantic spaces, suppressing watermark signals that deviate from natural statistics, and then applying a decoupled perceptual-alignment step to restore visual consistency with minimal artifacts. Across diverse pixel-, frequency-, and latent-space watermarking schemes, DAWN consistently reduces detectability while preserving structural and semantic fidelity, demonstrating that efficient, low-resource watermark removal is feasible with only modest perceptual degradation. Our code is available at https://github.com/Pragati-Meshram/DAWN.

2510.05746 2026-05-20 cs.AI cs.CL cs.LG

ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems

Bohan Yao, Shiva Krishna Reddy Malay, Vikas Yadav

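AI summary This paper proposes a new paradigm for automatic multi-agent system design that optimizes chain-of-thought (CoT) reasoning to discover the Agentic Reasoning Module (ARM). The module is evolved through a tree search over the code space, guided by reflection on execution traces, improving the generalization of multi-agent systems.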

Comments 29 pages, 2 figures

Abstract

Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved state-of-the-art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re-discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta-orchestrator. Our approach significantly outperforms both manually designed MASes and state-of-the-art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.

2510.05431 2026-05-20 cs.CL

Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Yongmin Yoo, Xu Zhang, Longbing Cao

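AI summary This paper proposes Self-Filtered Distillation, which treats LLM-generated rationales as trust indicators rather than ground-truth labels to make patent classification more reliable. Experiments on the USPTO-2M dataset show up to 38.7% relative improvement in Macro-F1.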

Abstract

Organizing large-scale patent corpora according to classification schemes is a core information management task that determines the accuracy and efficiency of prior art retrieval, technology knowledge discovery, and intellectual property decision-making. Recent approaches distill natural language rationales generated by large language models (LLMs) into compact student models, yet logical errors, label mismatches, and taxonomy misalignments inherent in these rationales are indiscriminately absorbed during training, undermining classification reliability and propagating errors throughout downstream information processes. Rather than correcting such errors post-hoc, we propose Self-Filtered Distillation (SFD), which embeds quality assurance directly into the learning process by reinterpreting LLM-generated rationales as trust indicators rather than ground-truth supervision. SFD integrates three unsupervised signals into a unified trust score that dynamically modulates each training instance's contribution: Self-Consistency, which quantifies agreement among independently generated rationales; Class Entailment Alignment, which evaluates semantic coherence between a rationale and its assigned CPC class definition; and LLM Agreement Scoring, which assesses external plausibility through an independent verifier. On the USPTO-2M benchmark comprising over two million patents, SFD achieves up to 38.7% relative improvement in Macro-F1 across four student architectures, and the strong correlation between trust scores and expert judgments ($r = 0.685$) confirms that the framework provides not only accurate predictions but also decomposable confidence semantics that enable auditable and self-documenting classification outcomes for large-scale patent knowledge organization.
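The idea of a trust score that modulates each training instance's contribution can be sketched as follows. The abstract does not say how the three signals are combined, so this sketch assumes a simple weighted average of signals scaled to [0, 1] and a trust-weighted mean of per-instance losses; the function names, the equal default weights, and the toy numbers are all hypothetical.

```python
from typing import Sequence


def trust_score(
    self_consistency: float,
    entailment_alignment: float,
    llm_agreement: float,
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Combine three signals in [0, 1] into one score (assumed: weighted mean)."""
    signals = (self_consistency, entailment_alignment, llm_agreement)
    if any(not 0.0 <= s <= 1.0 for s in signals):
        raise ValueError("signals are assumed to be scaled to [0, 1]")
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)


def trust_weighted_loss(losses: Sequence[float], trusts: Sequence[float]) -> float:
    """Average per-instance losses, weighting each instance by its trust score."""
    total = sum(trusts)
    if total == 0:
        raise ValueError("at least one instance needs non-zero trust")
    return sum(t * l for t, l in zip(trusts, losses)) / total


# Toy batch: the second rationale is fully distrusted, so its loss is ignored.
batch_losses = [1.0, 3.0]
filtered = trust_weighted_loss(batch_losses, [1.0, 0.0])
unfiltered = trust_weighted_loss(batch_losses, [0.5, 0.5])
```

With equal trust the two losses are averaged as usual, while a zero trust score removes the unreliable rationale from the gradient signal altogether, which is the filtering behavior the abstract describes in general terms.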

2510.03824 2026-05-20 cs.LG cs.AI stat.ML

Proximal Diffusion Neural Sampler

Wei Guo, Jaemoo Choi, Yuchen Zhu, Molei Tao, Yongxin Chen

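AI summary This paper proposes the Proximal Diffusion Neural Sampler (PDNS), a framework that applies the proximal point method on the space of path measures to address multimodal target distributions and mode collapse in neural sampler training. It approaches the target distribution step by step through a sequence of simpler subproblems, promoting thorough exploration of the modes.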

Comments Accepted at ICLR 2026 (https://openreview.net/forum?id=XTHQqS7ObC)

Abstract

The task of learning a diffusion-based neural sampler for drawing samples from an unnormalized target distribution can be viewed as a stochastic optimal control problem on path measures. However, the training of neural samplers can be challenging when the target distribution is multimodal with significant barriers separating the modes, potentially leading to mode collapse. We propose a framework named Proximal Diffusion Neural Sampler (PDNS) that addresses these challenges by tackling the stochastic optimal control problem via the proximal point method on the space of path measures. PDNS decomposes the learning process into a series of simpler subproblems that create a path gradually approaching the desired distribution. This staged procedure traces a progressively refined path to the desired distribution and promotes thorough exploration across modes. For a practical and efficient realization, we instantiate each proximal step with a proximal weighted denoising cross-entropy (WDCE) objective. We demonstrate the effectiveness and robustness of PDNS through extensive experiments on both continuous and discrete sampling tasks, including challenging scenarios in molecular dynamics and statistical physics. Our code is available at https://github.com/AlexandreGUO2001/PDNS.
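To see why a proximal point scheme produces a path that gradually approaches the target, consider the following worked step. It is only an illustration under assumptions the abstract does not state: the objective is taken to be the KL divergence to the target path measure $P^*$, the proximity term is also a KL divergence, and $\eta_k > 0$ is a hypothetical step size.

```latex
% Generic proximal point step on path measures (assumed KL proximity term):
P_{k+1} = \arg\min_{P} \left\{ \mathrm{KL}(P \,\|\, P^{*})
          + \frac{1}{\eta_k}\, \mathrm{KL}(P \,\|\, P_k) \right\}

% Setting the first variation to zero under the normalization constraint gives
\left(1 + \tfrac{1}{\eta_k}\right) \log \frac{dP_{k+1}}{dR}
  = \log \frac{dP^{*}}{dR} + \tfrac{1}{\eta_k} \log \frac{dP_k}{dR} + \text{const},

% so the update is a geometric interpolation between P_k and P^{*}:
\frac{dP_{k+1}}{dR} \;\propto\;
  \left(\frac{dP_k}{dR}\right)^{\frac{1}{1+\eta_k}}
  \left(\frac{dP^{*}}{dR}\right)^{\frac{\eta_k}{1+\eta_k}}
```

Here $R$ is any common reference measure. Under these assumptions a small $\eta_k$ keeps each iterate close to the previous one and a large $\eta_k$ jumps nearly all the way to $P^*$, which matches the abstract's description of a sequence of simpler subproblems forming a progressively refined path.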

2510.03485 2026-05-20 cs.AI

Learning Efficient Guardrails for Compliance

Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, Muhao Chen

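AI summary This paper proposes the PolicyGuardBench benchmark, which evaluates compliance using 60k policy-trajectory pairs, and trains the lightweight PolicyGuard model, which achieves high accuracy with efficient inference, showing that accurate and generalizable compliance guardrails are feasible at small scale.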

Comments 16 pages, 5 figures. Accepted by ICML 2026

Abstract

Autonomous web agents are increasingly deployed for long-horizon tasks, yet their ability to adhere to real-world policies remains critically underexplored compared to standard safety objectives. To address this gap, we introduce PolicyGuardBench, a benchmark of 60k policy-trajectory pairs designed to evaluate compliance through both full-trajectory and novel prefix-based violation detection tasks. Using this dataset, we train PolicyGuard, a lightweight guardrail model that achieves strong detection accuracy while maintaining high inference efficiency. Notably, our model demonstrates robust generalization capabilities, preserving high performance even on unseen domains. These contributions establish a comprehensive framework for studying policy compliance, showing that accurate and generalizable guardrails are feasible at small scales.

2510.01499 2026-05-20 cs.LG cs.AI cs.GT

Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information

Rui Ai, Yuqi Pan, David Simchi-Levi, Milind Tambe, Haifeng Xu

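AI summary This paper proposes two algorithms, Optimal Weight and Inverse Surprising Popularity, which combine first-order and second-order information to mitigate the limitations of majority voting and improve the reliability of multi-agent LLM aggregation.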

Comments Accepted into ICML 2026

Abstract

With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN. Our algorithms consistently outperform standard baselines, establishing a robust, training-free framework for effective multi-agent LLM aggregation.
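The limitation of treating all answers equally can be shown with a small Python sketch. It is not the paper's OW or ISP algorithm, whose details the abstract does not give; instead it assumes independent models with known accuracies and uses the classical log-odds weight log(p / (1 - p)) for each vote. The function names and the accuracy values are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer, counting every model equally."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers: Sequence[str], accuracies: Sequence[float]) -> str:
    """Return the answer with the largest total log-odds weight.

    Assumes each model answers independently and that its accuracy p,
    with 0 < p < 1, is known in advance.
    """
    scores = defaultdict(float)
    for answer, p in zip(answers, accuracies):
        if not 0.0 < p < 1.0:
            raise ValueError("accuracies must lie strictly between 0 and 1")
        scores[answer] += math.log(p / (1.0 - p))
    return max(scores, key=scores.get)


# One strong model disagrees with two weak ones.
model_answers = ["A", "B", "B"]
model_accuracies = [0.9, 0.6, 0.6]
plain = majority_vote(model_answers)
weighted = weighted_vote(model_answers, model_accuracies)
```

In this toy case the two weak models outvote the strong one under plain majority voting, while the weighted rule sides with the strong model because its single log-odds weight exceeds the combined weight of the other two.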

2510.00660 2026-05-20 cs.CV

Unsupervised Unfolded rPCA (U2-rPCA): Deep Interpretable Clutter Filtering for Ultrasound Microvascular Imaging

Huaying Li, Chuling Ye, Manfei Liao, Xiaobo Qu, Liansheng Wang, Yinran Chen

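AI summary This paper proposes an unsupervised unfolded rPCA (U2-rPCA) method, unfolded from an iteratively reweighted least squares (IRLS) rPCA baseline and combined with a sparse-enhancement unit that improves its ability to capture sparse micro-flow signals, for more effective clutter filtering in ultrasound microvascular imaging.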

Abstract

High-sensitivity clutter filtering is a fundamental step in ultrasound microvascular imaging. Singular value decomposition (SVD) and robust principal component analysis (rPCA) are the main clutter filtering strategies. However, both strategies are limited in feature modeling and in separating tissue from blood flow for high-quality microvascular imaging. Recently, deep learning-based clutter filtering has shown potential for separating tissue and blood flow signals more thoroughly. However, existing supervised filters lack both interpretability and training ground truth. While the interpretability issue can be addressed by algorithm deep unfolding, the lack of training ground truth remains unsolved. This paper proposes an unsupervised unfolded rPCA (U2-rPCA) method that preserves mathematical interpretability and does not rely on training labels. Specifically, U2-rPCA is unfolded from an iteratively reweighted least squares (IRLS) rPCA baseline with intrinsic low-rank and sparse regularization. In addition, a sparse-enhancement unit is plugged into the network to strengthen its capability to capture sparse micro-flow signals. U2-rPCA acts like an adaptive filter that is trained on part of the image sequence and then applied to the following frames. Experimental validations on an in-silico dataset and public in-vivo datasets showed that U2-rPCA outperforms the SVD filter, the rPCA baseline, and another deep learning-based filter. In particular, the proposed method improved the contrast-to-noise ratio (CNR) of the power Doppler image by 1.91 dB to 8.48 dB compared to other methods. Furthermore, the effectiveness of the building modules of U2-rPCA was validated through ablation studies.

2510.00600 2026-05-20 cs.RO cs.AI cs.CV cs.LG

Hybrid Training for Vision-Language-Action Models

Pietro Mazzaglia, Cansu Sancaktar, Markus Peschl, Daniel Dijkman

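AI summary This paper proposes a hybrid training framework that lets vision-language-action models either generate thoughts or predict actions directly at inference time, as needed, improving inference efficiency while retaining the performance gains.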

Comments Published as a conference paper at ICLR 2026

Abstract

Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model's generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.

2509.23108 2026-05-20 cs.AI cs.CL

Artificial Phantasia: Emergent Mental Imagery in Large Language Models

Morgan McCarty, Jorge Morales

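AI summary This study examines whether language alone can drive visual imagery. It finds that large language models outperform humans on a visual imagery task, suggesting a possibly non-pictorial emergent mental imagery and challenging the traditional view in cognitive science.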

Comments 34 pages, 10 figures, 3 tables

Abstract

Can visual imagery be driven solely by language? This idea goes against cognitive science's traditional view that visual mental imagery is only possible through pictorial representations. Large Language Models (LLMs) provide nascent evidence not only that visual mental imagery via propositional-representations is possible, but that it can be more robust than human imagination. We created dozens of novel items for an extension to a classic task which is argued to be solvable exclusively via pictorial representations (i.e., language alone would be insufficient). Subjects were asked to imagine a series of compositional letter and shape transformations and identify the resultant "image". We found that the best LLMs performed significantly better than humans ($n = 100$ human participants, $p < .0001$), indicating the existence of an artificial phantasia, or emergent "visual" mental imagery that may not be pictorial. Furthermore, we tested reasoning models with variable reasoning-token allocation and found that models perform best with longer reasoning chains, demonstrating a linguistic impact on the task -- language alone may be sufficient. We examined three emergent imagery hypotheses: pure propositional imagery, propositional imagery with visio-linguistic priors, or pictorial visual imagery (classical visual imagery). Our study not only presents evidence for a previously unreported emergent cognitive capacity of LLMs, but also reignites debate on the requirement for a pictorial format in mental imagery.

2509.22292 2026-05-20 cs.CV cs.AI

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim

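AI summary This paper proposes SceneSplit, a new black-box jailbreak method that splits a harmful narrative into multiple benign scenes and uses their combination as a constraint to steer the final output, increasing the likelihood of generating a harmful video and showing that the safety mechanisms of current text-to-video models are vulnerable.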

Comments ICLR 2026. Project page at https://velpegor.github.io/SceneSplit/

Abstract

Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories from T2VSafetyBench on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, 78.2% on Veo2, 78.6% on Kling V1.0, and 68.6% on Sora2, significantly outperforming the existing baselines. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.

2509.22258 2026-05-20 cs.CV cs.AI

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

超越分类准确度:Neural-MedBench与更深层次推理基准的需求

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li

AI总结 本文提出Neural-MedBench,一个专门用于测试多模态神经病学推理能力的基准,揭示现有医疗数据集过于强调分类准确度的问题,并通过系统评估发现模型推理失败而非感知误差主导性能下降,强调需要兼顾广度与深度的评估框架。

Comments 23 pages, 12 figures

详情
Journal ref
ICLR'2026
AI中文摘要

近期视觉-语言模型(VLMs)在标准医疗基准上取得了显著进展,但其真正的临床推理能力仍不清楚。现有数据集主要强调分类准确度,导致模型在高风险诊断推理上仍存在不足。我们引入Neural-MedBench,一个紧凑且推理密集的基准,专门用于探测多模态临床推理在神经病学中的极限。Neural-MedBench整合多序列MRI扫描、结构化电子健康记录和临床笔记,并涵盖三大核心任务家族:鉴别诊断、病变识别和推理生成。为确保可靠评估,我们开发了结合LLM评分、临床验证和语义相似度指标的混合评分流程。通过系统评估最先进的VLMs,包括GPT-4o、Claude-4和MedGemma,我们发现其性能相比传统数据集显著下降。错误分析显示,推理失败而非感知误差主导模型不足。我们的发现强调了需要双轴评估框架:以广度为导向的大数据集用于统计泛化,以深度为导向的紧凑基准如Neural-MedBench用于推理保真度。我们发布Neural-MedBench于https://neuromedbench.github.io/作为开放且可扩展的诊断测试床,引导未来基准的扩展,并实现严谨而成本有效的临床可信AI评估。

英文摘要

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

2509.21698 2026-05-20 cs.CL

GRAB: A Risk Taxonomy-Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures

GRAB:一个基于风险分类体系的财务披露无监督主题发现基准

Ying Li, Tiejun Ma

AI总结 本文提出GRAB,一个专门针对财务披露中无监督主题发现的基准测试,通过结合FinBERT词注意力、YAKE关键词信号和基于分类法的短语匹配,生成无需人工标注的句子标签,从而评估无监督主题模型。

Comments 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: NeurIPS 2025 Workshop on Generative AI in Finance

详情
AI中文摘要

在10-K风险披露中的风险分类对于监管和投资至关重要,但目前尚无公开的基准测试评估此类任务的无监督主题模型。我们提出了GRAB,一个专门针对金融领域的基准测试,包含来自8247份文件的161万个句子,并通过结合FinBERT词注意力、YAKE关键词信号和基于分类法的短语匹配生成无需人工标注的句子标签。标签基于一个将193个术语映射到21个细粒度类型(嵌套在五个宏观类别下的类型)的风险分类法;21种类型指导弱监督,而评估则在宏观层面进行。GRAB通过固定的数据集划分和稳健的指标——准确率、宏F1、主题BERTScore以及基于熵的有效主题数,统一了评估。该数据集、标签和代码使经典、基于嵌入、神经网络和混合主题模型在财务披露上的可重复、标准化比较成为可能。

英文摘要

Risk categorization in 10-K risk disclosures matters for oversight and investment, yet no public benchmark evaluates unsupervised topic models for this task. We present GRAB, a finance-specific benchmark with 1.61M sentences from 8,247 filings and span-grounded sentence labels produced without manual annotation by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching. Labels are anchored in a risk taxonomy mapping 193 terms to 21 fine-grained types nested under five macro classes; the 21 types guide weak supervision, while evaluation is reported at the macro level. GRAB unifies evaluation with fixed dataset splits and robust metrics: Accuracy, Macro-F1, Topic BERTScore, and the entropy-based Effective Number of Topics. The dataset, labels, and code enable reproducible, standardized comparison across classical, embedding-based, neural, and hybrid topic models on financial disclosures.
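The abstract names an entropy-based Effective Number of Topics but does not give its formula. The sketch below assumes the common perplexity-style definition, the exponential of the Shannon entropy of the topic-assignment distribution; the function name `effective_number_of_topics` is hypothetical and the benchmark's exact definition may differ.

```python
import math
from collections import Counter


def effective_number_of_topics(assignments):
    """Return exp(Shannon entropy) of the empirical topic distribution.

    Assumption: this is the perplexity-style "effective number" measure,
    not necessarily the exact definition used by GRAB.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("assignments must not be empty")
    # Shannon entropy in nats over the observed topic frequencies.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


if __name__ == "__main__":
    # A balanced model uses all of its topics; a collapsed one does not.
    print(effective_number_of_topics([0, 1, 2, 3]))
    print(effective_number_of_topics([0, 0, 0, 1]))
```

Under this definition a model that spreads sentences evenly over k topics scores k, while a model that collapses most sentences into one topic scores close to 1.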

2509.21196 2026-05-20 cs.LG cs.CV

Differential-Integral Neural Operator for Long-Term Turbulence Forecasting

微分-积分神经算子用于长期湍流预测

Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu

AI总结 本文提出了一种基于物理原理的微分-积分神经算子,通过并行分支学习不同的物理算子,以提高长期湍流预测的稳定性与鲁棒性,从而在2D Kolmogorov流基准测试中实现了更精确的预测。

详情
AI中文摘要

准确预测湍流的长期演变是科学计算中的重大挑战,对气候建模和航空航天工程等应用至关重要。现有的深度学习方法,特别是神经算子,在长期自回归预测中常常失败,导致灾难性误差累积和物理保真度的丧失。这种失败源于它们无法同时捕捉支配湍流动力学的不同数学结构:局部耗散效应和全局非局部相互作用。在本文中,我们提出了微分-积分神经算子(DINO),一种从算子分解的第一性原理出发设计的新框架。DINO通过并行分支显式建模湍流的演变,学习不同的物理算子:一个局部微分算子,由一个受约束的卷积网络实现,该网络可以证明收敛于导数;以及一个全局积分算子,由Transformer架构捕捉,学习数据驱动的全局核。这种基于物理的分解使DINO具有卓越的稳定性和鲁棒性。通过在具有挑战性的2D Kolmogorov流基准测试中的广泛实验,我们证明DINO在长期预测中显著优于最先进的模型。它能够抑制数百个时间步上的误差累积,保持涡量场和能量谱的高保真度,并建立了物理一致、长程湍流预测的新基准。

英文摘要

Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the Differential-Integral Neural Operator (DINO), a novel framework designed from a first-principles approach of operator decomposition. DINO explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows DINO with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that DINO significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecasting.
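The abstract describes a local branch whose convolution converges to a derivative but does not state the constraint it uses. As a rough illustration of the underlying idea only, the sketch below applies a fixed three-tap central-difference stencil (weights summing to zero) and checks that its error shrinks as the grid is refined. The helper names `central_difference` and `max_error` are hypothetical, and a learned, constrained kernel in the paper would replace the fixed stencil.

```python
import math


def central_difference(samples, spacing):
    """Apply the 3-tap stencil [-1, 0, 1] / (2 * spacing) at interior points."""
    return [
        (samples[i + 1] - samples[i - 1]) / (2.0 * spacing)
        for i in range(1, len(samples) - 1)
    ]


def max_error(num_points):
    """Worst-case error of the stencil against d/dx sin(x) = cos(x) on [0, 2*pi]."""
    spacing = 2.0 * math.pi / num_points
    xs = [i * spacing for i in range(num_points + 1)]
    approx = central_difference([math.sin(x) for x in xs], spacing)
    # Interior points only, matching the output of central_difference.
    return max(abs(a - math.cos(x)) for a, x in zip(approx, xs[1:-1]))


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, max_error(n))
```

Halving the grid spacing should cut the error by roughly a factor of four, the second-order behaviour expected of a central difference.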

2509.17428 2026-05-20 cs.CL

QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

QWHA: 量化感知的沃尔什-哈达玛适应用于大型语言模型的参数高效微调

Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim

AI总结 本文提出QWHA方法,通过将基于傅里叶变换的适配器与沃尔什-哈达玛变换结合,改进量化感知参数高效微调,有效减少量化误差并降低计算成本。

Comments 25 pages, 9 figures, 14 tables

详情
Journal ref
ICLR 2026 Poster
AI中文摘要

大型语言模型(LLMs)的高效部署需求推动了量化(减少推理成本)和参数高效微调(PEFT,降低训练开销)的研究。这进一步推动了量化感知PEFT的发展,以生成准确且高效的量化模型。在该设定中,在微调前减少量化误差对实现高模型精度至关重要。然而,现有依赖低秩适应的方法存在表示能力有限的问题。最近基于傅里叶相关变换(FT)的适配器具有比低秩适配器更强的表示能力,但将其直接整合到量化模型中往往导致误差减少效果不佳和计算开销增加。为克服这些限制,我们提出了QWHA,该方法使用沃尔什-哈达玛变换(WHT)作为变换核,并结合新的适配器初始化方案(包括自适应参数选择和值细化),将基于FT的适配器整合到量化模型中。我们证明QWHA有效减轻量化误差并促进微调,其设计显著降低了计算成本。实验结果表明,QWHA在低比特量化精度上始终优于基线,并相对现有基于FT的适配器实现了显著的训练加速。代码可在https://github.com/vantaa89/qwha获取。

英文摘要

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
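For readers unfamiliar with the transform kernel named above, here is a minimal sketch of the generic fast Walsh-Hadamard transform, which needs only additions and subtractions. It illustrates the transform only; it is not the QWHA adapter, its initialization scheme, or code from the linked repository, and the function name `fwht` is hypothetical.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural/Hadamard order).

    The input length must be a power of two. Runs in O(n log n) using
    only additions and subtractions.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 0, 1, 0, 0, 1, 1, 0]
    print(fwht(x))  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying the unnormalized transform twice returns the input scaled by its length, so the inverse is the same routine followed by division by n.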

2509.14968 2026-05-20 cs.LG cs.NI

FAWN: A MultiEncoder Fusion-Attention Wave Network for Integrated Sensing and Communication Indoor Scene Inference

FAWN:一种多编码器融合-注意力波网络用于集成感知与通信室内场景推断

Carlos Barroso-Fernández, Alejandro Calvillo-Fernandez, Antonio de la Oliva, Carlos J. Bernardos

AI总结 本文提出FAWN,一种基于Transformer架构的多编码器融合-注意力波网络,用于整合感知与通信的室内场景推断,通过融合Wi-Fi和5G信号提高环境感知精度。

Comments 7 pages, 6 figures and tables, less than 5500 words. Under revision at IEEE Communication Magazine

详情
AI中文摘要

下一代无线技术有望实现万物互联和智能化的时代。随着对智能需求的增长,网络必须学会更好地理解物理世界。然而,部署专用硬件来感知环境并不总是可行,主要是由于成本和/或复杂性。集成感知与通信(ISAC)在解决这一挑战上迈出了重要一步。在ISAC中,被动感知作为一种成本效益高的解决方案,利用无线通信来感知环境,而不干扰现有通信。然而,当前大多数解决方案仅限于一种技术(主要是Wi-Fi或5G),限制了最大精度。由于不同技术使用不同的频谱,我们看到有必要整合多种技术以扩大覆盖范围。因此,我们利用ISAC被动感知,提出FAWN,一种用于ISAC室内场景推断的多编码器融合-注意力波网络。FAWN基于原始Transformer架构,融合Wi-Fi和5G信息,使网络能够理解物理世界而不干扰当前通信。为了测试我们的解决方案,我们构建了一个原型并将其集成到真实场景中。结果表明,在84%的时间内,误差低于0.6米。

英文摘要

The upcoming generations of wireless technologies promise an era where everything is interconnected and intelligent. As the need for intelligence grows, networks must learn to better understand the physical world. However, deploying dedicated hardware to perceive the environment is not always feasible, mainly due to cost and/or complexity. Integrated Sensing and Communication (ISAC) has made a step forward in addressing this challenge. Within ISAC, passive sensing emerges as a cost-effective solution that reuses wireless communications to sense the environment, without interfering with existing communications. Nevertheless, the majority of current solutions are limited to one technology (mostly Wi-Fi or 5G), constraining the maximum reachable accuracy. As different technologies work in different parts of the spectrum, we see a need to integrate more than one technology to augment the coverage area. Hence, we take advantage of ISAC passive sensing to present FAWN, a MultiEncoder Fusion-Attention Wave Network for ISAC indoor scene inference. FAWN builds on the original Transformer architecture to fuse information from Wi-Fi and 5G, making the network capable of understanding the physical world without interfering with the current communication. To test our solution, we have built a prototype and integrated it into a real scenario. Results show errors below 0.6 m about 84% of the time.

2509.14839 2026-05-20 cs.CV

MapAnything: Evaluating Monocular Metric Depth Models for 3D Urban Asset Localization

MapAnything: 评估单目度量深度模型用于3D城市资产定位

Miriam Louise Carnot, Jonas Kunze, Erik Quinten Fastermann, Eric Peukert, André Ludwig, Bogdan Franczyk

AI总结 本文提出MapAnything框架,通过单目图像自动映射城市物体和事件,利用度量深度估计模型计算物体坐标,验证其在复杂城市环境中的精度,展示其在交通标志和道路损坏等实际应用中的有效性。

详情
AI中文摘要

城市管理部门越来越多地依赖全面的数据库和城市数字孪生来记录交通标志、树木等城市资产以及涂鸦、道路损坏等事件,以有效掌握城市状况。数字化提高了对持续更新的空间数据集的需求,但当前的数据采集和维护过程仍涉及大量人工劳动,带来了显著的可扩展性挑战。本文介绍了MapAnything,一种新颖的地理定位框架,能够从单个单目图像自动映射城市物体和事件。通过利用先进的度量深度估计模型,MapAnything准确计算物体的地理坐标,将2D图像数据转换为有价值的3D空间信息。该方法将估计的相机到物体距离与几何原理和已知相机规格相结合。我们对该框架进行了详细验证,在复杂城市环境中将其距离估计精度与高精度LiDAR点云进行对比。我们的评估提供了在各种距离区间和语义区域(如道路和植被)上的空间性能的细致分析。最后,我们通过具体的使用案例,如映射交通标志和道路路面损坏,展示了该框架的实际有效性,并提供了将其整合到自动化城市库存系统中的建议。

英文摘要

City administrations increasingly rely on comprehensive databases and urban digital twins of city assets, such as traffic signs and trees, as well as incidents like graffiti or road damage, to maintain an effective overview of urban conditions. Digitization has increased the demand for continuously updated spatial datasets, yet current data acquisition and maintenance processes still involve considerable manual effort, posing significant scalability challenges. This paper introduces MapAnything, a novel geo-localization framework that automates the spatial mapping of urban objects and incidents from a single monocular image. By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information. The methodology integrates the estimated camera-to-object distance with geometric principles and known camera specifications. We present a detailed validation of the framework, comparing its distance-estimation accuracy against high-precision LiDAR point clouds in complex urban environments. Our evaluation provides a granular analysis of spatial performance across various distance intervals and semantic areas, such as roads and vegetation. Finally, we demonstrate the framework's practical efficacy through specific use cases, including mapping traffic signs and road pavement damage, and provide recommendations for its integration into automated urban inventory systems.
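The abstract says the estimated camera-to-object distance is combined with geometric principles and camera specifications to obtain geocoordinates, without giving the formulas. One plausible final step, sketched here under stated assumptions, is the standard great-circle destination formula on a spherical Earth: given the camera position, the compass bearing toward the object, and the estimated distance, compute the object's latitude and longitude. The function name `project_object` and the spherical-Earth radius are assumptions, not details from the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def project_object(lat_deg, lon_deg, bearing_deg, distance_m):
    """Return (lat, lon) in degrees of a point distance_m away along bearing_deg.

    bearing_deg is measured clockwise from true north at the camera position.
    """
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    theta = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance in radians

    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(theta)
    )
    lon2 = lon1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)


if __name__ == "__main__":
    # Hypothetical camera pose: an object estimated 25 m away, due east.
    print(project_object(51.3397, 12.3731, 90.0, 25.0))
```

In practice the bearing would come from the camera heading plus the object's horizontal angle in the image, which depends on the known field of view; that step is omitted here.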

2509.14787 2026-05-20 cs.RO

COMPASS: Confined-space Manipulation Planning with Active Sensing Strategy

COMPASS:基于主动感知策略的受限空间操作规划

Qixuan Li, Chen Le, Dongyue Huang, Jincheng Yu, Xinlei Chen

AI总结 本文提出COMPASS框架,通过主动感知策略在受限和杂乱环境中实现安全操作,提高了操作成功率。

Comments Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

详情
AI中文摘要

在受限和杂乱环境中进行操作仍然是一个重大挑战,由于部分可观测性和复杂的配置空间。有效在这些环境中进行操作需要一种智能探索策略来安全地理解和搜索目标。在本文中,我们提出COMPASS,一种多阶段探索和操作框架,其特征是具有操作意识的基于采样的规划器。首先,我们通过近场意识扫描减少碰撞风险,以构建局部碰撞图。此外,我们采用多目标效用函数来寻找同时具有信息性和有利于后续操作的视角。此外,我们执行一种受限操作优化策略,以生成遵守障碍物约束的操作姿态。为了系统评估方法在这些困难下的性能,我们提出了一个包含四个挑战性场景的受限空间探索和操作基准。与为其他机器人设计的探索方法和仅考虑信息增益的方法相比,我们的框架在模拟中将操作成功率提高了24.25%。现实世界实验展示了我们的方法在受限环境中进行主动感知和操作的能力。

英文摘要

Manipulation in confined and cluttered environments remains a significant challenge due to partial observability and complex configuration spaces. Effective manipulation in such environments requires an intelligent exploration strategy to safely understand the scene and search for the target. In this paper, we propose COMPASS, a multi-stage exploration and manipulation framework featuring a manipulation-aware sampling-based planner. First, we reduce collision risks with a near-field awareness scan to build a local collision map. Additionally, we employ a multi-objective utility function to find viewpoints that are both informative and conducive to subsequent manipulation. Moreover, we perform a constrained manipulation optimization strategy to generate manipulation poses that respect obstacle constraints. To systematically evaluate the method's performance under these difficulties, we propose a benchmark for confined-space exploration and manipulation containing four challenging scenarios. Compared to exploration methods that were designed for other robots and consider only information gain, our framework increases the manipulation success rate by 24.25% in simulation. Real-world experiments demonstrate our method's capability for active sensing and manipulation in confined environments.

2508.16112 2026-05-20 cs.AI

IR-Agent: Expert-Inspired LLM Agents for Structure Elucidation from Infrared Spectra

IR-Agent: 专家启发的LLM代理用于从红外光谱中解析分子结构

Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Kibum Kim, Chanyoung Park

AI总结 本文提出IR-Agent,一种新的多代理框架,用于从红外光谱中解析分子结构,通过模拟专家驱动的分析过程,提升结构解析的准确性。

Comments ICLR 2026

详情
AI中文摘要

光谱分析为解析未知材料提供了关键线索。在各种技术中,红外光谱(IR)在实验室环境中扮演着重要角色,因为它具有高可访问性和低成本。然而,现有方法往往无法反映专家分析过程,并且在整合多种化学知识方面缺乏灵活性,这对于现实世界的分析场景至关重要。在本文中,我们提出了IR-Agent,一种新的多代理框架,用于从红外光谱中解析分子结构。该框架旨在模拟专家驱动的红外分析过程,并具有内在的可扩展性。每个代理专门处理红外解释的特定方面,其互补作用使集成推理成为可能,从而提高整体结构解析的准确性。通过广泛的实验,我们证明了IR-Agent不仅在实验红外光谱上提高了基线性能,而且在各种化学信息形式上表现出强大的适应性。

英文摘要

Spectral analysis provides crucial clues for the elucidation of unknown materials. Among various techniques, infrared spectroscopy (IR) plays an important role in laboratory settings due to its high accessibility and low cost. However, existing approaches often fail to reflect expert analytical processes and lack flexibility in incorporating diverse types of chemical knowledge, which is essential in real-world analytical scenarios. In this paper, we propose IR-Agent, a novel multi-agent framework for molecular structure elucidation from IR spectra. The framework is designed to emulate expert-driven IR analysis procedures and is inherently extensible. Each agent specializes in a specific aspect of IR interpretation, and their complementary roles enable integrated reasoning, thereby improving the overall accuracy of structure elucidation. Through extensive experiments, we demonstrate that IR-Agent not only improves baseline performance on experimental IR spectra but also shows strong adaptability to various forms of chemical information.

2508.14134 2026-05-20 cs.LG cs.AI

ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification

ERIS: 一种面向分布外时间序列分类的能量引导特征解耦框架

Xin Wu, Fei Teng, Ji Zhang, Xingwang Li, Yuxuan Liang

AI总结 本文提出ERIS框架,通过能量引导机制和语义指导,解决时间序列分类中分布外数据的可靠特征解耦问题,提升模型鲁棒性和泛化能力。

详情
Journal ref
Information Fusion 135, 104407 (2026)
AI中文摘要

理想的时间序列分类(TSC)应能捕捉不变表示,但实现对分布外(OOD)数据的可靠性能仍是一个核心障碍。这一障碍源于模型内在地将领域特定和标签相关特征纠缠在一起,导致虚假相关性。尽管特征解耦旨在解决这一问题,但当前方法大多缺乏必要的语义方向,无法隔离真正普遍的特征。为此,我们提出一个端到端的Energy-Regularized Information for Shift-Robustness(ERIS)框架,以实现引导且可靠的特征解耦。核心思想是有效的解耦不仅需要数学约束,还需要语义指导来锚定分离过程。ERIS集成了三个关键机制来实现这一目标。具体来说,我们首先引入一种能量引导校准机制,为分离过程提供关键的语义指导,使模型能够自我校准。此外,一个权重层面正交性策略强制领域特定和标签相关特征之间的结构性独立,从而减轻它们的干扰。此外,一个辅助对抗泛化机制通过注入结构化扰动来增强鲁棒性。在四个基准测试中的实验表明,ERIS在统计上显著优于最先进的基线方法,始终保持最佳性能排名。

英文摘要

An ideal time series classification (TSC) should be able to capture invariant representations, but achieving reliable performance on out-of-distribution (OOD) data remains a core obstacle. This obstacle arises from the way models inherently entangle domain-specific and label-relevant features, resulting in spurious correlations. While feature disentanglement aims to solve this, current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. To address this, we propose an end-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework to enable guided and reliable feature disentanglement. The core idea is that effective disentanglement requires not only mathematical constraints but also semantic guidance to anchor the separation process. ERIS incorporates three key mechanisms to achieve this goal. Specifically, we first introduce an energy-guided calibration mechanism, which provides crucial semantic guidance for the separation, enabling the model to self-calibrate. Additionally, a weight-level orthogonality strategy enforces structural independence between domain-specific and label-relevant features, thereby mitigating their interference. Moreover, an auxiliary adversarial generalization mechanism enhances robustness by injecting structured perturbations. Experiments across four benchmarks demonstrate that ERIS achieves a statistically significant improvement over state-of-the-art baselines, consistently securing the top performance rank.

2507.15698 2026-05-20 cs.CL cs.AI cs.LG

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

CoLD: 用于数学推理中过程奖励模型的反事实引导长度去偏

Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Weiwen Liu, Haoxuan Li, Yong Yu, Weinan Zhang, Mengyue Yang

AI总结 本文提出CoLD,一种通过反事实引导消除过程奖励模型中长度偏差的统一框架,旨在提高多步骤推理的准确性和简洁性,同时提升下游强化学习性能和跨领域泛化能力。

详情
AI中文摘要

过程奖励模型(PRMs)在评估和引导大型语言模型(LLMs)的多步推理中起着核心作用,特别是在数学问题解决中。然而,我们发现现有PRMs存在普遍的长度偏差:即使语义内容和逻辑有效性未变,它们也倾向于对较长的推理步骤赋予更高的分数。这种偏差会削弱奖励预测的可靠性,并导致推理过程中输出过于冗长。为了解决这一问题,我们提出了CoLD(Counterfactually-Guided Length Debiasing),一种统一的框架,通过三个组件减轻长度偏差:显式的长度惩罚调整、一个训练以捕捉虚假长度相关信号的学得偏差估计器,以及一种联合训练策略,强制奖励预测的长度不变性。我们的方法基于反事实推理,并受因果图分析的启发。在MATH500和GSM-Plus上的广泛实验表明,CoLD提高了步骤选择的准确性,并鼓励了更简洁、逻辑有效的推理。此外,它一致提高了下游RL性能,并通过减轻长度偏差在跨领域中泛化,展示了CoLD强大的泛化能力。

英文摘要

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD improves accuracy in step selection and encourages more concise, logically valid reasoning. Furthermore, it consistently improves downstream RL performance and generalizes across domains by mitigating length bias, demonstrating CoLD's strong generalization capability.

2507.10614 2026-05-20 cs.LG cs.AI

Fine-tuning Large Language Model for Automated Algorithm Design

微调大语言模型用于自动化算法设计

Fei Liu, Rui Zhang, Xi Lin, Zhichao Lu, Qingfu Zhang

AI总结 本文探讨了微调大语言模型以提升其在自动化算法设计中的性能,提出了一种多样性感知的排名策略和直接偏好优化方法,通过实验验证了任务特定微调在不同算法设计任务中的有效性。

详情
AI中文摘要

将大语言模型(LLMs)整合到自动化算法设计中已展现出巨大潜力。一种常见的方法是将LLMs嵌入到搜索过程中,以迭代生成和优化候选算法。然而,现有大多数方法依赖于为通用编码任务训练的现成LLMs,留下一个关键问题:是否需要专门针对算法设计训练的LLMs?如果是,如何有效获得此类LLMs,并且它们在不同算法设计任务中有多好的泛化能力?在本文中,我们通过探索针对算法设计的LLMs微调,初步回答了这些问题。我们引入了一种多样性感知的排名(DAR)采样策略,以平衡训练数据的多样性和质量,然后利用直接偏好优化来高效地对齐LLMs的输出与任务目标。我们的实验主要在Llama-3.2-1B-Instruct和Llama-3.1-8B-Instruct上进行,针对三个不同的算法设计任务,此外,openPangu-Embedded模型还作为辅助比较在可允许集合问题上进行评估。结果表明,微调后的LLMs在较小的Llama-3.2-1B-Instruct上显著优于其现成的对应者,并在可允许集合问题上与较大的Llama-3.1-8B-Instruct匹配。此外,我们观察到良好的泛化能力:在特定算法设计任务上微调的LLMs在相关任务中也表现出色。这些发现突显了LLMs在算法设计中任务特定适应的价值,并为未来研究开辟了新途径。我们的代码可在https://github.com/RayZhhh/dpo-aad上公开获取。

英文摘要

The integration of large language models (LLMs) into automated algorithm design has shown promising potential. A prevalent approach embeds LLMs within search routines to iteratively generate and refine candidate algorithms. However, most existing methods rely on off-the-shelf LLMs trained for general coding tasks, leaving a key question open: Do we need LLMs specifically tailored for algorithm design? If so, how can such LLMs be effectively obtained and how well can they generalize across different algorithm design tasks? In this paper, we take a preliminary step toward answering these questions by exploring fine-tuning of LLMs for algorithm design. We introduce a Diversity-Aware Rank-based (DAR) sampling strategy to balance training data diversity and quality, then we leverage direct preference optimization to efficiently align LLM outputs with task objectives. Our experiments are primarily conducted on Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct across three distinct algorithm design tasks, with openPangu-Embedded models additionally included as auxiliary comparisons on the admissible set problem. Results suggest that fine-tuned LLMs can significantly outperform their off-the-shelf counterparts with the smaller Llama-3.2-1B-Instruct and match the larger Llama-3.1-8B-Instruct on the admissible set problem. Moreover, we observe promising generalization: LLMs fine-tuned on specific algorithm design tasks also improve performance on related tasks with varying settings. These findings highlight the value of task-specific adaptation for LLMs in algorithm design and open new avenues for future research. Our code is publicly available at https://github.com/RayZhhh/dpo-aad.
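The abstract relies on direct preference optimization (DPO) without spelling out the objective. As a reminder of what is being optimized, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities under the trained policy and a frozen reference model. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions; the paper's DAR sampling and training setup are not shown.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage on the chosen sample minus that on the rejected one.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy prefers the chosen algorithm more than the reference does.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the likelihood of the preferred candidate relative to the reference, which is how ranked algorithm candidates could serve as a training signal.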

2507.10492 2026-05-20 cs.CV cs.AI cs.LG

BenchReAD: A systematic benchmark for retinal anomaly detection

BenchReAD: 一种系统性的视网膜异常检测基准

Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin

AI总结 本研究提出BenchReAD基准,旨在解决视网膜异常检测领域缺乏全面且公开的评估标准的问题,通过系统化的数据和算法分类,引入了全监督方法DRA,并改进为NFM-DRA,实现了SOTA性能。

Comments MICCAI 2025

详情
AI中文摘要

视网膜异常检测在筛查眼部和系统性疾病中起着关键作用。尽管其重要性,该领域的进展受到缺乏全面且公开可用的基准的阻碍,这对于公平评估和推进方法至关重要。由于这一限制,与视网膜图像相关的先前异常检测工作受到(1)异常类型有限且过于简单的限制,(2)测试集几乎饱和,以及(3)缺乏泛化评估的影响,导致实验设置说服力不足。此外,现有医学异常检测基准大多专注于单类监督方法(仅使用负样本训练),忽视了临床实践中大量可用的标记异常数据和未标记数据。为了填补这些差距,我们引入了视网膜异常检测的基准,该基准在数据和算法上都是全面且系统的。通过分类和评估先前方法,我们发现利用解耦异常表示的全监督方法(DRA)取得了最佳性能,但在遇到某些未见异常时性能显著下降。受单类监督学习中记忆库机制的启发,我们提出了NFM-DRA,将其与正常特征记忆结合,以缓解性能下降,建立新的SOTA。该基准可在https://github.com/DopamineLcy/BenchReAD上公开获取。

英文摘要

Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.

2507.05843 2026-05-20 cs.CV

USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining

USIGAN: 用于弱配对图像IHC虚拟染色的不平衡自信息特征传输

Yue Peng, Bing Xiong, Fuqiang Chen, De Eybo, RanRan Zhang, Wanming Hu, Jing Cai, Wenjian Qin

AI总结 本文提出USIGAN方法,通过提取全局形态学语义来解决弱配对条件下IHC虚拟染色的不一致问题,改进生成结果的病理语义一致性。

详情
AI中文摘要

免疫组化(IHC)虚拟染色任务旨在从H&E图像生成虚拟IHC图像,同时保持与相邻切片的病理语义一致性。该任务通过生成模型实现形态结构与染色模式的跨域映射,为病理分析提供高效且经济的解决方案。然而,在弱配对条件下,相邻切片之间的空间异质性带来了显著挑战,可能导致不准确的一对多映射并生成与相邻切片病理语义不一致的结果。为了解决这个问题,我们提出了一种新的IHC虚拟染色的不平衡自信息特征传输方法,称为USIGAN,该方法在不依赖位置对应的情况下提取全局形态学语义。通过在联合边缘分布中移除弱配对项,我们有效减轻了弱配对对联合分布的影响,从而显著提高了生成结果的内容一致性和病理语义一致性。此外,我们设计了不平衡最优传输一致性(UOT-CTM)机制和病理自对应(PC-SCM)机制,以构建H&E与生成IHC在图像级别以及真实IHC与生成IHC图像集内的相关矩阵。在两个公开数据集上的实验表明,我们的方法在多个临床相关指标上表现优异,如IoD和Pearson-R相关性,证明了更好的临床相关性。

英文摘要

Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC at the image level, and between real IHC and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.

2507.01123 2026-05-20 cs.CV cs.LG eess.IV

Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

利用多源卫星数据和地理区域的深度学习进行滑坡检测与制图

Rahul A. Burange, Harsh K. Shinde, Omkar Mutyalwar

AI总结 本文提出了一种综合方法,结合多源卫星影像和深度学习模型,以提高滑坡识别和预测的准确性,通过Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型(DEM)层来捕捉影响滑坡发生的关键环境特征,并评估多种地理空间分析技术对检测精度的影响,同时评估了多种先进的深度学习分割模型,如U-Net、DeepLabV3+和Res-Net,以确定其在滑坡检测中的有效性。

Comments 17 pages, 22 figures

详情
Journal ref
JETIR March 2025, Volume 12, Issue 3
AI中文摘要

滑坡对基础设施、经济和人类生命构成严重威胁,需要在多样化的地理区域中进行准确的检测和预测制图。随着深度学习和遥感技术的进步,自动化滑坡检测已变得更加有效。本文提出了一种综合方法,整合多源卫星影像和深度学习模型,以增强滑坡识别和预测。我们利用Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型(DEM)层来捕捉影响滑坡发生的关键环境特征。各种地理空间分析技术被用来评估地形特征、植被覆盖和降雨对检测精度的影响。此外,我们评估了多种先进的深度学习分割模型,包括U-Net、DeepLabV3+和Res-Net,以确定其在滑坡检测中的有效性。所提出的框架有助于发展可靠的早期预警系统,改进灾害风险管理,并促进可持续的土地利用规划。我们的发现为深度学习和多源遥感在创建稳健、可扩展和可转移的滑坡预测模型中的潜力提供了有价值的见解。

英文摘要

Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and Res-Net, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.

2506.14148 2026-05-20 cs.SD cs.CL eess.AS

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

基于声学散射的非侵入式物体分类AI:一项关于头发评估的案例研究

Long-Vu Hoang, Tuan Nguyen, Tran Huy Dat

AI总结 本文提出了一种利用声学散射进行非侵入式物体分类的新方法,通过头发评估的案例研究进行演示。通过发射声学刺激并捕捉带有头发样本的头部对象散射信号,利用AI驱动的深度学习声学分类技术对头发类型和湿度进行分类。我们评估了包括(i)完全监督深度学习、(ii)嵌入式分类、(iii)监督基础模型微调和(iv)自监督模型微调在内的全面方法。我们的最佳策略通过微调自监督模型的所有参数实现了接近90%的分类准确率。这些结果凸显了声学散射作为隐私保护、非接触替代视觉分类的潜力,为各行业应用提供了巨大前景。

Comments This paper has been retracted by the authors. Due to miscommunication, the authorship is incomplete and missing early contributions

详情
AI中文摘要

本文提出了一种利用声学散射进行非侵入式物体分类的新方法,通过头发评估的案例研究进行演示。当入射波与物体相互作用时,会生成散射声场,该声场编码了结构和材料属性。通过发射声学刺激并捕捉头部带头发样本对象的散射信号,我们利用AI驱动的深度学习声学分类技术对头发类型和湿度进行分类。我们评估了包括(i)完全监督深度学习、(ii)嵌入式分类、(iii)监督基础模型微调和(iv)自监督模型微调在内的全面方法。我们的最佳策略通过微调自监督模型的所有参数实现了接近90%的分类准确率。这些结果凸显了声学散射作为隐私保护、非接触替代视觉分类的潜力,为各行业应用提供了巨大前景。

英文摘要

This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.