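arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.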

2505.14479 2026-05-26 cs.AI cs.CL

A Neuro-Symbolic Approach for Reliable Proof Generation with LLMs: A Case Study in Euclidean Geometry

一种用于LLM可靠证明生成的神经符号方法：以欧几里得几何为例

Oren Sultan, Eitan Stern, Dafna Shahaf

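AI summary: Proposes a neuro-symbolic method that combines the generative ability of LLMs with structured components, using analogous-problem retrieval and formal-verifier feedback to significantly improve the accuracy of Euclidean geometry proofs.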

Comments long paper

AI中文摘要

大型语言模型（LLM）在需要严格逻辑推理和符号推理的形式化领域（如数学证明生成）中表现不佳。我们提出一种神经符号方法，结合LLM的生成优势与结构化组件以克服这一挑战。作为概念验证，我们专注于SAT级别的几何问题。我们的方法有两方面：（1）检索类比问题并利用其证明来指导LLM；（2）形式验证器评估生成的证明并提供反馈，帮助模型修正错误证明。我们的方法显著提高了不同模型族的证明准确性，在所有评估模型（OpenAI o1、GPT-5、Gemini-Flash-2.5和Claude Sonnet 4.6）上均取得了显著提升。基础模型的准确率从10%至44%提升至采用我们方法后的68%至96%，其中类比问题指导和验证器反馈均贡献了这些改进。更广泛地说，转向生成可证明正确结论的LLM有望大幅提高其可靠性、准确性和一致性，从而解锁需要可信赖性的复杂任务和关键现实应用。

英文摘要

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof of concept, we focus on SAT-level geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. Our method significantly improves proof accuracy across diverse model families, with gains on all evaluated models: OpenAI o1, GPT-5, Gemini-Flash-2.5, and Claude Sonnet 4.6. Accuracy rises from 10-44% for the base models to 68-96% with our approach, with both analogous problem guidance and verifier feedback contributing to these improvements. More broadly, shifting to LLMs that generate provably correct conclusions has the potential to dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.

2505.03631 2026-05-26 cs.CV

Generalizable Video Quality Assessment via Weak-to-Strong Learning

通过弱到强学习实现可泛化的视频质量评估

Linhan Cao, Wei Sun, Xiangyang Zhu, Kaiwei Zhang, Jun Jia, Yicong Peng, Dandan Zhu, Guangtao Zhai, Xiongkuo Min

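AI summary: Proposes a weak-to-strong learning framework that combines homogeneous and heterogeneous supervision signals with iterative training to improve the generalization of video quality assessment without human annotation.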

Comments Accepted by CVPR 2026

AI中文摘要

视频质量评估（VQA）旨在预测与人类视觉感知一致的视频感知质量，是量化视频处理流程中质量退化的基本工具。主流的VQA范式依赖于人工标注数据集的监督训练，尽管取得了显著进展，但在未见视频内容上仍存在泛化能力差的问题。本文探索弱到强（W2S）学习作为一种无需依赖人工标注数据集的新范式来推进VQA。我们首先提供经验证据，表明直接的W2S策略使强学生模型不仅能在域内基准上匹配其弱教师，还能在分布外（OOD）基准上超越教师，揭示了VQA中独特的弱到强效应。基于这一洞察，我们提出一个新颖框架，从两个方面增强W2S学习：（1）通过可学习排序公式整合来自不同VQA教师（包括现成VQA模型和合成失真模拟器）的同质和异质监督信号；（2）迭代W2S训练，其中每个强学生被回收作为后续循环的教师，逐步聚焦于困难案例。大量实验表明，我们的方法在域内和OOD基准上均达到最先进结果，尤其在OOD场景中表现突出。我们的发现强调W2S学习是打破标注障碍、实现视频质量评估可扩展泛化的原则性途径。我们的数据和代码将在https://github.com/clh124/W2S-VQA提供。

英文摘要

Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets, which, despite substantial progress, still suffers from poor generalization to unseen video content. In this work, we explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy allows a strong student model to not only match its weak teacher on in-domain benchmarks but also surpass it on out-of-distribution (OOD) benchmarks, revealing a distinct weak-to-strong effect in VQA. Building on this insight, we propose a novel framework that enhances W2S learning from two aspects: (1) integrating homogeneous and heterogeneous supervision signals from diverse VQA teachers -- including off-the-shelf VQA models and synthetic distortion simulators -- via a learn-to-rank formulation, and (2) iterative W2S training, where each strong student is recycled as the teacher in subsequent cycles, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results across both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to break annotation barriers and achieve scalable generalization in video quality assessment. Our data and code will be available at https://github.com/clh124/W2S-VQA.

2504.15404 2026-05-26 cs.CV

Context Aware Grounded Teacher for Source Free Object Detection

上下文感知的接地教师用于无源目标检测

Tajamul Ashraf, Rajes Manna, Partha Sarathi Purkayastha, Tavaheed Tariq, Janibul Bashir

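AI summary: To address the context bias and noisy pseudo-labels caused by class imbalance in source-free object detection, proposes Grounded Teacher, a bias-aware framework built on a Relational Context Module and semantic augmentation, which improves minority-class detection through relational regularization and semantic augmentation.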

Comments Accepted in International Journal of Computer Vision (IJCV); Project Webpage: https://tajamul21.github.io/Grounded_Teacher/

AI中文摘要

无源目标检测（SFOD）面临持续挑战，原因在于类别不平衡驱动的上下文偏差以及噪声伪标签下教师-学生训练的不稳定性。现有技术往往忽略上下文偏差和类别不平衡偏移，尤其是在医疗数据中。为解决此问题，我们提出Grounded Teacher（GT），一种偏差感知的无源框架，通过关系正则化和语义正则化来接地教师模型。为了显式建模类别间的方向性混淆，GT引入关系上下文模块（RCM），维护跨域上下文偏差的指数移动平均（EMA）估计。在此基础上，语义增强（SA）策略通过在源相似和源不相似的目标区域中进行自适应MixUp，选择性地增强少数类和易混淆类，从而提高少数类召回率而不过度拟合主导类别。为了在偏差伪标签下稳定学习，我们设计了语义感知损失（SAL），应用对角归一化权重，防止梯度爆炸，同时强调少数-多数类别的修正。此外，从大型视觉基础模型（LVFMs）导出的冻结专家分支在训练期间作为监督参考，在不增加推理开销的情况下改善伪标签质量。GT的行为驱动偏差量化使其能够跨领域广泛应用，无需依赖数据集先验。在Cityscapes-to-Foggy（50.8 mAP）和医学迁移（DDSM-to-INBreast上+5.9 AP50）上的评估显示出一致的增益和改进的少数类检测，且额外训练成本低于12%。代码和模型可在https://github.com/Tajamul21/Grounded-Teacher获取。

英文摘要

Source-free object detection (SFOD) faces persistent challenges due to class imbalance-driven context bias and instability in teacher-student training under noisy pseudo-labels. Existing techniques tend to ignore context bias and class-imbalance shifts, especially in medical data. To tackle this, we propose Grounded Teacher (GT), a bias-aware source-free framework that grounds the teacher model through relational and semantic regularization. To explicitly model directional confusion between classes, GT introduces a Relational Context Module (RCM) that maintains an exponential moving average (EMA) estimate of cross-domain contextual bias. Building upon this, a Semantic Augmentation (SA) strategy selectively augments minority and confusable classes through adaptive MixUp in both source-similar and source-dissimilar target regions, improving minority recall without overfitting dominant categories. To stabilize learning under biased pseudo-labels, we design a Semantic-Aware Loss (SAL) that applies diagonally normalized weights, preventing gradient explosion while emphasizing minority-majority corrections. Additionally, a frozen Expert branch derived from large vision foundation models (LVFMs) serves as a supervisory reference during training, refining pseudo-label quality without adding inference overhead. GT's behavior-driven bias quantification makes it broadly applicable across domains without relying on dataset priors. Evaluations on Cityscapes-to-Foggy (50.8 mAP) and medical transfers (+5.9 AP50 on DDSM-to-INBreast) show consistent gains and improved minority-class detection, with less than 12% additional training cost. Code and model are available at https://github.com/Tajamul21/Grounded-Teacher.
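As a rough illustration of the EMA idea mentioned for the Relational Context Module, the sketch below keeps an exponentially averaged class-confusion matrix. The function name `ema_update`, the momentum value, and the choice of row-normalized confusion counts as the tracked statistic are assumptions made for illustration; the abstract does not specify how GT computes its bias estimate.

```python
import numpy as np

def ema_update(running, batch_confusion, momentum=0.99):
    """Blend a per-batch confusion estimate into a running EMA (hypothetical helper)."""
    batch = np.asarray(batch_confusion, dtype=float)
    # Row-normalize so each row is a distribution over predicted classes;
    # rows with no samples in this batch are left as all zeros.
    row_sums = batch.sum(axis=1, keepdims=True)
    normalized = np.divide(batch, row_sums, out=np.zeros_like(batch), where=row_sums > 0)
    # Standard EMA: keep most of the old estimate, mix in a little of the new one.
    return momentum * running + (1.0 - momentum) * normalized

running = np.eye(3)                      # start from "no confusion"
batch = [[8, 2, 0], [0, 0, 0], [1, 1, 8]]  # rows: reference class, cols: predicted class
running = ema_update(running, batch)
```

Off-diagonal entries that grow over time would indicate a persistent, directional confusion from one class toward another, which is the kind of signal the abstract describes.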

2504.12474 2026-05-26 cs.CL cs.AI

Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

在文本属性图中整合结构信号与语义信号：BiGTex

Azadeh Beiranvand, Seyed Mehdi Vahidipour

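AI summary: Proposes the BiGTex architecture, which achieves bidirectional attention between GNNs and LLMs through stacked Graph-Text Fusion Units and, trained with parameter-efficient fine-tuning (LoRA), reaches state-of-the-art performance on node classification and generalizes effectively to link prediction.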

Comments 26 pages, 4 figures

DOI: 10.1016/j.mlwa.2026.100921
Journal ref: Machine Learning with Applications 24 (2026) 100921

AI中文摘要

文本属性图（TAGs）在表示学习中提出了独特挑战，要求模型同时捕捉节点关联文本的语义丰富性和图的结构依赖性。图神经网络（GNNs）擅长建模拓扑信息，但缺乏处理非结构化文本的能力。相反，大型语言模型（LLMs）精通文本理解，但通常不了解图结构。在这项工作中，我们提出了BiGTex（双向图文本），一种通过堆叠图-文本融合单元紧密集成GNN和LLM的新型架构。每个单元允许文本和结构表示之间的相互注意力，使信息能够双向流动：文本影响结构，结构指导文本解释。所提出的架构使用参数高效微调（LoRA）进行训练，保持LLM冻结同时适应任务特定信号。在五个基准数据集上的大量实验表明，BiGTex在节点分类中实现了最先进的性能，并有效泛化到链接预测。消融研究进一步强调了软提示和双向注意力在模型成功中的重要性。

英文摘要

Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.

2504.05108 2026-05-26 cs.AI cs.LG cs.NE

Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning

利用大语言模型发现算法：进化搜索遇见强化学习

Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, Caglar Gulcehre

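AI summary: Proposes continuously refining the LLM through reinforcement-learning fine-tuning, combined with evolutionary search, to accelerate the discovery of better algorithms; validated on combinatorial optimization tasks.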

Comments 34 pages

AI中文摘要

发现解决复杂问题的高效算法一直是数学和计算机科学中的重大挑战，多年来需要大量人类专业知识。近期，基于大语言模型（LLMs）的进化搜索在加速跨领域算法发现方面展现出潜力，特别是在数学和优化领域。然而，现有方法将LLM视为静态生成器，错过了利用进化探索获得的信号更新模型的机会。在这项工作中，我们提出通过强化学习（RL）微调持续优化搜索算子——即LLM，从而增强基于LLM的进化搜索。我们的方法利用进化搜索作为探索策略来发现改进的算法，而RL则基于这些发现优化LLM策略。我们在组合优化任务上的实验表明，将RL与进化搜索相结合加速了更优算法的发现，展示了RL增强的进化策略在算法设计中的潜力。

英文摘要

Discovering efficient algorithms for solving complex problems has been an outstanding challenge in mathematics and computer science, requiring substantial human expertise over the years. Recent advancements in evolutionary search with large language models (LLMs) have shown promise in accelerating the discovery of algorithms across various domains, particularly in mathematics and optimization. However, existing approaches treat the LLM as a static generator, missing the opportunity to update the model with the signal obtained from evolutionary exploration. In this work, we propose to augment LLM-based evolutionary search by continuously refining the search operator - the LLM - through reinforcement learning (RL) fine-tuning. Our method leverages evolutionary search as an exploration strategy to discover improved algorithms, while RL optimizes the LLM policy based on these discoveries. Our experiments on combinatorial optimization tasks demonstrate that integrating RL with evolutionary search accelerates the discovery of superior algorithms, showcasing the potential of RL-enhanced evolutionary strategies for algorithm design.

2504.00816 2026-05-26 cs.CV physics.med-ph

Two-stage deep learning framework for the restoration of incomplete-ring PET images

用于修复不完整环PET图像的两阶段深度学习框架

Yeqi Fang, Rong Zhou

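AI summary: Proposes a two-stage deep learning framework that, without time-of-flight information, restores high-quality PET images from incomplete-ring data with about 50% missing coincidences, using a projection-domain Attention U-Net to predict the missing sinogram sections and a cascaded U-Net with a warm-start diffusion model for image refinement.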

Comments 17 pages, 5 figures

AI中文摘要

正电子发射断层扫描（PET）是一种重要的分子成像工具，广泛应用于医学。传统的PET系统依赖完整的探测器环来实现全角度覆盖和可靠的数据收集。然而，由于硬件故障、成本限制或特定临床需求，出现了不完整环PET扫描仪。标准重建算法由于数据完整性的降低和几何不一致性，在这些系统中往往性能下降。我们提出了一种两阶段深度学习框架，无需任何飞行时间（TOF）信息，即可从约50%缺失符合事件的数据中恢复高质量图像——这是之前基于CNN方法处理损失水平的两倍。该流程分两个阶段运行：投影域注意力U-Net首先通过利用相邻切片的空间上下文预测正弦图的缺失部分，然后使用OSEM算法重建完整数据，并将其传递给级联U-Net和热启动扩散模型进行图像细化。该模块从U-Net粗预测而非纯高斯噪声开始反向扩散过程。使用来自真实扫描的613个模拟脑体积（196个健康脑样本、217个阿尔茨海默病样本和200个轻度认知障碍样本），结果表明我们的模型成功保留了大部分解剖结构和示踪剂分布特征，PSNR为38.18至38.59 dB，SSIM为0.9904至0.9925。我们的两阶段深度学习框架有效地从超过50%的不完整环数据中恢复高质量PET图像，实现了接近完整的解剖保真度和鲁棒性能，无需TOF信息。

英文摘要

Positron Emission Tomography (PET) is an important molecular imaging tool widely used in medicine. Traditional PET systems rely on complete detector rings for full angular coverage and reliable data collection. However, incomplete-ring PET scanners have emerged due to hardware failures, cost constraints, or specific clinical needs. Standard reconstruction algorithms often suffer from performance degradation with these systems because of reduced data completeness and geometric inconsistencies. We present a two-stage deep-learning framework that, without incorporating any time-of-flight (TOF) information, restores high-quality images from data with about 50% missing coincidences - double the loss levels previously addressed by CNN-based methods. The pipeline operates in two stages: a projection-domain Attention U-Net first predicts the missing sections of the sinogram by leveraging spatial context from neighbouring slices, after which the completed data are reconstructed with the OSEM algorithm and passed to a cascaded U-Net & warm-start diffusion model for image refinement. This module starts the reverse diffusion process from the U-Net coarse prediction rather than pure Gaussian noise. Using 613 simulated brain volumes from real scans (196 healthy brain samples, 217 Alzheimer's disease samples, and 200 Mild Cognitive Impairment samples), the results show that our model successfully preserves most anatomical structures and tracer distribution features, with a PSNR of 38.18 to 38.59 dB and an SSIM of 0.9904 to 0.9925. Our two-stage deep-learning framework effectively restores high-quality PET images from over 50% incomplete-ring data, achieving near-complete anatomical fidelity and robust performance without requiring TOF information.
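The abstract states that reverse diffusion starts from the U-Net coarse prediction rather than pure Gaussian noise, but not how. One common way to do this, shown below as an assumption rather than the authors' procedure, is to forward-noise the coarse image to an intermediate step of a DDPM-style schedule and begin denoising from there. The function name `warm_start_state` and the parameter `alpha_bar_t` (cumulative signal fraction at the chosen step) are hypothetical.

```python
import numpy as np

def warm_start_state(coarse, alpha_bar_t, rng):
    """Noise a coarse prediction to an intermediate diffusion step (illustrative only)."""
    noise = rng.standard_normal(coarse.shape)
    # DDPM-style forward noising: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.
    return np.sqrt(alpha_bar_t) * coarse + np.sqrt(1.0 - alpha_bar_t) * noise

rng = np.random.default_rng(0)
coarse = np.ones((2, 2))                 # stand-in for a U-Net coarse image
x_start = warm_start_state(coarse, 0.5, rng)
```

With `alpha_bar_t` close to 1 the starting point is nearly the coarse image; with `alpha_bar_t` close to 0 it approaches the pure-noise start that the warm-start design avoids.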

2503.23670 2026-05-26 cs.CV

Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation

学习双射曲面参数化以通过网格变形从稀疏点云推断符号距离函数

Takeshi Noda, Chao Chen, Junsheng Zhou, Weiqi Zhang, Yu-Shen Liu, Zhizhong Han

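AI summary: Proposes a method that combines a dynamic deformation network with bijective surface parameterization and grid deformation optimization to predict signed distance functions end-to-end from sparse point clouds, significantly outperforming existing methods.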

Comments Accepted by Conference on Computer Vision and Pattern Recognition (CVPR) 2025. Project page: https://takeshie.github.io/Bijective-SDF

AI中文摘要

从稀疏点云推断符号距离函数（SDF）仍然是曲面重建中的一个挑战。关键在于稀疏点云缺乏学习连续场所需的详细几何信息。为解决此问题，我们提出了一种新颖的方法，学习一个动态变形网络以端到端方式预测SDF。为了从稀疏点参数化连续曲面，我们提出了双射曲面参数化（BSP），从局部块学习全局形状。具体来说，我们为从参数域到3D局部块的稀疏点构建双射映射，将块整合到全局曲面中。同时，我们将网格变形优化（GDO）引入曲面逼近，以优化网格点的变形并进一步细化参数曲面。在合成和真实扫描数据集上的实验结果表明，我们的方法显著优于当前最先进的方法。项目页面：https://takeshie.github.io/Bijective-SDF

英文摘要

Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods. Project page: https://takeshie.github.io/Bijective-SDF

2503.19605 2026-05-26 cs.LG cs.CL math.ST stat.TH

Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral

Rademacher复杂度和Dudley熵积分的泛化误差界的Lean形式化

Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda

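AI summary: Formalizes the Rademacher-complexity generalization error bound in Lean 4 via a formal symmetrization argument, a bounded-differences analysis, and McDiarmid's inequality; extends it from countable hypothesis classes to separable topological index sets; and applies it to obtain empirical Rademacher bounds for linear predictors and a Dudley entropy integral bound.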

Comments accepted at ITP2026

AI中文摘要

理解和证明机器学习算法的泛化性能——即从训练误差获得测试误差的理论估计——是统计学习理论的核心主题。在用于推导此类保证的众多复杂度度量中，Rademacher复杂度提供了尖锐的、数据相关的界，其适用范围远超经典的VC维理论。在本研究中，我们基于Mathlib库中可用的测度论概率论，在Lean 4中形式化了Rademacher复杂度的泛化误差界。我们的开发提供了一个经过机械检查的流水线，从经验和期望Rademacher复杂度的定义开始，经过形式化的对称化论证和有界差异分析，通过形式化证明的McDiarmid不等式得到高概率一致偏差界。一个关键的技术贡献是可重用机制，通过归约到可数稠密子集，将结果从可数假设类（其中上确界的可测性在Mathlib中直接成立）提升到可分离拓扑索引集。作为抽象定理的工作应用，我们机械化了$\ell_2$和$\ell_1$正则化下线性预测器的标准经验Rademacher界，并且我们还形式化了基于覆盖数和链式构造的Dudley型熵积分界。

英文摘要

Understanding and certifying the generalization performance of machine learning algorithms -- i.e. obtaining theoretical estimates of the test error from the training error -- is a central theme of statistical learning theory. Among the many complexity measures used to derive such guarantees, Rademacher complexity yields sharp, data-dependent bounds that apply well beyond classical VC-dimension theory. In this study, we formalize the generalization error bound by Rademacher complexity in Lean 4, building on measure-theoretic probability theory available in the Mathlib library. Our development provides a mechanically-checked pipeline from the definitions of empirical and expected Rademacher complexity, through a formal symmetrization argument and a bounded-differences analysis, to high-probability uniform deviation bounds via a formally proved McDiarmid inequality. A key technical contribution is a reusable mechanism for lifting results from countable hypothesis classes (where measurability of suprema is straightforward in Mathlib) to separable topological index sets via a reduction to a countable dense subset. As worked applications of the abstract theorem, we mechanize standard empirical Rademacher bounds for linear predictors under $\ell_2$ and $\ell_1$ regularizations, and we also formalize a Dudley-type entropy integral bound based on covering numbers and a chaining construction.
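For orientation, the textbook form of the bound that this pipeline leads to can be written as follows. This is the standard statement for a class of $[0,1]$-valued functions and an i.i.d. sample; the exact hypotheses and constants in the Lean development may differ, and the notation ($\mathcal{F}$, $\mathfrak{R}_n$, $\sigma_i$) is chosen here for illustration.

```latex
% Setting: \mathcal{F} is a class of functions f : \mathcal{Z} \to [0,1],
% Z_1, \dots, Z_n are i.i.d., and \sigma_1, \dots, \sigma_n are independent
% Rademacher signs. Expected Rademacher complexity:
\mathfrak{R}_n(\mathcal{F})
  = \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(Z_i)\Big].
% Symmetrization bounds the expected uniform deviation by 2\,\mathfrak{R}_n(\mathcal{F});
% McDiarmid's inequality (bounded differences 1/n) adds the concentration term.
% With probability at least 1 - \delta:
\sup_{f \in \mathcal{F}} \Big( \mathbb{E}[f(Z)] - \frac{1}{n}\sum_{i=1}^{n} f(Z_i) \Big)
  \;\le\; 2\,\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}.
```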

2502.21297 2026-05-26 cs.CL

Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind

说服应该是双盲的：基于因果心智理论的多领域对话数据集

Dingyi Zhang, Linhai Zhang, Fanglei Qu, Ziqing Zhuang, Deyu Zhou

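AI summary: Proposes ToMMA, a multi-agent framework based on causal theory of mind, to build the double-blind persuasion dialogue dataset CToMPersu, addressing information leakage in existing datasets and improving dialogue realism and persuasiveness.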

Comments 6 pages

2502.15835 2026-05-26 cs.CL cs.AI cs.SE

Pragmatic Reasoning improves LLM Code Generation

语用推理提升LLM代码生成

Zhuchen Cao, Sven Apel, Adish Singla, Vera Demberg

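AI summary: Proposes CodeRSA, which reranks candidate code through local pragmatic contests to resolve ambiguity in natural language-to-code generation, achieving the strongest average accuracy in 10 of 12 model-benchmark settings.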

AI中文摘要

语用推理帮助对话者通过考虑共享上下文和反事实替代方案，从模糊或未充分指定的信息中推断出预期含义。自然语言到代码生成中也会出现类似的挑战，因为用户指令通常允许多个合理的候选程序。然而，直接的RSA风格推理是困难的，因为它需要对程序空间和替代指令的大空间进行概率估计。我们提出了CodeRSA，一种受RSA启发的重排序方法，通过对采样代码候选进行局部语用竞赛，使语用推理变得可行。CodeRSA构建候选诱导的替代指令，并估计哪些候选最独特地受到原始指令的支持，从而避免了对整个程序-指令空间的全局归一化。我们在HumanEval+、MBPP+和BigCodeBench上使用四个开放权重的指令跟随模型评估了CodeRSA。在12个模型-基准设置中，CodeRSA在10个设置中取得了最强的平均准确率，并在其余情况下保持竞争力。进一步分析表明，其收益来自于将局部成对语用比较与更广泛的全局支持相结合，这为自然语言不确定性下的语言到代码重排序提供了一个可扩展的方向。

英文摘要

Pragmatic reasoning helps interlocutors infer intended meaning from ambiguous or underspecified messages by considering shared context and counterfactual alternatives. Similar challenges arise in natural language-to-code generation, where user instructions often admit multiple plausible candidate programs. However, direct RSA-style inference is difficult because it requires probability estimation over large spaces of programs and alternative instructions. We propose CodeRSA, an RSA-motivated reranking method that makes pragmatic reasoning tractable through local pragmatic contests among sampled code candidates. CodeRSA constructs candidate-induced alternative instructions and estimates which candidates are most distinctively supported by the original instruction, avoiding global normalization over the full program-instruction space. We evaluate CodeRSA on HumanEval+, MBPP+, and BigCodeBench using four open-weight instruction-following models. CodeRSA achieves the strongest average accuracy in 10 of 12 model-benchmark settings and remains competitive in the remaining cases. Further analyses show that its gains come from combining local pairwise pragmatic comparison with broader global support, suggesting a scalable direction for language-to-code reranking under natural-language uncertainty.
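To make the RSA idea concrete, the sketch below applies textbook Rational Speech Acts reasoning to a small explicit score matrix. It is not CodeRSA's procedure, which the abstract says avoids global normalization through local contests; the function name `rsa_rerank`, the uniform prior over candidates, and the toy scores are all assumptions for illustration.

```python
import numpy as np

def rsa_rerank(literal, original_idx=0, alpha=1.0):
    """literal[u, c]: literal score of candidate c under instruction u (toy values)."""
    literal = np.asarray(literal, dtype=float)
    # Literal listener: normalize over candidates for each instruction.
    l0 = literal / literal.sum(axis=1, keepdims=True)
    # Pragmatic speaker: for each candidate, normalize over instructions.
    s1 = l0 ** alpha
    s1 = s1 / s1.sum(axis=0, keepdims=True)
    # Pragmatic listener for the original instruction, uniform prior over candidates.
    l1 = s1[original_idx] / s1[original_idx].sum()
    return np.argsort(-l1), l1

# Row 0: the original instruction, equally compatible with both candidates.
# Row 1: an alternative instruction that strongly favours candidate 0.
order, posterior = rsa_rerank([[0.5, 0.5], [0.9, 0.1]])
```

Here candidate 1 is ranked first: candidate 0 would more naturally have been requested with the alternative instruction, so the original instruction supports candidate 1 more distinctively.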

2502.08047 2026-05-26 cs.AI cs.MA

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

WorldGUI: 一个从任意起点进行桌面GUI自动化的交互式基准测试

Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

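AI summary: Proposes the WorldGUI benchmark, covering 10 desktop and web applications, to evaluate the planning robustness of GUI agents under diverse, systematically constructed initial states, and introduces the WorldGUI-Agent framework, which improves reliability in dynamic environments through three critique stages.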

Comments Technical Report

AI中文摘要

近期GUI代理的进展显著提升了视觉定位能力，但稳健的规划仍然具有挑战性，特别是当环境偏离规范初始状态时。在实际应用中，用户通常在工作流程中请求帮助，此时软件可能已部分配置，步骤可能以不同顺序执行，或者界面可能与默认设置不同。这种任务状态变异性普遍存在，但在现有GUI基准测试中评估不足。为解决这一问题，我们引入了WorldGUI，一个涵盖十种广泛使用的桌面和Web应用的基准测试，其任务在多样化、系统构建的初始状态下实例化。这些变化捕捉了真实的人机交互场景，并能够诊断评估代理恢复、调整计划以及处理非默认上下文的能力。我们进一步提出了WorldGUI-Agent，一个简单且与模型无关的框架，围绕三个批评阶段组织规划和执行，提高了动态环境中的可靠性。实验表明，最先进的GUI代理在非默认初始条件下表现出显著的性能下降，揭示了有限的鲁棒性和脆弱的规划行为。我们的基准测试和框架为开发更适应性和可靠的GUI代理奠定了基础。代码和数据可在https://github.com/showlab/WorldGUI获取。

英文摘要

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execution around three critique stages, improving reliability in dynamic environments. Experiments demonstrate that state-of-the-art GUI agents exhibit substantial performance degradation under non-default initial conditions, revealing limited robustness and fragile planning behaviors. Our benchmark and framework provide a foundation for developing more adaptable and reliable GUI agents. The code and data are available at https://github.com/showlab/WorldGUI.

2502.06018 2026-05-26 cs.LG cs.AI

Kolmogorov-Arnold Fourier Networks

Kolmogorov-Arnold 傅里叶网络

Jusheng Zhang, Yijia Fan, Kaitong Cai, Keze Wang, Wenhao Wang

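AI summary: To address parameter explosion and the limited ability to capture high-frequency features in high-dimensional tasks in KANs, proposes the Kolmogorov-Arnold Fourier Network (KAF), which uses spectral reparameterization to replace the local B-spline representation with a global, adaptive spectral representation, and introduces trainable Random Fourier Features and an adaptive hybrid GELU-Fourier activation mechanism, achieving state-of-the-art performance on CV, NLP, audio, and PDE-solving tasks.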

Comments Code: https://github.com/kolmogorovArnoldFourierNetwork/KAF

AI中文摘要

尽管基于Kolmogorov-Arnold的可解释网络（KAN）具有强大的理论表达能力，但在高维任务中面临严重的参数爆炸和捕获高频特征能力有限的问题。为解决这些问题，我们提出了Kolmogorov-Arnold傅里叶网络（KAF），通过谱重参数化从根本上重新定义了KAN范式。我们的主要贡献包括：（1）提出从局部的、基于网格的B样条表示到全局的、自适应的谱表示的基础基变换。这一转变改变了网络的归纳偏置，将参数复杂度从$O(G)$降低到$O(1)$，同时保持表达能力；（2）引入通过谱对齐策略初始化的可训练随机傅里叶特征（RFF），使模型能够打破固定核的平滑性限制，准确捕获高频分量；（3）实现自适应混合GELU-傅里叶激活机制，在训练过程中逐步增强频率表示。大量实验证明了KAF在计算机视觉（CV）、自然语言处理（NLP）、音频和偏微分方程（PDE）求解任务上的优越性，以更高的效率实现了最先进的性能。代码可在https://github.com/kolmogorovArnoldFourierNetwork/KAF获取。

英文摘要

Although Kolmogorov-Arnold-based interpretable networks (KANs) possess strong theoretical expressiveness, they suffer from severe parameter explosion and limited ability to capture high-frequency features in high-dimensional tasks. To address these issues, we propose the Kolmogorov-Arnold Fourier Network (KAF), which fundamentally redefines the KAN paradigm through spectral reparameterization. Our key contributions include: (1) proposing a fundamental basis transformation from the local, grid-based B-spline representation to a global, adaptive spectral representation. This shift changes the network's inductive bias, reducing parameter complexity from $O(G)$ to $O(1)$ while preserving expressiveness; (2) introducing trainable Random Fourier Features (RFF) initialized via a spectral alignment strategy, which allows the model to break the smoothness limitation of fixed kernels and accurately capture high-frequency components; and (3) implementing an adaptive hybrid GELU-Fourier activation mechanism that progressively enhances frequency representation during training. Comprehensive experiments demonstrate the superiority of KAF across computer vision (CV), natural language processing (NLP), audio, and partial differential equation (PDE) solving tasks, achieving state-of-the-art performance with improved efficiency. The code is available at https://github.com/kolmogorovArnoldFourierNetwork/KAF.
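For readers unfamiliar with Random Fourier Features, the sketch below shows the classic fixed-random-projection form of the feature map. In KAF the projection is described as trainable and initialized via spectral alignment, neither of which is reproduced here; the function name `rff_features`, the cosine-only form, and the Gaussian draw for `W` are assumptions for illustration.

```python
import numpy as np

def rff_features(x, W, b):
    """Map inputs x (n, d) to m random Fourier features using W (d, m) and b (m,)."""
    m = W.shape[1]
    # Classic RFF map: scaled cosines of random linear projections.
    return np.sqrt(2.0 / m) * np.cos(x @ W + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 16))              # random frequencies (fixed here)
b = rng.uniform(0.0, 2.0 * np.pi, size=16)    # random phases
features = rff_features(rng.standard_normal((5, 3)), W, b)
```

The number of features `m` is chosen independently of any spline grid size, which is consistent with the abstract's claim that the spectral representation removes the dependence on $G$.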

2501.19389 2026-05-26 cs.LG

Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs

联邦草图LoRA：一种用于异构协作微调大语言模型的灵活框架

Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton

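AI summary: To address heterogeneity when fine-tuning LLMs on resource-constrained clients, proposes federated sketching LoRA (FSLoRA), whose sketching mechanism lets clients selectively update submatrices of the global LoRA modules maintained by the server, with sketching ratios that adapt flexibly to client constraints; provides a convergence analysis, and experiments show it outperforms baselines and improves training efficiency.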

Comments We propose Federated Sketching LoRA (FSLoRA), a theoretically grounded methodology for collaborative LLM fine-tuning that retains LoRA's flexibility while adapting to the communication and computational capabilities of individual clients

AI中文摘要

在资源受限的客户端上微调大语言模型（LLMs）仍然是一个具有挑战性的问题。最近的工作将低秩适应（LoRA）技术与联邦微调相结合，以缓解与客户端模型大小和数据稀缺相关的挑战。然而，资源的异构性仍然是一个关键瓶颈：虽然更高秩的模块通常能提升性能，但不同的客户端能力限制了LoRA可行的秩范围。现有试图解决该问题的方法要么缺乏分析依据，要么增加额外的计算开销，为高效且理论扎实的解决方案留下了很大空白。为了解决这些挑战，我们提出了联邦草图LoRA（FSLoRA），它利用草图机制使客户端能够选择性地更新服务器维护的全局LoRA模块的子矩阵。通过调整决定客户端子矩阵秩的草图比例，FSLoRA灵活地适应客户端特定的通信和计算约束。我们提供了FSLoRA的严格收敛性分析，刻画了草图比例如何影响收敛速度。通过大量实验，我们证明FSLoRA优于基线，并在保持稳定收敛的同时显著提高了训练效率。

英文摘要

Fine-tuning large language models (LLMs) on resource-constrained clients remains a challenging problem. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with client model sizes and data scarcity. Still, the heterogeneity of resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying client capabilities constrain LoRA's feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for efficient and theoretically-grounded solutions. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable clients to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the clients, FSLoRA flexibly adapts to client-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through extensive experiments, we demonstrate that FSLoRA outperforms baselines and significantly improves training efficiency while preserving stable convergence.
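The submatrix idea can be pictured with a small sketch: a client with a given sketching ratio receives only some rank components of the global LoRA factors, updates them, and writes them back. This is a simplified reading of the abstract, not the FSLoRA algorithm; the helper names, the uniform sampling of rank indices, and the omission of any rescaling or server-side aggregation are assumptions.

```python
import numpy as np

def sample_rank_subset(rank, ratio, rng):
    """Pick which rank components a client will touch (hypothetical helper)."""
    k = max(1, int(round(ratio * rank)))
    return np.sort(rng.choice(rank, size=k, replace=False))

def client_submatrices(B, A, idx):
    """Extract the columns of B (d_out, r) and rows of A (r, d_in) selected by idx."""
    return B[:, idx].copy(), A[idx, :].copy()

def write_back(B, A, idx, B_sub, A_sub):
    """Place a client's updated submatrices back into the global factors in place."""
    B[:, idx] = B_sub
    A[idx, :] = A_sub

rng = np.random.default_rng(0)
B, A = np.zeros((4, 8)), np.zeros((8, 3))     # global LoRA factors of rank 8
idx = sample_rank_subset(8, 0.25, rng)         # low-resource client: 2 of 8 components
B_sub, A_sub = client_submatrices(B, A, idx)
write_back(B, A, idx, B_sub + 1.0, A_sub + 1.0)  # stand-in for a local training step
```

A client with a larger ratio would simply receive more rank components, so the same global module can serve clients with different communication and compute budgets.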

2501.18278 2026-05-26 cs.LG

ReactEmbed: A Plug-and-Play Module for Unifying Protein-Molecule Representations Guided by Biochemical Reaction Networks

ReactEmbed: 一种基于生化反应网络统一蛋白质-分子表示的可插拔模块

Amitay Sicherman, Kira Radinsky

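AI summary: Proposes the ReactEmbed module, which uses biochemical reaction networks to align protein and molecule embeddings, yielding a unified cross-domain representation without retraining.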

2501.18196 2026-05-26 cs.LG

GDformer: Going Beyond Subsequence Isolation for Multivariate Time Series Anomaly Detection

GDformer：超越子序列隔离的多变量时间序列异常检测

Qingxiang Liu, Xiaoliang Luo, Chenghao Liu, Sheng Sun, Di Yao, Lvchun Wang, Wei Yu, Yuxuan Liang

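AI summary: Proposes the Global Dictionary-enhanced Transformer (GDformer), which uses a dictionary-based cross-attention mechanism to learn global representations shared by all normal points in the entire series and prototypes to capture the distribution of normal point-global correlation weights, yielding a unified detection criterion based on representation similarity and state-of-the-art performance on five benchmark datasets.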

AI中文摘要

无监督的多变量时间序列异常检测是一项具有挑战性的任务，因为需要在不访问异常点的情况下推导出紧凑的检测标准。现有方法主要基于重构误差或关联分歧，两者都局限于有限视野的孤立子序列，难以提供统一的序列级标准。在本文中，我们提出了全局字典增强Transformer（GDformer），采用改进的基于字典的交叉注意力机制，以培养整个序列中所有正常点共享的全局表示。相应地，交叉注意力图反映了点与全局表示之间的相关权重，这自然导致了基于表示相似性的检测标准。为了促进更紧凑的检测边界，引入了原型来捕获正常点-全局相关权重的分布。GDformer在五个真实世界基准数据集上一致实现了最先进的无监督异常检测性能。进一步的实验验证了全局字典在不同数据集之间具有良好的可迁移性。

英文摘要

Unsupervised anomaly detection of multivariate time series is a challenging task, given the requirement of deriving a compact detection criterion without access to anomaly points. Existing methods are mainly based on reconstruction error or association divergence, both of which are confined to isolated subsequences with limited horizons and can hardly provide a unified series-level criterion. In this paper, we propose the Global Dictionary-enhanced Transformer (GDformer) with a renovated dictionary-based cross-attention mechanism to cultivate the global representations shared by all normal points in the entire series. Accordingly, the cross-attention maps reflect the correlation weights between the point and global representations, which naturally leads to a representation-wise similarity-based detection criterion. To foster a more compact detection boundary, prototypes are introduced to capture the distribution of normal point-global correlation weights. GDformer consistently achieves state-of-the-art unsupervised anomaly detection performance on five real-world benchmark datasets. Further experiments validate that the global dictionary transfers well across datasets.

2412.07333 2026-05-26 cs.CV cs.AI

Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model

基于扩散模型的姿态引导人物图像合成的融合嵌入

Donghwna Lee, Kirok Kim, Jisu Lee, Kyungha Min, Wooju Kim

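AI summary: Proposes the FPDM framework, which uses contrastive learning to explicitly align fused source-pose embeddings with target image embeddings and uses the result as a conditioning signal for generation, addressing texture fidelity and consistency in pose-guided person image synthesis.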

AI中文摘要

姿态引导人物图像合成（PGPIS）旨在生成指定姿态下的人物图像，同时保留源图像的身份和外观。该技术促进了多种应用，包括虚拟试穿、数字化身、动画和手语生成。尽管最近基于扩散的PGPIS取得了高质量结果，但这些模型通常依赖于去噪过程中的隐式特征聚合。因此，细粒度纹理保持有限，即使对于相同身份，也难以确保在姿态和源外观变化下生成一致性。为解决这些限制，我们提出了基于扩散模型的融合嵌入PGPIS（FPDM），这是第一个通过对比学习显式对齐融合源-姿态嵌入与目标图像嵌入，并随后使用学习到的融合嵌入作为生成条件信号的框架。FPDM将图像-姿态融合（IPF）模块集成到我们提出的源增强姿态融合方法中，以学习与目标图像对齐的融合嵌入。然后，我们采用由源外观、目标姿态和学习到的融合嵌入引导的条件扩散模型。在DeepFashion基准和RWTH-PHOENIX-Weather 2014T数据集上的实验表明，在定量和定性评估中，与现有方法相比具有竞争力的性能，消融研究证实显式融合嵌入对齐显著提高了纹理保真度以及跨姿态和源外观变化的一致性。

英文摘要

Pose-Guided Person Image Synthesis (PGPIS) aims to generate human images in specified poses while preserving the identity and appearance of a source image. This technology facilitates diverse applications, including virtual try-on, digital avatars, animation, and sign language generation. Despite the high-quality results of recent diffusion-based PGPIS, these models typically depend on implicit feature aggregation within the denoising process. As a result, fine-grained texture preservation is limited, and even for the same identity, it is difficult to ensure consistent generation under variations in pose and source appearance. To address these limitations, we propose Fusion Embedding for PGPIS using a Diffusion Model (FPDM), the first framework that explicitly aligns fused source-pose embeddings with target image embeddings via contrastive learning, and subsequently employs the learned fusion embedding as a conditioning signal for generation. FPDM integrates an Image-Pose Fusion (IPF) module into our proposed Source-Enhanced Pose Fusion approach to learn a fusion embedding aligned with the target image. We then employ a conditional diffusion model guided by source appearance, target pose, and the learned fusion embedding. Experiments on the DeepFashion benchmark and the RWTH-PHOENIX-Weather 2014T dataset demonstrate competitive performance compared to existing methods in both quantitative and qualitative evaluations, with ablation studies confirming that explicit fusion embedding alignment substantially improves texture fidelity and consistency across pose and source appearance variations.


2410.12673 2026-05-26 cs.CV

MambaBEV: A BEV-based 3D detection model with Mamba2

MambaBEV：基于Mamba2的BEV三维检测模型

Zihan You, Ni Wang, Hao Wang, Qichao Zhao, Jinxiang Wang

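AI summary: Proposes MambaBEV, which uses the Mamba2 state-space model, a TemporalMamba temporal fusion module, and a Mamba-based DETR head to strengthen global context modeling and improve 3D detection accuracy for large objects in autonomous driving.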

Comments ICPR2026

详情

AI中文摘要

自动驾驶中的精确3D物体检测依赖于鸟瞰图（BEV）感知和有效的时序融合。然而，现有基于卷积层或可变形自注意力的融合策略难以建模BEV空间中的全局上下文，导致大型物体的检测精度降低。为解决这一限制，我们提出了MambaBEV，一种新颖的基于BEV的3D物体检测模型，利用Mamba2——一种针对长序列处理优化的先进状态空间模型（SSM）。我们的关键贡献是TemporalMamba，一种时序融合模块，通过专为序列处理设计的BEV特征离散重排机制增强全局上下文建模。此外，我们引入了一个基于Mamba的DETR头以改进多物体表示。在nuScenes数据集上的评估表明，MambaBEV-base达到了51.7%的NDS和42.7%的mAP。此外，在端到端自动驾驶范式中的评估验证了其在运动预测和规划中的有效性。这些结果突显了状态空间模型在提升自动驾驶感知系统中全局上下文理解和大型物体检测方面的潜力。

英文摘要

Accurate 3D object detection in autonomous driving relies on Bird's Eye View (BEV) perception and effective temporal fusion. However, existing fusion strategies based on convolutional layers or deformable self-attention struggle to model global context in BEV space, leading to reduced accuracy for large objects. To address this limitation, we propose MambaBEV, a novel BEV-based 3D object detection model that leverages Mamba2, an advanced state-space model (SSM) optimized for long-sequence processing. Our key contribution is TemporalMamba, a temporal fusion module that enhances global context modeling through a BEV feature discrete rearrangement mechanism tailored for sequential processing. In addition, we introduce a Mamba-based DETR head to improve multi-object representation. Evaluations on the nuScenes dataset demonstrate that MambaBEV-base achieves 51.7% NDS and 42.7% mAP. Furthermore, evaluation within an end-to-end autonomous driving paradigm validates its effectiveness in motion forecasting and planning. These results highlight the potential of state-space models for improving global context understanding and large-object detection in autonomous driving perception systems.


2410.01648 2026-05-26 cs.CL

DeIDClinic: A Risk-Aware Pseudonymization Framework for Clinical Text De-identification and Re-identification Risk Assessment

DeIDClinic：面向临床文本去标识化和重识别风险评估的风险感知假名化框架

Angel Paul, Dhivin Shaji, Lifeng Han, Warren Del-Pinto, Goran Nenadic, Suzan Verberne

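AI summary: Proposes DeIDClinic, a multi-layered framework that integrates domain-adapted transformer models (BioBERT, ClinicalBERT) with a document-level risk assessment module (k-anonymity, l-diversity, t-closeness, and others); it achieves high F1 scores on the i2b2 2014 dataset and supports privacy-preserving data sharing.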

Comments Accepted by and Presented at: LEGAL-CALD-Pseudo2026 @LREC2026

详情

AI中文摘要

敏感文本数据的日益增多产生了对鲁棒去标识化方法的迫切需求，这些方法需在保持下游实用性的同时实现合规数据共享。本文提出DeID-Clinic，一个用于临床自由文本数据自动假名化和重识别风险评估的多层框架。我们的方法将领域自适应变换器模型（包括BioBERT和ClinicalBERT）集成到MASK去标识化框架中，以改进受保护健康信息（PHI）的检测和掩码。除了实体识别，我们引入了一个新颖的文档级风险评估模块，该模块结合k-匿名、l-多样性、t-接近度、上下文相似性和实体共现分析来量化残余重识别风险。在i2b2 2014去标识化数据集上进行的实验展示了强劲性能，多个实体类别的宏观F1分数超过0.96，同时能够对高风险文档进行定量优先级排序以便进一步审查。我们的结果突显了将神经去标识化与显式风险建模相结合的有效性，支持敏感领域的隐私保护数据共享。尽管在临床文本上评估，所提出的框架可推广到其他隐私关键领域，如法律和行政文档，其中可靠的假名化和风险感知匿名化至关重要。关键词：自动去标识化、风险评估、患者隐私、假名化、个人健康信息。

英文摘要

The increasing availability of sensitive textual data has created an urgent need for robust de-identification methods that enable compliant data sharing while preserving downstream utility. This paper presents DeID-Clinic, a multi-layered framework for automated pseudonymization and re-identification risk assessment of clinical free-text data. Our approach integrates domain-adapted transformer models, including BioBERT and ClinicalBERT, into the MASK de-identification framework to improve the detection and masking of protected health information (PHI). Beyond entity recognition, we introduce a novel document-level risk assessment module that quantifies residual re-identification risk using a combination of k-anonymity, l-diversity, t-closeness, contextual similarity, and entity co-occurrence analysis. Experiments conducted on the i2b2 2014 de-identification dataset demonstrate strong performance, achieving macro-level F1 scores above 0.96 for several entity categories, while enabling quantitative prioritization of high-risk documents for further review. Our results highlight the effectiveness of combining neural de-identification with explicit risk modeling, supporting privacy-preserving data sharing in sensitive domains. Although evaluated on clinical text, the proposed framework is generalizable to other privacy-critical domains such as legal and administrative documents, where reliable pseudonymization and risk-aware anonymization are essential. Keywords: Automated De-Identification, Risk Assessment, Patient Privacy, Pseudonymization, Personal Health Information.
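
Two of the risk measures named in the abstract, k-anonymity and l-diversity, can be illustrated with a minimal sketch. This is not the DeIDClinic implementation: the record layout, the field names (`age_band`, `city`, `diagnosis`), and the choice of quasi-identifiers are hypothetical, and the abstract does not say how the framework derives such fields from free text.

```python
from collections import defaultdict


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    group_sizes = defaultdict(int)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        group_sizes[key] += 1
    return min(group_sizes.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found within any group."""
    group_values = defaultdict(set)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        group_values[key].add(rec[sensitive])
    return min(len(values) for values in group_values.values())


# Hypothetical attributes left in three de-identified documents.
records = [
    {"age_band": "60-69", "city": "Leiden", "diagnosis": "diabetes"},
    {"age_band": "60-69", "city": "Leiden", "diagnosis": "asthma"},
    {"age_band": "30-39", "city": "Manchester", "diagnosis": "asthma"},
]
quasi = ["age_band", "city"]
k = k_anonymity(records, quasi)
l = l_diversity(records, quasi, "diagnosis")
```

A low k or l for a document's group would flag it for further review; how the framework combines these values with the other signals is not specified in the abstract.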


2409.19727 2026-05-26 cs.LG cs.CV

Investigating the Effect of Network Pruning on Performance and Interpretability

探究网络剪枝对性能与可解释性的影响

Jonathan von Rad, Florian Seuffert

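AI summary: Systematically applies unstructured pruning, structured pruning, and connection sparsity to GoogLeNet and studies the effect on ImageNet validation performance and interpretability; with sufficient retraining, performance approaches or even exceeds that of the original network, and interpretability scores show no significant relationship with pruning rate.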

Comments 4 pages, 6 figures

详情

AI中文摘要

深度神经网络（DNN）通常对其任务而言是过参数化的，可以通过移除权重进行大幅压缩，这一过程称为剪枝。我们研究了不同剪枝技术对GoogLeNet的分类性能和可解释性的影响。我们系统地应用非结构化剪枝、结构化剪枝以及连接稀疏性（输入权重剪枝）方法，并分析这些方法对网络在ImageNet验证集上性能的影响。我们还比较了不同的重训练策略，如迭代剪枝和一次性剪枝。我们发现，通过足够的重训练轮次，网络的性能可以接近默认GoogLeNet的性能——甚至在某些情况下超越它。为了评估可解释性，我们采用了Zimmermann等人开发的机制可解释性评分（MIS）。我们的实验表明，当使用MIS作为度量时，可解释性与剪枝率之间没有显著关系。此外，我们观察到，准确率极低的网络仍然可以获得高MIS分数，这表明MIS可能并不总是与可解释性的直观概念（例如理解正确决策的基础）一致。

英文摘要

Deep Neural Networks (DNNs) are often over-parameterized for their tasks and can be compressed quite drastically by removing weights, a process called pruning. We investigate the impact of different pruning techniques on the classification performance and interpretability of GoogLeNet. We systematically apply unstructured and structured pruning, as well as connection sparsity (pruning of input weights) methods to the network and analyze the outcomes regarding the network's performance on the validation set of ImageNet. We also compare different retraining strategies, such as iterative pruning and one-shot pruning. We find that with sufficient retraining epochs, the performance of the networks can approximate the performance of the default GoogLeNet, and even surpass it in some cases. To assess interpretability, we employ the Mechanistic Interpretability Score (MIS) developed by Zimmermann et al. Our experiments reveal that there is no significant relationship between interpretability and pruning rate when using MIS as a measure. Additionally, we observe that networks with extremely low accuracy can still achieve high MIS scores, suggesting that the MIS may not always align with intuitive notions of interpretability, such as understanding the basis of correct decisions.
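
The difference between one-shot and iterative unstructured pruning can be sketched on a flat weight list. This is a generic magnitude-pruning illustration, not the authors' code: the linear pruning schedule and the `retrain` callback are assumptions, and a real setup would operate on network tensors with actual retraining epochs between steps.

```python
def magnitude_prune(weights, rate):
    """Zero out the fraction `rate` of weights with the smallest absolute value."""
    n_prune = int(len(weights) * rate)
    # Indices in ascending order of magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def iterative_prune(weights, target_rate, steps, retrain=None):
    """Reach `target_rate` gradually, optionally retraining after each step.

    `retrain` is a hypothetical callback standing in for retraining epochs.
    """
    current = list(weights)
    for step in range(1, steps + 1):
        current = magnitude_prune(current, target_rate * step / steps)
        if retrain is not None:
            current = retrain(current)
    return current


weights = [0.5, -0.1, 0.3, -0.9]
one_shot = magnitude_prune(weights, 0.5)
gradual = iterative_prune(weights, 0.5, steps=2)
```

Without retraining the two strategies end at the same sparsity pattern here; they diverge only when the retraining step changes which weights are smallest before the next pruning round.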


2409.00346 2026-05-26 cs.CV

SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation

SMAFormer: 协同多注意力Transformer用于医学图像分割

Fuchen Zheng, Xuhang Chen, Weihuang Liu, Haolun Li, Yingtie Lei, Jiahui He, Chi-Man Pun, Shoujun Zhou

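AI summary: Proposes SMAFormer, a Transformer architecture that fuses pixel, channel, and spatial attention, using a Synergistic Multi-Attention block and a Feature Fusion Modulator to improve segmentation of small tumors and organs.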

Comments Accepted by IEEE BIBM 2024

详情

AI中文摘要

在医学图像分割中，专门的计算机视觉技术，特别是基于注意力机制的Transformer和采用跳跃连接的残差网络，在提升性能方面发挥了重要作用。然而，先前的模型在分割小且形状不规则的肿瘤时常常表现不佳。为此，我们引入了SMAFormer，一种高效的基于Transformer的架构，它融合了多种注意力机制以增强对小肿瘤和器官的分割。SMAFormer能够捕获医学图像分割的局部和全局特征。该架构包含两个关键组件。首先，提出了协同多注意力（SMA）Transformer块，它结合了像素注意力、通道注意力和空间注意力的优势以丰富特征。其次，针对注意力机制转换和特征融合过程中产生的信息丢失问题，我们设计了一个特征融合调制器。该模块通过减轻重塑引起的信息损失来增强通道注意力和空间注意力之间的整合。为了评估我们的方法，我们在各种医学图像分割任务上进行了广泛实验，包括多器官、肝脏肿瘤和膀胱肿瘤分割，取得了最先进的结果。代码和模型可在 https://github.com/lzeeorno/SMAFormer 获取。

英文摘要

In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture that fuses multiple attention mechanisms for enhanced segmentation of small tumors and organs. SMAFormer can capture both local and global features for medical image segmentation. The architecture comprises two pivotal components. First, a Synergistic Multi-Attention (SMA) Transformer block is proposed, which has the benefits of Pixel Attention, Channel Attention, and Spatial Attention for feature enrichment. Second, addressing the challenge of information loss incurred during attention mechanism transitions and feature fusion, we design a Feature Fusion Modulator. This module bolsters the integration between the channel and spatial attention by mitigating reshaping-induced information attrition. To evaluate our method, we conduct extensive experiments on various medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, achieving state-of-the-art results. Code and models are available at: https://github.com/lzeeorno/SMAFormer.


2407.00848 2026-05-26 cs.RO

EgoExo++: Integrating On-demand Exocentric Visuals with 2.5D Ground Surface Estimation for Interactive Teleoperation of Underwater ROVs

EgoExo++：结合按需外中心视觉与2.5D地面估计的水下ROV交互式遥操作

Adnan Abdullah, Ruo Chen, Ioannis Rekleitis, Md Jahidul Islam

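AI summary: To address the restricted field of view in underwater ROV teleoperation, proposes EgoExo++, which synthesizes exocentric views through geometry-driven visual SLAM and estimates a 2.5D ground surface on the fly, improving teleoperation performance and user experience.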

Comments EgoExo++ (Accepted in IJRR), V7/V3, metadata updated, 16 pages

详情

AI中文摘要

水下ROV（遥控潜水器）对于海底探索和任务执行不可或缺，但基于自我中心（第一人称）视频流的典型遥操作引擎限制了人类操作员的视野，并限制了在复杂、非结构化水下环境中的精确操控。为解决这一问题，我们首先提出EgoExo，一种集成到视觉SLAM流水线中的几何驱动解决方案，从自我中心摄像头馈送中按需合成外中心（第三人称）视图。我们进一步提出EgoExo++，它超越2D外中心视图合成（EgoExo），实时增强分段平面2.5D地面估计。其无锚点空中视角支持地面相对推理，如间隙和基于地形的导航标记跟随。所涉及的计算是闭式的，仅依赖于自我中心视图和单目SLAM估计，这使得它可移植到现有遥操作引擎，并对不同水体特性具有鲁棒性。我们通过2自由度室内导航和6自由度水下洞穴探索在挑战性低光条件下的广泛实验验证了方法的几何精度。为评估操作优势，我们进行了两项用户研究，分别使用模拟和真实数据，每项涉及15名参与者，比较基线自我中心遥操作和EgoExo++。结果表明，系统可用性（SUS）提高，感知工作负荷（NASA-TLX）降低，客观遥操作性能显著提升，包括任务速度提高16%，路径偏差比降低5倍，碰撞事件减少（试验中2次对比5次）。此外，我们强调了EgoExo++增强视觉在支持共享自主和具身遥操作中的作用。EgoExo++的源代码包可在https://github.com/uf-robopi/EgoExo获取。

英文摘要

Underwater ROVs (Remotely Operated Vehicles) are indispensable for subsea exploration and task execution, yet typical teleoperation engines based on egocentric (first-person) video feeds restrict human operators' field-of-view and limit precise maneuvering in complex, unstructured underwater environments. To address this, we first propose EgoExo, a geometry-driven solution integrated into a visual SLAM pipeline that synthesizes on-demand exocentric (third-person) views from egocentric camera feeds. We further propose EgoExo++, which extends beyond 2D exocentric view synthesis (EgoExo) to augment a piecewise planar 2.5D ground surface estimation on-the-fly. Its anchor-free aerial viewpoint supports ground-relative reasoning, such as clearance and terrain-based navigation marker following. The computations involved are closed-form and rely solely on egocentric views and monocular SLAM estimates, which makes it portable across existing teleoperation engines and robust to varying waterbody characteristics. We validate the geometric accuracy of our approach through extensive experiments of 2-DOF indoor navigation and 6-DOF underwater cave exploration in challenging low-light conditions. To assess operational benefits, we conduct two user studies with simulation and real-world data, each involving 15 participants, comparing baseline egocentric teleoperation and EgoExo++. Results indicate improved system usability (SUS), reduced perceived workload (NASA-TLX), and significant gains in objective teleoperation performance, including 16% faster missions, 5-fold reduction in path deviation ratio, and fewer collision events (2 vs. 5 across trials). Furthermore, we highlight the role of EgoExo++ augmented visuals in supporting shared autonomy and embodied teleoperation. The source packages for EgoExo++ are available at: https://github.com/uf-robopi/EgoExo.


2406.12179 2026-05-26 cs.CV

The Wisdom of a Crowd of Brains: A Universal Brain Encoder

一群大脑的智慧：通用大脑编码器

Roman Beliy, Navve Wasserman, Amit Zalcher, Michal Irani

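AI summary: Proposes a universal brain encoder built on a voxel-centric architecture that uses cross-attention to train jointly on fMRI data from multiple subjects, datasets, and machines, improving per-subject encoding and enabling fast transfer learning.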

详情

AI中文摘要

图像到fMRI编码对于神经科学研究和实际应用都很重要。然而，这种“大脑编码器”通常针对每个受试者和每个fMRI数据集进行训练，因此局限于非常有限的训练数据。在本文中，我们提出了一种通用大脑编码器，它可以联合训练来自许多不同受试者/数据集/机器的数据。实现这一点的关键是我们新的以体素为中心的编码器架构，该架构为每个大脑体素学习一个独特的“体素嵌入”。我们的编码器通过直接计算大脑体素嵌入与多级深度图像特征之间的交叉注意力，来训练预测每个大脑体素对每张图像的响应。这种以体素为中心的架构使得每个大脑体素的功能角色能够从体素-图像交叉注意力中自然涌现。我们展示了这种方法的能力：(i) 结合来自多个不同受试者（“一群大脑”）的数据以改善每个个体的大脑编码，(ii) 在受试者、数据集和机器（例如3特斯拉、7特斯拉）之间进行快速有效的迁移学习，仅需少量训练样本，(iii) 使用学习到的体素嵌入作为探索大脑功能（例如，大脑中编码了什么以及在哪里编码）的强大工具。

英文摘要

Image-to-fMRI encoding is important for both neuroscience research and practical applications. However, such "Brain-Encoders" have been typically trained per-subject and per fMRI-dataset, thus restricted to very limited training data. In this paper we propose a Universal Brain-Encoder, which can be trained jointly on data from many different subjects/datasets/machines. What makes this possible is our new voxel-centric Encoder architecture, which learns a unique "voxel-embedding" per brain-voxel. Our Encoder trains to predict the response of each brain-voxel on every image, by directly computing the cross-attention between the brain-voxel embedding and multi-level deep image features. This voxel-centric architecture allows the functional role of each brain-voxel to naturally emerge from the voxel-image cross-attention. We show the power of this approach to (i) combine data from multiple different subjects (a "Crowd of Brains") to improve each individual brain-encoding, (ii) quick & effective Transfer-Learning across subjects, datasets, and machines (e.g., 3-Tesla, 7-Tesla), with few training examples, and (iii) use the learned voxel-embeddings as a powerful tool to explore brain functionality (e.g., what is encoded where in the brain).


2403.06636 2026-05-26 cs.RO

Design, Control, and Motion Strategy for DELTA: Transformable Multilink Multirotor for Air-Ground Hybrid Locomotion and Manipulation

DELTA：可变形多连杆多旋翼飞行器的设计、控制与运动策略——用于空地混合运动与操作

Kazuki Sugihara, Moju Zhao, Takuzumi Nishio, Kei Okada, Masayuki Inaba

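AI summary: Proposes DELTA, a new multilink multirotor with thrusters on each link and joint actuation that performs terrestrial rolling, aerial flight, and manipulation in multiple environments, together with a real-time control method based on nonlinear optimization and motion strategies that account for contact constraints.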

Comments 20 pages, 31 figures

详情

AI中文摘要

近年来，多模态运动能力使机器人能够在陆地和空中领域机动。然而，大多数此类机器人仅设计用于运动，很少具备实际任务所需的操作能力。通过添加机械臂，地面机器人可以执行操作，一些带有机械臂的无人机已展示了空中操作能力。尽管如此，这类多旋翼无法直接用于地面操作，且这种配置本身不适合空地混合运动。这是因为其推进器集中式结构难以同时实现足够的操作自由度（DoF）以及带接触和变形的稳定运动。因此，在本工作中，我们开发了一种新型多连杆多旋翼机器人，每个连杆上装有推进器，并能够与环境接触。该机器人可以利用关节驱动，在多种环境中执行地面滚动运动、空中飞行运动以及操作。首先，我们介绍了所提出机器人的最小配置设计。我们还描述了运动学模型，并基于该模型提出了每个组件的设计。其次，我们提出了一种基于非线性优化的实时控制方法，该方法考虑了接触和关节运动，可应用于各种多旋翼。第三，我们提出了包含空地混合多连杆多旋翼特有接触约束的运动策略，并基于多接触模型分析了操作能力的局限性。最后，我们使用实现的样机展示了两个领域中的多种运动。据我们所知，这是多连杆多旋翼首次展示空地混合运动与操作。

英文摘要

In recent years, multimodal locomotion capabilities have enabled robots to maneuver in both terrestrial and aerial domains. However, most of these robots are designed only for locomotion, and few possess the manipulation capabilities required for practical tasks. By adding a manipulator, ground robots can perform manipulation, and some drones with robotic arms have demonstrated aerial manipulation. Nonetheless, such multirotors cannot be directly used for manipulation on the ground, and this configuration itself is unsuitable for air-ground hybrid locomotion. This is because their thruster-centralized structure makes it difficult to achieve both sufficient degrees of freedom (DoF) for manipulation and stable motion with contact and transformation. Therefore, in this work, we develop a new multilink multirotor with thrusters on each link and capable of contact with the environments. This robot can perform terrestrial rolling locomotion, aerial flight locomotion, and manipulation in multiple environments using joint actuation. First, we introduce a minimal configuration design of the proposed robot. We also describe a kinematic model and propose a design for each component based on this model. Second, we propose a real-time control method based on nonlinear optimization that considers contact and joint motion, which can be applied to various multirotors. Third, we propose motion strategies that include contact constraints specific to air-ground hybrid multilink multirotors, and analyze the limitations of manipulation capabilities based on multi-contact model. Finally, we demonstrate a variety of motions in both domains using the implemented prototype. To the best of our knowledge, this is the first demonstration of air-ground hybrid locomotion and manipulation by a multilink multirotor.


2311.15487 2026-05-26 cs.LG cs.AI math-ph math.MP math.OC stat.ML

Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning

全局 $\mathcal{L}^2$ 最小化：通过深度学习中的几何自适应梯度下降实现均匀指数速率

Thomas Chen

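AI summary: Exploits the arbitrariness of the choice of Riemannian metric in differential geometry to define two modified gradient descent flows (one for the overparametrized and one for the underparametrized setting); in the overparametrized case it proves that, when a rank condition holds, the flow drives the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential rate, and it generalizes the framework to the case where the rank condition fails.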

Comments AMS Latex, 21 pages. Typos corrected, references and comments added

详情

AI中文摘要

我们考虑深度学习网络中的监督学习场景，并利用黎曼度量选择的任意性（微分几何的一般事实）来定义梯度下降流。在标准的深度学习方法中，参数空间（权重和偏置）上的梯度流是相对于欧几里得度量定义的。而在这里，我们选择相对于深度学习网络输出层中的欧几里得度量的梯度流。这自然地在参数空间中诱导出两种改进的梯度下降流版本，一种适用于过参数化设置，另一种适用于欠参数化设置。在过参数化情况下，我们证明，只要秩条件成立，改进的梯度下降的所有轨道都以均匀指数收敛速率将 ${\mathcal L}^2$ 代价驱动到其全局最小值；因此，对于任何预先指定的接近全局最小值的程度，可以获得一个先验的停止时间。我们指出了后者与亚黎曼几何的关系。此外，我们将上述框架推广到秩条件不成立的情况；特别地，我们表明局部平衡只有在秩损失发生时才能存在，并且通常它们不是孤立点，而是参数空间中临界子流形的元素。

英文摘要

We consider the scenario of supervised learning in Deep Learning (DL) networks, and exploit the arbitrariness of choice in the Riemannian metric relative to which the gradient descent flow can be defined (a general fact of differential geometry). In the standard approach to DL, the gradient flow on the space of parameters (weights and biases) is defined with respect to the Euclidean metric. Here instead, we choose the gradient flow with respect to the Euclidean metric in the output layer of the DL network. This naturally induces two modified versions of the gradient descent flow in the parameter space, one adapted for the overparametrized setting, and the other for the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the ${\mathcal L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry. Moreover, we generalize the above framework to the situation in which the rank condition does not hold; in particular, we show that local equilibria can only exist if a rank loss occurs, and that generically, they are not isolated points, but elements of a critical submanifold of parameter space.
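
The exponential rate in the overparametrized case can be made concrete with a short worked derivation. This is a sketch under assumed notation, not the paper's exact formulation: $\theta \in \mathbb{R}^K$ collects the weights and biases, $x(\theta) \in \mathbb{R}^Q$ stacks the network outputs over the training data, $y$ is the target vector, $D = \partial x / \partial \theta$ is the Jacobian, and the cost is normalized as $\mathcal{C} = \tfrac12 |x(\theta) - y|^2$; the paper's normalization and constants may differ.

$$
\begin{aligned}
\text{standard flow:}\quad & \dot\theta = -\nabla_\theta \mathcal{C} = -D^T (x - y)
  && \Rightarrow\quad \dot x = D\dot\theta = -D D^T (x - y), \\
\text{adapted flow } (K \ge Q,\ \operatorname{rank} D = Q):\quad & \dot\theta = -D^T (D D^T)^{-1} (x - y)
  && \Rightarrow\quad \dot x = -(x - y), \\
\text{hence}\quad & x(t) - y = e^{-t}\,\bigl(x(0) - y\bigr),
  && \mathcal{C}(t) = e^{-2t}\,\mathcal{C}(0).
\end{aligned}
$$

Under these assumptions the output follows a Euclidean gradient flow in output space, the rate does not depend on the initial data, and any prescribed tolerance $\mathcal{C}(t) \le \varepsilon$ is reached once $t \ge \tfrac12 \ln\bigl(\mathcal{C}(0)/\varepsilon\bigr)$, which is the kind of a priori stopping time the abstract refers to. The rank condition is what makes $D D^T$ invertible.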


2306.02216 2026-05-26 cs.LG cs.CV

Forgettable Federated Linear Learning with Certified Data Unlearning

具有认证数据遗忘的可遗忘联邦线性学习

Ruinan Jin, Minghui Chen, Qiong Zhang, Xiaoxiao Li

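AI summary: Proposes a federated unlearning framework based on linear approximation of pre-trained models, using Federated Linear Training to achieve efficient, secure, and certified unlearning of client data.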

Comments IEEE Transactions on Neural Networks and Learning Systems

详情

DOI: 10.1109/TNNLS.2026.3683398
Journal ref: IEEE Transactions on Neural Networks and Learning Systems, Early Access, pp. 1-10, 2026

AI中文摘要

联邦学习（FL）能够在分布式客户端之间进行协作模型训练，同时保护用户隐私。最近，联邦遗忘（FU）的出现旨在解决“被遗忘权”问题，并在无需重新训练整个FL系统的情况下移除中毒或目标客户端的影响。然而，许多FU方法需要与保留或目标客户端通信，引入额外的安全风险，或存储历史模型，限制了其效率和实用性。此外，由于非线性模型及其训练动态的复杂性，大多数用于深度神经网络（DNN）的FU方法缺乏理论认证。在这项工作中，我们引入了可遗忘联邦线性学习，这是一个用于DNN的训练和遗忘框架。我们的方法使用预训练模型线性近似DNN，并通过联邦线性训练实现与原始网络相当的性能。我们进一步提出了一种经过认证、高效且安全的遗忘策略，使服务器能够在不进行额外客户端通信或存储的情况下移除目标客户端的影响。在从小型到大型数据集上使用卷积神经网络和现代基础模型进行的广泛实验表明，我们的方法在模型准确性和有效的目标客户端遗忘之间取得了平衡。这项工作为高效且可信的FU提供了一个实用的流程。代码：https://github.com/Nanboy-Ronan/2F2L-Federated-Unlearning

英文摘要

Federated Learning (FL) enables collaborative model training across distributed clients while preserving user privacy. Recently, Federated Unlearning (FU) has emerged to address the "right to be forgotten" and to remove the influence of poisoned or target clients without retraining the entire FL system. However, many FU methods require communication with retained or target clients, introduce additional security risks, or store historical models, limiting their efficiency and practicality. Moreover, most FU methods for deep neural networks (DNNs) lack theoretical certification due to the complexity of nonlinear models and their training dynamics. In this work, we introduce Forgettable Federated Linear Learning, a training and unlearning framework for DNNs. Our approach uses pre-trained models to linearly approximate DNNs and achieve performance comparable to the original networks through Federated Linear Training. We further present a certified, efficient, and secure unlearning strategy that enables the server to remove a target client's influence without additional client communication or storage. Extensive experiments on small- to large-scale datasets, using both convolutional neural networks and modern foundation models, show that our method balances model accuracy with effective target-client unlearning. This work provides a practical pipeline for efficient and trustworthy FU. Code: https://github.com/Nanboy-Ronan/2F2L-Federated-Unlearning


2105.01215 2026-05-26 cs.RO

Lidar Scan Registration Robust to Extreme Motions

对极端运动鲁棒的激光雷达扫描配准

Simon-Pierre Deschênes, Dominic Baril, Vladimír Kubelka, Philippe Giguère, François Pomerleau

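AI summary: To address registration failures caused by point cloud skewing under extreme motion, proposes a de-skewing method that accounts for trajectory motion uncertainty and environment geometry; with peak accelerations of 200 m/s^2 and 800 rad/s^2, it reduces translation error by 9.26% and rotation error by 21.84%.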

Comments 8 pages, 8 figures, published in 2021 18th Conference on Robots and Vision (CRV), Burnaby, Canada

详情

DOI: 10.1109/CRV52889.2021.00014
Journal ref: 2021 18th Conference on Robots and Vision (CRV), 2021, pp. 17-24

AI中文摘要

配准算法，如迭代最近点（ICP），在过去几十年中已被证明在移动机器人定位算法中有效。然而，当机器人承受极端速度和加速度时，它们容易失败。例如，这种运动可能在碰撞后发生，导致点云严重畸变。虽然过去已经探索了点云去畸变方法以提高定位和建图精度，但这些方法仍然依赖于高精度的里程计系统或理想的导航条件。在本文中，我们提出了一种方法，考虑了用于去畸变点云的轨迹的剩余运动不确定性以及环境几何，以提高当前配准算法的鲁棒性。我们在一个产生200 m/s^2和800 rad/s^2峰值加速度的3D地图测试台上将我们的方法与其他三种解决方案进行了比较。在这些极端场景中，我们证明了我们的方法将平移误差降低了9.26%，旋转误差降低了21.84%。所提出的方法具有足够的通用性，可以无需调整地集成到许多加权ICP的变体中，并支持在更恶劣地形中的定位鲁棒性。

英文摘要

Registration algorithms, such as Iterative Closest Point (ICP), have proven effective in mobile robot localization algorithms over the last decades. However, they are susceptible to failure when a robot sustains extreme velocities and accelerations. For example, this kind of motion can happen after a collision, causing a point cloud to be heavily skewed. While point cloud de-skewing methods have been explored in the past to increase localization and mapping accuracy, these methods still rely on highly accurate odometry systems or ideal navigation conditions. In this paper, we present a method taking into account the remaining motion uncertainties of the trajectory used to de-skew a point cloud along with the environment geometry to increase the robustness of current registration algorithms. We compare our method to three other solutions in a test bench producing 3D maps with peak accelerations of 200 m/s^2 and 800 rad/s^2. In these extreme scenarios, we demonstrate that our method decreases the error by 9.26 % in translation and by 21.84 % in rotation. The proposed method is generic enough to be integrated to many variants of weighted ICP without adaptation and supports localization robustness in harsher terrains.


2605.24524 2026-05-26 cs.LG cs.CL q-bio.NC

What Are We Actually Decoding? Source Attribution for Non-Invasive Brain-to-Language Retrieval

我们究竟在解码什么？非侵入式脑到语言检索的源归因

Xinyu Zhang, Sichao Liu, Runhao Lu, Alexandra Woolgar, Lihui Wang

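AI summary: Results in non-invasive neural language decoding can be inflated by sources that are not stimulus-evoked (decoder priors, embedding-based metrics, signal duration, and others). The paper proposes an auditing framework that separates performance into three sources (structural shortcuts, window-level stimulus-locked evidence, and cross-window contextual aggregation) and introduces Group Context Bias (GCB) as a controlled source-attribution intervention, so that performance is source-attributed rather than merely reported.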

Comments 35 pages, 7 figures, 25 tables

详情

AI中文摘要

在非侵入式神经语言解码中，结果可能被非刺激诱发的神经证据源膨胀：解码器先验、基于嵌入的度量以及非神经结构干扰（如信号时长）。因此，方法学挑战在于归因：当报告的性能提升可以追溯到特定源时，它才更具信息性。我们将刺激锁定的MEG到音频检索重新构建为一个审计框架，将表观性能分离为三个源——结构捷径、窗口级刺激锁定证据和跨窗口上下文聚合——并为每个源提供诊断。在变长解码下，信号盲的高斯噪声达到66.3%的Rank@1（R@1），但一旦强制执行固定时长窗口和刺激身份分割，其性能骤降至接近随机，从而隔离了结构泄漏。在这些控制下，固定窗口检索恢复了可测量的MEG-音频可区分性，而一个神谕句子桶诊断显示，95.7%的Top-1错误选择了错误的句子，将剩余瓶颈定位到句子级竞争。我们使用组上下文偏差（GCB）审计这一上下文源，这是一种推理时的加性logit偏差，它跨窗口汇集句子一致的证据，同时保持基础检索分数和候选池固定。作为分数空间干预，GCB使上下文源变得可测量：在相同固定设置下，Gwilliams上的R@1从44%变为52%，MOUS上从22%变为29%。在此设计下，GCB是可审计的：其效应在随机分组扰动下崩溃，并在局部证据在MEG中衰减或在EEG中接近随机时消失，支持其作为受控源归因干预的使用。这些结果表明，脑到语言性能应进行源归因，而不仅仅是报告。

英文摘要

In non-invasive neural language decoding, results can be inflated by sources that are not stimulus-evoked neural evidence: decoder priors, embedding-based metrics, and non-neural structural nuisances such as signal duration. The methodological challenge is therefore attribution: a reported gain is more informative when it can be traced to a specific source. We recast stimulus-locked MEG-to-audio retrieval as an auditing framework that separates apparent performance into three sources - structural shortcuts, window-level stimulus-locked evidence, and cross-window contextual aggregation - and provides a diagnostic for each. Signal-blind Gaussian noise reaches 66.3% Rank@1 (R@1) under variable-length decoding but collapses to near chance once fixed-duration windows and stimulus-identity splits are enforced, isolating structural leakage. Under these controls, fixed-window retrieval recovers measurable MEG-audio discriminability, while an oracle sentence-bucket diagnostic shows that 95.7% of Top-1 errors select the wrong sentence, localising the residual bottleneck to sentence-level competition. We audit this contextual source with Group Context Bias (GCB), an inference-time additive logit bias that pools sentence-consistent evidence across windows while leaving the base retrieval scores and candidate pool fixed. Used as a score-space intervention, GCB makes the contextual source measurable: R@1 shifts from 44% to 52% on Gwilliams and from 22% to 29% on MOUS under the same fixed setting. GCB is auditable under this design: its effect collapses under random-grouping perturbations and vanishes when local evidence is attenuated in MEG or is near chance in EEG, supporting its use as a controlled source-attribution intervention. These results suggest that brain-to-language performance should be source-attributed, not merely reported.
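
The idea of an inference-time additive bias that pools sentence-consistent evidence across windows can be sketched as follows. This is only an illustration of the general mechanism, not the paper's GCB definition: the pooling rule (best score per sentence, averaged over windows in the same group), the weight `beta`, and the way windows are assigned to groups are all assumptions. The base scores and the candidate pool are left unchanged, as the abstract requires.

```python
def group_context_bias(scores, window_group, candidate_sentence, beta=1.0):
    """Add a pooled per-sentence bias to each window's retrieval scores.

    scores[w][c]          base score of candidate c for window w (not modified)
    window_group[w]       hypothetical group id shared by windows of one sentence
    candidate_sentence[c] sentence id that candidate c belongs to
    """
    sentences = sorted(set(candidate_sentence))

    def sentence_evidence(row):
        # Assumed pooling rule: best base score among a sentence's candidates.
        return {
            s: max(sc for sc, cs in zip(row, candidate_sentence) if cs == s)
            for s in sentences
        }

    evidence = [sentence_evidence(row) for row in scores]
    biased = []
    for w, row in enumerate(scores):
        peers = [v for v in range(len(scores)) if window_group[v] == window_group[w]]
        pooled = {s: sum(evidence[v][s] for v in peers) / len(peers) for s in sentences}
        biased.append([sc + beta * pooled[cs] for sc, cs in zip(row, candidate_sentence)])
    return biased


def argmax(row):
    return max(range(len(row)), key=lambda i: row[i])


# Two windows from the same group; candidates 0-1 belong to sentence 0, 2-3 to sentence 1.
scores = [[0.9, 0.1, 0.2, 0.3], [0.1, 0.4, 0.2, 0.5]]
biased = group_context_bias(scores, window_group=[0, 0], candidate_sentence=[0, 0, 1, 1])
```

In this toy case the second window's top choice moves from a candidate in the wrong sentence to one in the sentence supported by its neighbour, which is the kind of effect the random-grouping control in the abstract is designed to test.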


2605.24523 2026-05-26 cs.LG cs.CL q-bio.NC

MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding

MindAlign: 弥合脑电图、视觉和语言实现零样本视觉解码

Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar, Xi Vincent Wang, Lihui Wang

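AI summary: Proposes MindAlign, a tri-modal contrastive learning framework that aligns EEG, image, and text representations and reaches 54.1% Top-1 and 83.4% Top-5 accuracy on the Things-EEG2 zero-shot benchmark, substantially exceeding prior methods.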

Comments 20 pages, 10 figures, 15 tables

详情

AI中文摘要

从大脑信号进行视觉解码是计算机视觉和神经科学交叉领域的关键挑战，需要连接神经表征和视觉计算模型的方法。我们提出了一种基于脑电图的视觉解码三模态对比框架，在统一潜在空间中对齐脑电图、视觉和文本表示。我们的方法采用两阶段设计。首先，我们通过无标签试次上的掩码重建预训练脑电图编码器，学习可稳健迁移到下游任务的时空规律。其次，我们通过对比学习联合对齐脑电图、图像和大语言模型生成的文本描述，其中文本监督作为语义正则化器，向共享空间注入语言结构，而不压倒主要的脑电图-图像信号。编码器集成了被试自适应、通道上的图注意力和时空卷积嵌入。在Things-EEG2 200路零样本基准上，我们的框架实现了54.1%的Top-1和83.4%的Top-5准确率，大幅超过最强先前基线（32.4%/64.0%），配对Wilcoxon检验证实所有被试内基线的显著性（p<0.01）。我们在Things-MEG上验证了泛化性。分析表明，紧凑的嵌入几何（CN-CLIP）优于更大的骨干网络，且解码与视觉处理的既定神经生理学一致。这项工作是从非侵入性时间神经信号进行稳健、语义基础视觉解码的关键一步。源代码公开于https://github.com/anon-eeg/eeg_image_decoding。

英文摘要

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri-modal contrastive framework for EEG-based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two-stage design. First, we pre-train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio-temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM-generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG-image signal. The encoder integrates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings. On the Things-EEG2 200-way zero-shot benchmark, our framework achieves 54.1% Top-1 and 83.4% Top-5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in-subject baselines. We validate generalization on Things-MEG. Analysis reveals that compact embedding geometries (CN-CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically-grounded visual decoding from non-invasive temporal neural signals. The source code is publicly available at https://github.com/anon-eeg/eeg_image_decoding.
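
The second stage, contrastive alignment with text as a down-weighted regularizer, can be sketched with a plain InfoNCE loss. This is a generic illustration rather than the MindAlign objective: the symmetric form, the temperature of 0.07, and the `text_weight` value are assumptions, and the embeddings below are toy vectors rather than encoder outputs.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] against all targets."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)


def tri_modal_loss(eeg, image, text, text_weight=0.1):
    """EEG-image alignment plus a smaller EEG-text term (weight is hypothetical)."""
    eeg_image = 0.5 * (info_nce(eeg, image) + info_nce(image, eeg))
    eeg_text = 0.5 * (info_nce(eeg, text) + info_nce(text, eeg))
    return eeg_image + text_weight * eeg_text


eeg = [[1.0, 0.0], [0.0, 1.0]]
aligned = tri_modal_loss(eeg, image=[[1.0, 0.0], [0.0, 1.0]], text=[[1.0, 0.0], [0.0, 1.0]])
swapped = tri_modal_loss(eeg, image=[[0.0, 1.0], [1.0, 0.0]], text=[[1.0, 0.0], [0.0, 1.0]])
```

Keeping `text_weight` small mirrors the abstract's point that text supervision should regularize the shared space without overwhelming the EEG-image signal.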


2605.24518 2026-05-26 cs.CL cs.AI

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

语法引导的稀疏注意力：高效且可解释的Transformer

Spandan Pratyush

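AI summary: Proposes Grammatically-Guided Sparse Attention, which dynamically generates attention masks from part-of-speech tags to reduce theoretical computational overhead while maintaining accuracy.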

Comments 9 pages, 2 tables Code available at https://github.com/toughthinktank/grammatically_guided_attention#

详情

AI中文摘要

Transformer模型中自注意力的二次复杂度仍然是处理长序列和高效部署大型语言模型的主要瓶颈。为此，已有大量关于稀疏注意力的研究，Deepseek稀疏注意力结合了多种创建令牌片段的方法以降低时间复杂度。本文提出了一种新颖的方法——语法引导的稀疏注意力，它基于令牌的语法角色约束注意力计算。通过利用词性（POS）标签，动态生成注意力掩码，强制令牌之间建立语言上连贯的连接，从而在不牺牲必要语言依赖性的情况下减少计算图。提出并评估了两种掩码策略：硬掩码严格只允许预定义的语法交互，软掩码则将注意力偏向这些交互。使用类似DistilBERT的架构在SST-2情感分类任务上进行的实验表明，语法引导的稀疏注意力在保持与全注意力相当的精度的同时，显著降低了理论计算开销。初步结果显示，硬掩码的准确率为0.8200，软掩码为0.8165，与全注意力的0.8200非常接近，为构建更高效、可解释且具有语言知识的Transformer架构提供了途径。

英文摘要

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. To address this, there has been significant research into sparse attention; DeepSeek Sparse Attention, for example, combines various methods of creating segments of tokens to reduce the time complexity. This paper introduces a novel approach, Grammatically-Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. By leveraging Parts-of-Speech (POS) tags, attention masks are dynamically generated that enforce linguistically coherent connections between tokens, reducing the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. The experiments, conducted on the SST-2 sentiment classification task using a DistilBERT-like architecture, demonstrate that Grammatically-Guided Sparse Attention maintains comparable accuracy to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracy values of 0.8200 for hard masking and 0.8165 for soft masking, closely matching the 0.8200 of full attention, providing a path towards more efficient, interpretable, and linguistically-informed Transformer architectures.
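
The two masking strategies can be sketched on a single attention head. This is an illustration only: the table of allowed POS pairs, the symmetric treatment of pairs, and the soft-mask penalty of 2.0 are hypothetical, since the abstract does not list the predefined grammatical interactions. The sketch is also dense, so it shows the effect on the attention weights rather than any compute saving, which would require skipping masked pairs altogether.

```python
import math

# Hypothetical table of permitted (POS, POS) interactions.
ALLOWED_PAIRS = {("NOUN", "ADJ"), ("NOUN", "DET"), ("VERB", "NOUN"), ("VERB", "ADV")}


def is_allowed(i, j, tags):
    """A token may always attend to itself; other pairs must be in the table."""
    if i == j:
        return True
    pair = (tags[i], tags[j])
    return pair in ALLOWED_PAIRS or pair[::-1] in ALLOWED_PAIRS


def attention_weights(scores, tags, mode="hard", penalty=2.0):
    """Row-wise softmax of scores plus a POS-derived mask.

    mode="hard": disallowed pairs get -inf (zero weight).
    mode="soft": disallowed pairs are only penalized by `penalty`.
    """
    blocked = -math.inf if mode == "hard" else -penalty
    weights = []
    for i, row in enumerate(scores):
        logits = [s + (0.0 if is_allowed(i, j, tags) else blocked) for j, s in enumerate(row)]
        m = max(logits)  # finite, because the diagonal is always allowed
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


tags = ["DET", "ADJ", "NOUN", "VERB"]
scores = [[0.0] * 4 for _ in range(4)]
hard = attention_weights(scores, tags, mode="hard")
soft = attention_weights(scores, tags, mode="soft")
```

With uniform scores, the determiner's row under the hard mask puts all weight on itself and the noun, while the soft mask keeps a reduced but nonzero weight on the adjective and verb.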


2605.24517 2026-05-26 cs.LG cs.CL

ECHO: Terminal Agents Learn World Models for Free

ECHO: 终端代理免费学习世界模型

Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, Dimitris Papailiopoulos

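AI summary: Proposes the ECHO hybrid objective, which turns terminal feedback into a dense supervision signal by predicting environment observation tokens, substantially improving CLI agent performance on TerminalBench-2.0.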

详情

AI中文摘要

CLI代理是语言模型最接近具身环境的设置：模型发出命令，终端执行它们，返回的流——stdout、错误、文件、日志和跟踪——记录了后果。我们认为这个流是一个监督信号，但标准的代理强化学习丢弃了它：GRPO风格的训练使用稀疏的结果级奖励更新动作令牌，而忽略了rollout中已有的环境响应。失败的rollout尽管包含关于环境如何响应的丰富证据，但提供的策略梯度信号很少。我们引入了ECHO（环境交叉熵混合目标），这是一种混合目标，它将动作令牌上的标准策略梯度损失与辅助损失相结合，该辅助损失训练策略预测其自身动作产生的环境观测令牌。ECHO重用与GRPO相同的前向传播，不需要额外的rollout，并将终端反馈转化为所有rollout的密集监督。ECHO在TerminalBench-2.0上将GRPO的pass@1翻倍：Qwen3-8B从2.70%提升到5.17%，Qwen3-14B从5.17%提升到10.79%。ECHO还产生了更好地预测终端动态的策略，即使是在它们未生成的轨迹上：在保留的rollout中，它显著降低了环境令牌的交叉熵，而单独的GRPO几乎没有改变。从基础Qwen3-8B开始，ECHO在没有专家演示的情况下，在保留的终端任务上匹配了专家SFT然后GRPO的性能，并在TerminalBench-2.0上恢复了专家SFT初始化收益的大约一半。在某些设置中，仅环境预测损失就能实现无验证器的自我改进，使策略仅通过与环境交互就能在未见过的OOD任务上改进。这些结果表明，环境观测不仅是未来动作的上下文，而且是每个rollout中已经存在的密集、在策略的监督信号。

英文摘要

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream -- stdout, errors, files, logs, and traces -- records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
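
The structure of the hybrid objective can be sketched for a single rollout. This is a simplified illustration, not the ECHO implementation: the per-token normalization, the weight `env_weight`, and the use of a single scalar advantage per rollout are assumptions, and the log-probabilities below are made-up numbers standing in for one forward pass over action and observation tokens.

```python
def echo_style_loss(token_logprobs, is_action, advantage, env_weight=0.1):
    """Policy-gradient loss on action tokens plus cross-entropy on observation tokens.

    token_logprobs  log-probability the policy assigns to each token in the rollout
    is_action       True for tokens the model emitted, False for terminal output
    advantage       scalar advantage for the rollout (for example group-relative)
    env_weight      hypothetical weight of the auxiliary environment-prediction loss
    """
    action_lp = [lp for lp, a in zip(token_logprobs, is_action) if a]
    obs_lp = [lp for lp, a in zip(token_logprobs, is_action) if not a]
    policy_loss = -advantage * sum(action_lp) / max(len(action_lp), 1)
    env_loss = -sum(obs_lp) / max(len(obs_lp), 1)
    return policy_loss + env_weight * env_loss, policy_loss, env_loss


# A rollout with zero advantage, e.g. every rollout in its group failed.
logprobs = [-1.0, -2.0, -0.5, -1.5]
roles = [True, False, True, False]
total, policy_loss, env_loss = echo_style_loss(logprobs, roles, advantage=0.0)
```

With zero advantage the policy-gradient term vanishes, but the environment-prediction term still yields a gradient, which is the sense in which failed rollouts continue to provide supervision.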
