arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2602.05833 2026-06-08 cs.LG version update

SecretFan: Synthesizing Realistic Data without Breaking Privacy

Laura Plein, Alexi Turcotte, Arina Hallemans, Andreas Zeller

Affiliations: CISPA Helmholtz Center for Information Security; Saarland University

AI summary: Frames synthetic data generation as a guided test generation problem, combining a GAN-style discriminator with a fuzzer-based generator to produce high-utility synthetic data while preserving privacy.

Abstract

There is a need for synthetic training and test datasets that replicate the statistical distributions of original datasets without compromising their confidentiality. Much research has leveraged Generative Adversarial Networks (GANs) for synthetic data generation; however, the resulting models are either not accurate enough or are still vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks, since the original data has been leveraged in the training process. In this paper, we frame synthetic data generation as a guided test generation, or search-based testing, problem rather than a purely generative modeling task. Ours is a search-based, adequacy-guided input generation technique inspired by GANs, with a generation step and a discrimination step; as in a GAN, discrimination uses a discriminator model trained on the data, but instead of also using a model for generation, we use a fuzzer. This way, the original (private) data is only indirectly leveraged in the generation process, and by evolving samples and determining "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions as the original dataset, leading to a similar utility as the original data. We evaluated our approach on eight datasets that have been used to evaluate the state-of-the-art techniques, finding that synthetic data generated with our technique achieves good utility on average while also having good similarity scores, highlighting the potential of a mixed approach leveraging classical generation and model-driven discrimination for generating privacy-preserving, useful synthetic datasets.
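
The generate-then-discriminate loop described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the mutation operator, the `discriminator_score` stand-in (a fixed function of the distance to an assumed target value rather than a trained model), and all names and parameters are assumptions made for illustration.

```python
import random

def discriminator_score(sample, target_mean=5.0):
    """Hypothetical stand-in for a trained discriminator: higher means 'more realistic'."""
    return -abs(sample - target_mean)

def mutate(sample, rng, step=1.0):
    """Fuzzer-style mutation: perturb a sample by a small random amount."""
    return sample + rng.uniform(-step, step)

def evolve(pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    # Start from random inputs; the private data is never read by the generator.
    population = [rng.uniform(-10.0, 20.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Generation step: mutate every current sample.
        candidates = population + [mutate(s, rng) for s in population]
        # Discrimination step: keep the candidates the discriminator rates highest.
        candidates.sort(key=discriminator_score, reverse=True)
        population = candidates[:pop_size]
    return population

synthetic = evolve()
```

In this toy version the population simply drifts toward the assumed target value; a real discriminator would score realism against the full data distribution, which is what would let the evolved samples reproduce that distribution rather than a single point.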

2602.02819 2026-06-08 cs.LG stat.ML version update

Causal Evaluation of Membership Inference Attacks

Mathieu Even, Clément Berenfeld, Linus Bleistein, Tudor Cebere, Julie Josse, Aurélien Bellet

Affiliations: Inria; PreMeDICaL, Inserm, Montpellier, France; School of Computer and Communication Science (EPFL); School of Life Sciences (EPFL); Lausanne, Switzerland

AI summary: Treats the evaluation of membership inference attacks as a causal inference problem, defines memorization as the causal effect of including a data point, and proposes and validates practical estimators for the multi-run, one-run, and zero-run settings.

Comments
Fixed ref label problems

Abstract

Membership Inference Attacks (MIAs) aim to distinguish training points (members) from unseen data (non-members), and are widely used to quantify memorization and assess privacy risks. Standard MIA evaluation requires repeated retraining, which is computationally costly for large models. One-run (single training with randomized data inclusion) and zero-run (post hoc evaluation) methods are often used instead, but their statistical validity remains unclear. We address this gap by framing MIA evaluation as a causal inference problem, defining memorization as the causal effect of including a data point in the training set. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations are additionally confounded by distribution shift between member and non-member evaluation data. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. We validate our approach in several settings, including pretrained and fine-tuned LLMs, showing that it enables reliable measurement of MIA performance without retraining and under distribution shift. Overall, our framework provides a principled foundation for privacy evaluation in modern AI systems.
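
To make "memorization as the causal effect of including a data point" concrete, here is a minimal sketch of a difference-in-means estimator in the multi-run setting. It is a textbook causal-inference estimator chosen for illustration, not necessarily the estimator the paper proposes; the function name, the input layout, and the example scores are all assumptions.

```python
from statistics import mean

def memorization_effect(runs):
    """Difference in mean attack score between runs with and without the point.

    `runs` is a list of (included, score) pairs, one per independent training run,
    where `included` says whether the data point was in that run's training set.
    """
    treated = [score for included, score in runs if included]
    control = [score for included, score in runs if not included]
    if not treated or not control:
        raise ValueError("need at least one run with and one run without the point")
    return mean(treated) - mean(control)

# Hypothetical attack scores for one data point across four retrainings.
runs = [(True, 0.9), (False, 0.4), (True, 0.7), (False, 0.2)]
effect = memorization_effect(runs)
```

Because inclusion is randomized independently per run, the two groups are comparable; the one-run and zero-run settings discussed in the abstract lack this property, which is where interference and distribution shift enter.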

2602.02014 2026-06-08 cs.CV cs.AI cs.CL cs.LG version update

Rethinking Genomic Modeling Through Optical Character Recognition

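Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes OpticalDNA, a framework that renders DNA into visual layouts and uses a vision-language model for OCR-style genomic understanding, achieving high-fidelity compression and efficient processing of long sequences; on sequences of up to 450k bases it outperforms baseline models with nearly 20× fewer effective tokens.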

Comments
Accepted by ICML 2026

Abstract

Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20× fewer effective tokens, and surpasses models with up to 985× more activated parameters while tuning only 256k trainable parameters.

2602.00541 2026-06-08 cs.LG version update

One Loss to Rule Them All: Marked Time-to-Event for Structured EHR Foundation Models

Zilin Jing, Vincent Jeanselme, Yuta Kobayashi, Simon A. Lee, Chao Pang, Aparajita Kashyap, Yanwei Li, Xinzhuo Jiang, Shalmali Joshi

Affiliations: Department of Computer Science, Columbia University; Department of Biomedical Informatics, Columbia University; Department of Computational Medicine, UCLA; Formation Bio

AI summary: Proposes ORA, a pretraining objective that jointly models event timing and associated measurements; compared with next-token prediction and losses that ignore continuous measurements, it yields more generalizable representations across multiple datasets and downstream tasks and improves regression and time-to-event prediction.

Abstract

Clinical events captured in Electronic Health Records (EHR) are irregularly sampled and may consist of a mixture of discrete events and numerical measurements, such as laboratory values or treatment dosages. The sequential nature of EHR, analogous to natural language, has motivated the use of next-token prediction to train prior EHR Foundation Models (FMs) over events. However, this training fails to capture the full structure of EHR. The timing of a given event must be captured, but the event value (e.g., an abnormal lab result) also modulates the likelihood of other clinical events. Most existing EHR FMs do not jointly model this likelihood and are unable to capture the full observation process, impacting downstream capabilities. We propose ORA, a marked time-to-event pretraining objective that jointly models event timing and associated measurements. Across multiple datasets, downstream tasks, and model backbones, this objective consistently yields more generalizable representations than next-token prediction and pretraining losses that ignore continuous measurements. Importantly, the proposed objective yields improvements beyond traditional classification evaluation, including better regression and time-to-event prediction. Beyond introducing a new family of FMs, our ablations suggest a broader takeaway: pretraining objectives that account for EHR structure are critical for expanding downstream capabilities and generalizability.

2602.00471 2026-06-08 cs.AI cs.CV version update

Dual Latent Memory for Visual Multi-agent System

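Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan

Affiliations: arXiv.org; University of Science and Technology of China

AI summary: Proposes the L²-VMAS framework, which decouples perception from thinking via dual latent memories and adopts an entropy-driven proactive triggering mechanism, breaking the "scaling wall" of visual multi-agent systems and substantially reducing token consumption while improving accuracy.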

Abstract

While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L²-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering mechanism that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%.
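
One way to picture an entropy-driven trigger is a rule that consults memory only when the agent's predictive distribution is uncertain. The sketch below is a guess at the general idea, not the mechanism from the paper: the function names, the use of Shannon entropy over a next-token distribution, and the threshold value are all assumptions.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_access_memory(probs, threshold=1.0):
    """Hypothetical trigger: read latent memory only when uncertainty is high."""
    return shannon_entropy(probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution, low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform distribution, entropy ln(4)

print(should_access_memory(confident), should_access_memory(uncertain))
```

Gating memory reads this way is what would make access on-demand rather than passive: confident steps skip the read entirely, so its cost is only paid where it is likely to change the output.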

2602.00163 2026-06-08 cs.CV q-bio.NC version update

Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders

Laura Cif, Diane Demailly, Gabriella A. Horvàth, Juan Dario Ortigoza Escobar, Nathalie Dorison, Mayté Castro Jiménez, Cécile A. Hubsch, Thomas Wirth, Gun-Marie Hariz, Sophie Huby, Morgan Dornadic, Zohra Souei, Muhammad Mushhood Ur Rehman, Simone Hemm, Mehdi Boulayme, Eduardo M. Moraud, Jocelyne Bloch, Xavier Vasques

Affiliations: Lausanne University Hospital (CHUV) and University of Lausanne (UNIL); Institut du Neurone; Department of Neurology, Clinique Beau Soleil, Institut Mutualiste Montpelliérain; Department of Pediatrics, British Columbia Children’s Hospital; Movement Disorders Unit, Pediatric Neurology Department, Institut de Recerca, Hospital Sant Joan de Déu; European Reference Network for Rare Neurological Diseases (ERN-RND); U-703 Centre for Biomedical Research on Rare Diseases (CIBER-ER), Instituto de Salud Carlos III; Pediatric Neurosurgery Department, CCMR Neurogenetique, European Reference Network Brainteam Member, Rothschild Foundation Hospital; Department of Neurology, University Hospital of Strasbourg; Strasbourg Neuroscience Institute, Strasbourg University; Institute of Genetics and Cellular biology

AI summary: To address the subjectivity and phenotypic overlap in the clinical recognition of hyperkinetic movement disorders (HMDs), proposes a pose-based machine-learning framework that extracts keypoint time series from routine clinical videos and computes multi-dimensional kinematic features for multi-label classification.

Abstract

Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.

2601.23204 2026-06-08 cs.AI version update

TSAQA: Time Series Analysis Question And Answering Benchmark

Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Yuchen Yan, Dongqi Fu, Jingchao Ni, Jingrui He, Hanghang Tong

Affiliations: University of Illinois at Urbana-Champaign; Virginia Polytechnic Institute and State University; Amazon; Meta AI; University of Houston

AI summary: Proposes the TSAQA benchmark, covering six time series analysis tasks (including a novel PZ format) and evaluating LLMs on 210k samples across 13 domains; the best model scores only 65.08.

Comments
35 pages, 7 figures. Accepted to the GEM Workshop at ACL 2026

Abstract

Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance, the best-performing open-source model, LLaMA-3.1-8B, still shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.

2601.22574 2026-06-08 cs.CV cs.AI version update

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

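Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Wenbin Xing, Han Bao, Zonghui Wang, Wenzhi Chen

Affiliations: Zhejiang University; University of Toronto; Dalian University of Technology; Sun Yat-sen University

AI summary: Proposes ViSSRes, which uses a lightweight MLP network to learn residuals for video representations and optimizes them for spatiotemporal and semantic consistency; it needs only a single forward pass at inference time, effectively lowering the hallucination rate and improving video understanding performance.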

Comments
Preprint

Abstract

Although Video Large Multimodal Models have achieved strong performance in video understanding, they still suffer from hallucination. Existing inference-time intervention methods usually modify videos under the contrastive decoding framework, but their heuristic designs bring limited improvements and increase inference latency. To address these issues, we propose ViSSRes, an inference-time intervention method that enhances video representations through a lightweight MLP-style network. Specifically, we use a contrastive random walk approach to characterize the spatiotemporal consistency of video representations, and introduce conditional mutual information to associate video representations with the model's semantic understanding. With the model backbone kept frozen, ViSSRes learns residuals for video representations and optimizes them from both spatiotemporal and semantic consistency perspectives. During inference, ViSSRes requires only a single forward pass and introduces no substantial additional inference cost. Experiments show that ViSSRes reduces the hallucination rate of LLaVA-NeXT-Video on EventHallusion by 40.69% and improves video understanding on MMVU by 18.36% under the CoT setting, demonstrating its effectiveness in mitigating hallucinations.

2512.05291 2026-06-08 cs.LG version update

SHAP-Guided Kernel Actor-Critic for Explainable Reinforcement Learning

Na Li, Hangguan Shan, Wei Ni, Wenjie Zhang, Xinyu Li

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes the RSA2C algorithm, which computes state attributions with RKHS-SHAP and uses Mahalanobis-gated weights to modulate Actor gradients and Advantage Critic targets, achieving efficient, stable, and explainable reinforcement learning.

Journal ref
ICML 2026

Abstract

Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use state attributions to assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We propose RKHS-SHAP-based Advanced Actor-Critic (RSA2C), an attribution-aware, kernelized, two-timescale AC algorithm, including Actor, Value Critic, and Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-enhanced components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS-SHAP (kernel mean embedding for on-manifold and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets. We derive a global, non-asymptotic convergence bound under state perturbations, showing stability through the perturbation-error term and efficiency through the convergence-error term. Empirical results on three continuous-control environments show that RSA2C achieves efficiency, stability, and interpretability. Our code is available at https://github.com/Na-Li66/RSA2C.

2505.15998 2026-06-08 cs.AI version update

Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

Thomas Michel, Marko Cvjetko, Gautier Hamon, Pierre-Yves Oudeyer, Clément Moulin-Frier

Affiliations: Univ. Lille, Inria, CNRS, Centrale Lille, CRIStAL, France; Inria Center at the University of Bordeaux, France; Inria, INSA Lyon, CITI, UR3720, 69621 Villeurbanne, France

AI summary: Proposes a curiosity-driven AI scientist method that uses Intrinsically Motivated Goal Exploration Processes (IMGEPs) to discover system-level dynamics in Flow-Lenia, reveals self-organized behaviors resembling biological phenomena, and shows that large-scale diversity search can serve as a framework for designing subsequent experiments.

Journal ref
Proceedings of the Artificial Life Conference 2025, pp. 633-643
Comments
Extended version of the paper first published at ALife 2025. Project webpage: https://developmentalsystems.org/Exploring-Flow-Lenia-Universes/ 24 pages, 16 figures

Abstract

We present a curiosity-driven AI scientist method for discovering system-level dynamics in Flow-Lenia, a continuous cellular automaton (CA) with mass conservation and parameter localization. Building on prior work that uses diversity search in Lenia to find individual self-organized patterns, we adapt Intrinsically Motivated Goal Exploration Processes (IMGEPs) to large environments of interacting patterns, using simulation-wide metrics such as evolutionary activity, compression ratio, and multi-scale matter distribution. We apply IMGEP in two exploration experiments: one targeting ecosystem-level dynamics, the other matter movement through obstacle-laden environments. In both, IMGEP illuminates significantly more of the metric space than random search and reveals self-organized behaviors qualitatively resembling many biological phenomena. Leveraging the resulting archive, we then run a scaling study across six spatial scales and seven time horizons, uncovering macro-scale organization with no analogue at the base scale and characterizing how goal-space metrics behave at scale. This illustrates a strength of our approach: a relatively cheap large-scale diversity search can act as a principled scaffold for designing subsequent, more expensive experiments, enabling an iterative loop of experiment design, inspection, and redesign, supported by an interactive exploration tool that keeps scientists in the loop. Though demonstrated with Flow-Lenia, this approach potentially applies to other parameterizable complex systems where studying bottom-up collective behavior is of interest.

2512.09084 2026-06-08 cs.LG version update

GS-KAN: Parameter-Efficient Kolmogorov-Arnold Networks via Sprecher-Type Shared Basis Functions

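Oscar Eliasson

Affiliations: Chalmers University of Technology

AI summary: Proposes GS-KAN, which constructs edge functions from linear transformations of a single parent function shared within each layer; while remaining parameter-efficient, it matches or outperforms existing KANs and MLPs on function approximation, tabular regression, and image classification tasks.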

Comments
6 pages, 2 figures

Abstract

The Kolmogorov-Arnold representation theorem offers a theoretical alternative to Multi-Layer Perceptrons (MLPs) by placing learnable univariate functions on edges rather than nodes. While recent implementations such as Kolmogorov-Arnold Networks (KANs) demonstrate high approximation capabilities, they suffer from significant parameter inefficiency due to the requirement of maintaining unique parameterizations for every network edge. In this work, we propose GS-KAN (Generalized Sprecher-KAN), a lightweight architecture inspired by David Sprecher's refinement of the superposition theorem. GS-KAN constructs unique edge functions by applying learnable linear transformations to a single learnable, shared parent function per layer. We evaluate GS-KAN against existing KAN architectures and MLPs across synthetic function approximation, tabular data regression and image classification tasks. Our results demonstrate that GS-KAN outperforms both MLPs and standard KAN baselines on continuous function approximation tasks while maintaining superior parameter efficiency. Additionally, GS-KAN achieves competitive performance with existing KAN architectures on tabular regression and outperforms MLPs on high-dimensional classification tasks. Crucially, the proposed architecture enables the deployment of KAN-based architectures in high-dimensional regimes under strict parameter constraints, a setting where standard implementations are typically infeasible due to parameter explosion. The source code is available at https://github.com/rambamn48/gs-impl.
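
The idea of deriving every edge function from one shared parent can be sketched as follows. This is an illustration under stated assumptions, not the GS-KAN implementation: the abstract does not give the exact parameterization, so the affine form `weight * parent(scale * x + shift)`, the fixed `tanh` parent, and the three-scalars-per-edge count are all hypothetical.

```python
import math

def parent(x):
    """Stand-in for the learnable shared parent function (fixed here for illustration)."""
    return math.tanh(x)

def layer_forward(x, scale, shift, weight):
    """Output j sums, over inputs i, weight[j][i] * parent(scale[j][i] * x[i] + shift[j][i])."""
    return [
        sum(weight[j][i] * parent(scale[j][i] * x[i] + shift[j][i]) for i in range(len(x)))
        for j in range(len(weight))
    ]

def param_counts(n_in, n_out, basis_size):
    """Compare per-edge parameterization with a shared parent (assumed accounting)."""
    per_edge = n_in * n_out * basis_size      # unique basis coefficients on every edge
    shared = basis_size + 3 * n_in * n_out    # one parent plus three scalars per edge
    return per_edge, shared

per_edge, shared = param_counts(n_in=10, n_out=10, basis_size=8)
```

Under this accounting the saving grows with the size of the basis used to represent each univariate function, since that cost is paid once per layer instead of once per edge.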

2601.16622 2026-06-08 cs.LG cs.AI version update

E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory

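Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang

Affiliations: University of Science and Technology of China

AI summary: Proposes the E2Former-V2 architecture, which uses Equivariant Axis-Aligned Sparsification (EAAS) and an on-the-fly equivariant attention mechanism, built on an SO(3)-to-SO(2) change of basis and a custom Triton kernel, to achieve linear activation memory and a 20× improvement in TFLOPS, accelerating inference on the SPICE and OMol25 datasets while maintaining predictive performance.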

Abstract

Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2Former-V2, a scalable architecture that integrates algebraic sparsity with hardware-aware execution. We first propose Equivariant Axis-Aligned Sparsification (EAAS). EAAS builds on Wigner-6j convolution by exploiting an SO(3) → SO(2) change of basis to transform computationally expensive dense tensor contractions into efficient, sparse parity re-indexing operations. Building on this representation, we introduce On-the-Fly Equivariant Attention, a fully node-centric mechanism implemented via a custom fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves a 20× improvement in TFLOPS compared to standard implementations. Extensive experiments on the SPICE and OMol25 datasets demonstrate that E2Former-V2 maintains comparable predictive performance while notably accelerating inference. This work demonstrates that large equivariant transformers can be trained efficiently using widely accessible GPU platforms. The code is available at https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2.

2601.10930 2026-06-08 cs.RO version update

Where to Touch, How to Contact: Hierarchical RL-MPC Framework for Geometry-Aware Long-Horizon Dexterous Manipulation

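Zhixian Xie, Yu Xiang, Michael Posa, Wanxin Jin

Affiliations: Arizona State University; University of Texas at Dallas; University of Pennsylvania

AI summary: Proposes a hierarchical RL-MPC framework in which a high-level RL policy predicts a contact intention (contact location and subgoal pose) and a low-level contact-implicit MPC optimizes local contact modes and replans in real time, enabling geometry-generalized non-prehensile manipulation with 10× better data efficiency and zero-shot transfer to the real world.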

Abstract

A key challenge in contact-rich dexterous manipulation is the need to jointly reason over global geometry and nonsmooth contact dynamics. End-to-end policies bypass this complexity, but often require large amounts of data and transfer poorly from simulation to reality. We address these limitations with a simple insight: dexterous manipulation is inherently hierarchical. At a high level, a robot decides where to touch (geometry); at a low level, it determines how to move the object through contact dynamics. Building on this insight, we propose a hierarchical RL-MPC framework in which a high-level reinforcement learning (RL) policy predicts a contact intention, a novel object-centric interface that specifies (i) an object-surface contact location and (ii) a post-contact object subgoal pose. Conditioned on the contact intention, a low-level contact-implicit model predictive control (MPC) optimizes local contact modes and (re)plans in real time through contact dynamics to generate robot actions that robustly move the object toward each subgoal. We evaluate the framework on non-prehensile tasks, including geometry-generalized pushing across diverse object shapes, pivoting/flipping-based object reorientation, and environment-assisted object repositioning. It achieves a high success rate with substantially reduced data (10 times less than end-to-end baselines), highly robust performance, and zero-shot sim-to-real transfer.

2408.08973 2026-06-08 cs.CV version update

Image class translation: visual inspection of class-specific hypotheticals and classification based on translation distance

Mikyla K. Bowen, Jesse W. Wilson

Affiliations: College of Natural Sciences, Colorado State University, Colorado, United States of America; School of Biomedical and Chemical Engineering, Colorado State University, Colorado, United States of America; Department of Electrical and Computer Engineering, Colorado State University, Colorado, United States of America

AI summary: Proposes using image translation networks for classification, with translation distances serving as low-dimensional features; validated on dermoscopy and bone marrow cell images, the approach is more interpretable than conventional CNNs.

详情
Comments
47 pages, 20 figures, submitted revision to SPIE J. Medical Imaging
AI中文摘要

目的:人工智能在医学应用中的主要障碍是自动CNN缺乏可解释性,并且对错误决策(尤其是域外样本)有高置信度。我们提出图像翻译网络用于图像分类的泛化,并展示翻译网络作为传统黑盒分类器更可解释的替代方案的潜力。 方法:我们训练一个图像到图像网络,将输入图像翻译为类别特定的假设,然后通过视觉和定量方式将这些假设与输入进行比较。翻译距离(即为了符合某一类别所需的改变程度)被检查其聚类和趋势,并用作分类的简单低维特征向量。 结果:在黑色素瘤/良性皮肤镜图像上,翻译距离分类器仅使用2维特征空间就达到了80%的准确率(而传统CNN使用约62,000维特征空间达到85%)。对渲染图像的视觉检查揭示了数据集偏差,例如黑色素瘤照片中比良性病变有更多的比例尺。翻译距离空间中的图像分布揭示了沿着皮肤科医生活检决策的自然分离,而不是恶性与良性之间的分离。在骨髓细胞学图像上,翻译距离分类器在3类(92%准确率对比CNN的89%)和6类(90%对比86%)场景中均优于传统CNN。 结论:这一概念验证表明,图像到图像翻译有潜力超越艺术/风格变化,揭示数据集偏差,进行降维和数据集可视化,并且在某些情况下可能优于传统的端到端CNN分类器。

英文摘要

Purpose: A major barrier to the implementation of artificial intelligence for medical applications is automated CNNs' lack of explainability and high confidence for incorrect decisions, specifically with out-of-domain samples. We propose a generalization of image translation networks for image classification and demonstrate translation networks' potential as a more interpretable alternative to conventional black-box classifiers. Approach: We train an image-to-image network to translate an input image to class-specific hypotheticals, and then compare these with the input, both visually and quantitatively. Translation distances, the degree of alteration needed to conform to one class or another, are examined for clusters and trends, and used as a simple low-dimensional feature vector for classification. Results: On melanoma/benign dermoscopy images, a translation distance classifier achieved 80% accuracy using only a 2-dimensional feature space (versus 85% for a conventional CNN using a ~62,000-dimensional feature space). Visual inspection of rendered images revealed dataset biases, like more scalebars in melanoma photographs than in benign lesions. Image distributions in translation distance space revealed a natural separation along the lines of dermatologist decision to biopsy, rather than between malignant and benign. On bone marrow cytology images, translation distance classifiers outperformed a conventional CNN in both 3-class (92% accuracy vs 89% for CNN) and 6-class (90% vs 86% for CNN) scenarios. Conclusions: This proof-of-concept shows the potential for image-to-image translation to go beyond artistic/stylistic changes and to expose dataset biases, perform dimension reduction and dataset visualization, and in some cases, potentially outperform conventional end-to-end CNN classifiers.
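As a rough illustration of how translation distances could act as a low-dimensional feature vector, the Python sketch below classifies a sample from a 2-dimensional distance vector. The distance values are invented, and the nearest-centroid rule is a stand-in chosen for brevity; the abstract does not say which classifier the authors applied to the distances.

```python
import math

def centroid(points):
    """Mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_centroids(features, labels):
    """Group feature vectors by label and return one centroid per class."""
    by_label = {}
    for f, y in zip(features, labels):
        by_label.setdefault(y, []).append(f)
    return {y: centroid(pts) for y, pts in by_label.items()}

def predict(centroids, feature):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(centroids[y], feature))

# Hypothetical features: (distance to benign hypothetical, distance to melanoma hypothetical).
train_x = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.2), (0.8, 0.1)]
train_y = ["benign", "benign", "melanoma", "melanoma"]
centroids = fit_centroids(train_x, train_y)
```

Under this toy setup, an image that needs little alteration to look benign and a large alteration to look like melanoma lands near the benign centroid.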

2601.09698 2026-06-08 cs.CV 版本更新

COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation

COMPOSE:用于多视角三维人体姿态估计的超图覆盖优化

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

发表机构 * School of Computation, Information, and Technology, Technical University of Munich(技术大学慕尼黑计算、信息与技术学院) Munich Center for Machine Learning(慕尼黑机器学习中心) Department of Computing, Imperial College London(伦敦帝国学院计算机系)

AI总结 提出COMPOSE方法,将多视角三维人体姿态估计重构为超图上的加权精确覆盖优化,通过全局组合目标替代局部配对关联,结合几何剪枝与整数线性规划或信念传播求解器,无监督下精度提升显著。

详情
AI中文摘要

从稀疏多视角相机装置中进行三维人体姿态估计是众多应用(包括动作识别、体育分析和人机交互)的基本任务。尽管学习方法在基准测试中占据主导地位,但它们需要大量标注数据集;无训练的基于优化的方法仍然有前景,因为它们通过解决来自二维检测的跨视角对应问题来规避三维监督。现有的组合公式依赖配对关联来建模这一对应问题,并将跨视角的全局一致性仅作为下游约束来强制执行。然而,在遮挡和噪声检测下,调和局部合理的配对匹配变得脆弱,局部错误会全局传播。我们提出COMPOSE,它将多视角三维人体姿态估计重新定义为对人物假设超图上的加权精确覆盖优化。我们的公式用单个全局组合目标替代了配对关联和事后一致性强制执行。为了应对指数级大的候选空间,我们引入了一种几何剪枝策略以及两种互补的求解器:精确整数线性规划公式和通过信念传播的可扩展松弛。在没有任何三维监督的情况下,COMPOSE在平均精度上比最佳基于优化的方法提高了31个百分点,比自监督学习方法提高了13个百分点,证明了高阶组合关联在无训练的多视角三维人体姿态估计中的有效性。

英文摘要

3D human pose estimation from sparse multi-view camera rigs is an essential task for numerous applications, including action recognition, sports analysis, and human-robot interaction. While learned methods dominate the field on benchmarks, they require large annotated datasets; training-free optimization-based methods remain promising as they circumvent 3D supervision by solving a correspondence problem across views from 2D detections. Existing combinatorial formulations rely on pairwise associations to model this correspondence problem and enforce global consistency across views only as a downstream constraint. However, reconciling locally plausible pairwise matches becomes brittle under occlusion and noisy detections, where local errors propagate globally. We propose COMPOSE, which recasts multi-view 3D human pose estimation as a weighted exact-cover optimization over a hypergraph of person hypotheses. Our formulation replaces pairwise association and post-hoc consistency enforcement with a single global combinatorial objective. To address the exponentially large candidate space, we introduce a geometric pruning strategy alongside two complementary solvers: an exact Integer Linear Programming formulation and a scalable relaxation via Belief Propagation. Without any 3D supervision, COMPOSE improves average precision by up to 31 points over the best optimization-based method and 13 points over self-supervised learned methods, demonstrating the effectiveness of higher-order combinatorial association for training-free multi-view 3D human pose estimation.
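To make the weighted exact-cover idea concrete, here is a small brute-force Python sketch. It assumes each hyperedge is a candidate person hypothesis grouping 2D detections from different views, that higher weight means a better hypothesis, and that every detection must be covered exactly once; the detection names and weights are hypothetical. It enumerates all subsets, so it only illustrates the objective and is not the ILP or Belief Propagation solver described in the abstract.

```python
from itertools import combinations

def best_exact_cover(universe, hyperedges):
    """Return (total_weight, indices) of the heaviest exact cover, or (None, ()).

    hyperedges is a list of (frozenset_of_elements, weight) pairs.
    """
    best = (None, ())
    for r in range(1, len(hyperedges) + 1):
        for combo in combinations(range(len(hyperedges)), r):
            covered = []
            for i in combo:
                covered.extend(hyperedges[i][0])
            # Exact cover: no element repeated and none missing.
            if len(covered) == len(universe) and set(covered) == set(universe):
                weight = sum(hyperedges[i][1] for i in combo)
                if best[0] is None or weight > best[0]:
                    best = (weight, combo)
    return best

# Two detections per camera (a = camera A, b = camera B); weights are made up.
detections = {"a1", "a2", "b1", "b2"}
hypotheses = [
    (frozenset({"a1", "b1"}), 9),  # consistent cross-view match
    (frozenset({"a2", "b2"}), 8),  # consistent cross-view match
    (frozenset({"a1", "b2"}), 3),  # implausible match
    (frozenset({"a2", "b1"}), 4),  # implausible match
    (frozenset({"a1"}), 1),        # single-view fallbacks
    (frozenset({"a2"}), 1),
    (frozenset({"b1"}), 1),
    (frozenset({"b2"}), 1),
]
result = best_exact_cover(detections, hypotheses)
```

Because the objective is evaluated over whole covers, a locally plausible pair cannot be selected if it forces a poor assignment for the remaining detections.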

2601.09402 2026-06-08 cs.CL 版本更新

SEEK: Steering LLM Reasoning for RAG via Internal Reasoning Sketches

SEEK: 通过内部推理草图引导LLM推理用于RAG

Xinze Li, Yuqing Lan, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun

发表机构 * School of Computer Science and Engineering, Northeastern University, China(东北大学计算机科学与工程学院) Department of Computer Science and Technology, Institute for AI, Tsinghua University, China(清华大学计算机科学与技术系,人工智能研究院) School of Intelligent Science and Technology, Nanjing University, China(南京大学智能科学与技术学院)

AI总结 提出SEEK框架,通过构建结构化引导草图,迭代检索和填充知识槽,减少冗余检索,提升RAG性能。

详情
AI中文摘要

检索增强生成(RAG)通过将外部知识融入生成过程来增强大型语言模型(LLM)。借助LLM的推理能力,现有方法利用这种能力实现迭代知识获取和积累,从而更好地支持答案生成。然而,随着推理轨迹的增长,积累的知识和先前生成的查询可能会干扰后续检索决策,导致子查询意图重复和知识获取冗余。为了解决这个问题,我们提出了SEEK,一种用于RAG的草图引导知识获取框架。SEEK首先提示LLM为给定问题构建一个结构化的引导草图。它由多组引导要点组成,每个要点后跟一个用于知识填充的槽位。在这些引导要点的指导下,SEEK迭代地检索和精炼知识,并填充相应的槽位以完成草图。然后,完成的草图作为上下文输入用于最终答案生成。实验结果表明,SEEK在多个任务上取得了比基线模型更好的性能。进一步分析表明,SEEK可以生成更多样化的子查询,减少冗余检索,并在外部知识利用和内部知识冲突缓解之间实现更好的平衡。所有代码可在 https://github.com/OpenBMB/PAGER 获取。

英文摘要

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge into the generation process. Existing methods leverage the reasoning capabilities of LLMs to enable iterative knowledge acquisition and accumulation, thereby better supporting answer generation. However, as the reasoning trajectory grows, the accumulated knowledge and previously generated queries may interfere with subsequent retrieval decisions, resulting in sub-queries with repetitive intents and redundant knowledge acquisition. To address this issue, we propose SEEK, a sketch-guided knowledge acquisition framework for RAG. SEEK first prompts the LLM to construct a structured steering sketch for the given question. The sketch consists of multiple groups of steering gists, with each gist followed by a slot for knowledge filling. Guided by these steering gists, SEEK iteratively retrieves and refines knowledge, and fills the corresponding slots to complete the sketch. The completed sketch is then used as contextual input for final answer generation. Experimental results show that SEEK achieves better performance than baseline models across multiple tasks. Further analyses demonstrate that SEEK can generate more diverse sub-queries, reduce redundant retrieval, and achieve a better balance between external knowledge utilization and internal knowledge conflict mitigation. All code is available at https://github.com/OpenBMB/PAGER.

2601.08097 2026-06-08 cs.CL cs.LG 版本更新

AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

AdaJudge: 自适应多视角评判用于奖励建模

Yongliang Miao, Yangyang Liang, Mengnan Du

发表机构 * Emory University(埃默里大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出AdaJudge框架,通过门控精化块和自适应多视角池化模块,联合优化表示与聚合,解决奖励建模中静态归纳偏差和表示不匹配问题,在RM-Bench和JudgeBench上超越现有模型。

详情
Comments
ACL 2026
AI中文摘要

奖励建模对于将大型语言模型与人类偏好对齐至关重要,但主流架构依赖静态池化策略将序列压缩为标量分数。然而,这种范式存在两个关键限制:静态归纳偏差与任务相关的偏好信号不匹配,以及表示不匹配,因为骨干网络针对生成的优化使其表示不适用于细粒度判别。为解决这一问题,我们提出AdaJudge,一个统一框架,联合调整表示和聚合。AdaJudge首先通过门控精化块将骨干网络表示改进到判别导向的空间。然后,它用自适应多视角池化模块替换静态读出,该模块动态路由并组合证据。在RM-Bench和JudgeBench上的大量实验表明,AdaJudge优于强大的现成奖励模型和传统池化基线。

英文摘要

Reward modeling is essential for aligning large language models with human preferences, yet predominant architectures rely on a static pooling strategy to condense sequences into scalar scores. This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with task-dependent preference signals, and a representational mismatch, as the backbone's optimization for generation leaves its representations ill-suited to fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. AdaJudge first improves backbone representations into a discrimination-oriented space via gated refinement blocks. It then replaces the static readout with an adaptive multi-view pooling module, which dynamically routes and combines evidence. Extensive experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong off-the-shelf reward models and traditional pooling baselines.

2601.05751 2026-06-08 cs.CL cs.AI 版本更新

Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

分析LLM生成文本中说服性语言的差异:揭示刻板的性别模式

Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent

发表机构 * Department of Computer Science, Aarhus University(奥胡斯大学计算机科学系) AMD Silo AI University of Tokyo(东京大学) IT University of Copenhagen(哥本哈根IT大学) Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

AI总结 提出框架评估LLM生成说服性语言时受接收者性别、发送者意图和输出语言的影响,发现所有模型均存在显著的性别差异,反映性别刻板印象的语言倾向。

详情
Comments
Accepted at ACL Findings 2026
AI中文摘要

大型语言模型(LLMs)越来越多地用于日常交流任务,包括起草旨在影响和说服的人际信息。先前研究表明,LLMs能够成功说服人类并放大说服性语言。因此,理解用户指令如何影响说服性语言的生成,以及生成的说服性语言是否因目标群体不同而有所差异至关重要。在这项工作中,我们提出了一个框架,用于评估说服性语言生成如何受接收者性别、发送者意图或输出语言的影响。我们使用成对提示指令评估了13个LLMs和16种语言。我们采用基于社会心理学和传播科学的LLM-as-judge设置,在19个说服性语言类别上评估模型响应。我们的结果揭示了所有模型生成的说服性语言中存在显著的性别差异。这些模式反映了与社会心理学和社会语言学中记录的性别刻板语言倾向一致的偏见。

英文摘要

Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

2601.05675 2026-06-08 cs.AI 版本更新

CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space

CHDP:参数化动作空间中强化学习的协同混合扩散策略

Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对混合动作空间中的策略表达力不足和高维扩展性差问题,提出协同混合扩散策略框架,通过离散和连续扩散策略的协作与顺序更新,结合码本嵌入和Q函数引导,在基准测试中成功率提升高达19.3%。

详情
Comments
Accepted by AAAI 2026
AI中文摘要

混合动作空间结合了离散选择和连续参数,在机器人控制和游戏AI等领域普遍存在。然而,高效建模和优化离散-连续混合动作空间仍然是一个基本挑战,主要由于策略表达力有限和高维设置下的可扩展性差。为应对这一挑战,我们将混合动作空间问题视为完全合作博弈,并提出协同混合扩散策略(CHDP)框架来解决。CHDP采用两个协作智能体,分别利用离散和连续扩散策略。连续策略以离散动作的表示为条件,显式建模它们之间的依赖关系。这种协作设计使扩散策略能够利用其表达力捕获各自动作空间中的复杂分布。为缓解协作设置中同时更新策略导致的更新冲突,我们采用顺序更新方案以促进协同适应。此外,为提高在高维离散动作空间中学习时的可扩展性,我们构建了一个将动作空间嵌入低维潜在空间的码本。该映射使离散策略能够在紧凑、结构化的空间中学习。最后,我们设计了一种基于Q函数的引导机制,在训练过程中对齐码本的嵌入与离散策略的表示。在具有挑战性的混合动作基准测试中,CHDP的成功率比最先进方法高出高达19.3%。

英文摘要

Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space remains a fundamental challenge, mainly due to limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the discrete action's representation, explicitly modeling the dependency between them. This cooperative design allows the diffusion policies to leverage their expressiveness to capture complex distributions in their respective action spaces. To mitigate the update conflicts arising from simultaneous policy updates in this cooperative setting, we employ a sequential update scheme that fosters co-adaptation. Moreover, to improve scalability when learning in high-dimensional discrete action space, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping enables the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism to align the codebook's embeddings with the discrete policy's representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.

2505.11470 2026-06-08 cs.CL 版本更新

Reference-Free Evaluation of Taxonomies

无参考评价的层次分类体系

Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster

发表机构 * Hamilton Institute, Maynooth University, Ireland(爱尔兰梅诺特大学哈密尔顿研究所) School of Computing, Dublin City University, Ireland(爱尔兰都柏林城市大学计算学院) Lucerne School of Computer Science and IT, Switzerland(瑞士卢塞恩计算机科学与信息技术学院)

AI总结 提出两种无参考指标评估层次分类体系质量:基于语义与分类相似性相关性的鲁棒性指标,以及基于自然语言推理的逻辑充分性指标,在五个层次分类体系上验证与真实F1值高度相关,并能预测下游层次分类性能。

详情
AI中文摘要

我们引入了两种无参考指标,用于在缺乏标签的情况下评估层次分类体系的质量。第一个指标通过计算语义相似性与分类相似性之间的相关性来评估鲁棒性,解决了现有指标未考虑的错误类型。第二个指标使用自然语言推理来评估逻辑充分性。这两个指标在五个层次分类体系上进行了测试,结果显示它们与真实层次分类体系的F1值高度相关。我们进一步证明,当与标签层次结构一起使用时,我们的指标可以预测层次分类中的下游性能。

英文摘要

We introduce two reference-free metrics for quality evaluation of taxonomies in the absence of labels. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, addressing error types not considered by existing metrics. The second uses Natural Language Inference to assess logical adequacy. Both metrics are tested on five taxonomies and are shown to correlate well with F1 against ground truth taxonomies. We further demonstrate that our metrics can predict downstream performance in hierarchical classification when used with label hierarchies.
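The first metric correlates semantic similarity with taxonomic similarity. The Python sketch below shows one way such a score could be computed on a toy taxonomy. Everything specific here is an assumption made for illustration: the embeddings are invented 2-d vectors, taxonomic similarity is taken as 1 / (1 + tree distance), and Pearson correlation is used; the abstract does not state which similarity measures or correlation coefficient the authors chose.

```python
import math
from itertools import combinations

def ancestors(node, parent):
    """Path from a node up to its root, including the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges between a and b via their lowest common ancestor."""
    depth_in_b = {n: i for i, n in enumerate(ancestors(b, parent))}
    for i, n in enumerate(ancestors(a, parent)):
        if n in depth_in_b:
            return i + depth_in_b[n]
    raise ValueError("nodes are in different trees")

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def robustness(parent, embedding):
    """Correlate semantic and taxonomic similarity over all concept pairs."""
    sem, tax = [], []
    for a, b in combinations(sorted(embedding), 2):
        sem.append(cosine(embedding[a], embedding[b]))
        tax.append(1.0 / (1.0 + tree_distance(a, b, parent)))
    return pearson(sem, tax)

# Toy taxonomy (child -> parent) and hypothetical embeddings.
parent = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
          "trout": "fish", "fish": "animal"}
embedding = {"dog": (1.0, 0.1), "cat": (0.9, 0.2), "trout": (0.1, 1.0)}
score = robustness(parent, embedding)
```

A score near 1 means concepts that sit close together in the taxonomy are also semantically close; misplacing a node would pull the correlation down.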

2512.13278 2026-06-08 cs.CL cs.LG 版本更新

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

AutoTool: 面向智能体推理的动态工具选择与集成

Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出AutoTool框架,通过双阶段优化(SFT+RL轨迹稳定化和KL正则化Plackett-Luce排序)使大语言模型具备动态工具选择能力,在数学、科学、代码和多模态推理等任务上平均提升6.4%-7.7%。

详情
Comments
ICML 2026; Best Paper Award at ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence
AI中文摘要

智能体强化学习推动了大语言模型(LLMs)在长链思维轨迹中进行推理,同时穿插外部工具的使用。现有方法假设工具集固定,限制了LLM智能体对新工具或演化工具集的适应性。我们提出AutoTool,一个训练框架,使LLM智能体在整个推理轨迹中具备动态工具选择能力。AutoTool采用双阶段优化流水线:(i)基于SFT和RL的轨迹稳定化,以实现连贯推理;(ii)KL正则化的Plackett-Luce排序,以优化一致的多步工具选择。我们进一步构建了一个包含20万条数据的数据集,其中包含跨1000多个工具和100多个任务(涵盖数学、科学、代码生成和多模态推理)的显式工具选择理由。在十个多样化基准上,我们使用AutoTool训练了两个基础模型:Qwen3-8B和Qwen2.5-VL-7B。在参数更少的情况下,AutoTool持续优于先进的LLM智能体和工具集成方法,在数学与科学推理上平均提升6.4%,在基于搜索的问答上提升4.5%,在代码生成上提升7.7%,在多模态理解上提升6.9%。此外,AutoTool通过在推理过程中动态利用演化工具集中的未见工具,展现出更强的泛化能力。

英文摘要

Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which limits the adaptability of LLM agents to new or evolving toolsets. We present AutoTool, a training framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. AutoTool employs a dual-phase optimization pipeline: (i) SFT and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce Ranking to refine consistent multi-step tool selection. We further build a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
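For readers unfamiliar with Plackett-Luce ranking, the following Python sketch computes the log-likelihood of an observed ranking under the standard Plackett-Luce model: items are picked one at a time with probability proportional to exp(score) among those remaining. The tool names and scores are hypothetical, and the KL regularization mentioned in the abstract is not modeled here.

```python
import math

def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best first) given per-item real scores."""
    remaining = list(ranking)
    log_lik = 0.0
    for item in ranking:
        # Softmax over the items that have not been ranked yet.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_lik += scores[item] - log_norm
        remaining.remove(item)
    return log_lik

# Hypothetical tool scores produced by a policy.
tool_scores = {"calculator": 2.0, "search": 1.0, "code_runner": 0.0}
good = plackett_luce_log_likelihood(tool_scores, ["calculator", "search", "code_runner"])
bad = plackett_luce_log_likelihood(tool_scores, ["code_runner", "search", "calculator"])
```

Maximizing this quantity pushes scores so that the preferred ordering of tools becomes the most probable one.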

2512.12997 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

校准零样本对抗性CLIP的不确定性

Wenjing Lu, Zerui Tao, Yuning Qiu, Dongping Zhang, Yang Yang, Qibin Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对CLIP在零样本分类中对抗攻击脆弱且不确定性校准差的问题,提出基于狄利克雷分布重参数化的对抗微调目标,统一对齐语义结构与置信度,提升校准性和鲁棒性。

详情
Comments
ICML 2026
AI中文摘要

CLIP在零样本分类中表现强劲,但仍易受对抗攻击。先前的对抗微调工作主要匹配干净样本和对抗样本之间的预测logits,忽略了不确定性校准,可能损害零样本泛化能力。在可靠的不确定性估计中,一个常见期望是预测不确定性应随输入难度增加或偏离训练分布而上升。然而,在对抗环境中我们经常观察到相反的情况:扰动不仅降低准确性,还抑制不确定性,导致严重的校准错误和过度自信。这揭示了鲁棒性之外的关键可靠性差距。为弥合这一差距,我们提出了一种考虑准确性和不确定性的CLIP对抗微调目标。通过将CLIP输出重参数化为狄利克雷分布的浓度参数,我们提出了一种统一表示,捕获相对语义结构和置信度大小。这使得在扰动下实现整体分布对齐,超越单一logits锚定,恢复校准的不确定性。在多个零样本基准上的实验表明,我们的方法显著提高了不确定性校准,在保持干净准确性的同时实现了具有竞争力的对抗鲁棒性。

英文摘要

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and over-confidence. This reveals a critical reliability gap beyond robustness. To bridge this gap, we propose an adversarial fine-tuning objective for CLIP considering both accuracy and uncertainty. By reparameterizing CLIP outputs as the concentration parameters of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and confidence magnitude. This enables holistic distribution alignment under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments across multiple zero-shot benchmarks demonstrate that our method significantly improves uncertainty calibration and achieves competitive adversarial robustness while preserving clean accuracy.
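The sketch below illustrates why a Dirichlet parameterization can separate relative semantic structure from confidence magnitude. It assumes the common evidential convention alpha = exp(logit) + 1 and measures uncertainty as K / sum(alpha); both choices are assumptions for illustration, since the abstract does not give the exact mapping from CLIP outputs to concentration parameters.

```python
import math

def dirichlet_summary(logits):
    """Return (expected class probabilities, uncertainty mass) for a Dirichlet.

    Assumed mapping: concentration alpha_k = exp(logit_k) + 1.
    """
    alphas = [math.exp(z) + 1.0 for z in logits]
    strength = sum(alphas)                      # total evidence plus K
    probs = [a / strength for a in alphas]      # Dirichlet mean
    uncertainty = len(alphas) / strength        # shrinks as evidence grows
    return probs, uncertainty

weak_probs, weak_unc = dirichlet_summary([0.0, 0.0, 0.0])
strong_probs, strong_unc = dirichlet_summary([3.0, 3.0, 3.0])
```

Both inputs yield the same expected probabilities, yet the second carries far more evidence and therefore lower uncertainty, a distinction that a softmax over logits alone cannot express.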

2512.10521 2026-06-08 cs.CV 版本更新

Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA

Take a Peek: 通过LoRA高效编码器适应少样本语义分割

Pasquale De Marinis, Gennaro Vessio, Giovanna Castellano

发表机构 * University of Bari(巴里大学)

AI总结 提出TaP方法,利用低秩适应(LoRA)微调编码器,在少样本和跨域少样本语义分割中实现高效适应,提升新类分割性能。

详情
AI中文摘要

少样本语义分割(FSS)旨在仅使用少量标注支持集对查询图像中的新类进行分割。先前研究主要关注改进解码器,但编码器提取未见类有意义特征的能力有限仍是关键瓶颈。本文提出Take a Peek(TaP),一种简单而有效的方法,通过引入基于支持集的轻量级特征空间偏移,增强了编码器对FSS和跨域FSS的适应性。TaP利用低秩适应(LoRA)在支持集上微调编码器,计算开销极小,能够快速适应新类同时减轻灾难性遗忘。我们的方法模型无关,可无缝集成到现有FSS流程中。在多个基准(包括COCO $20^i$、Pascal $5^i$以及跨域数据集DeepGlobe、ISIC和Chest X-ray)上的大量实验表明,TaP在不同模型和shot设置下一致地提升了分割性能。值得注意的是,TaP在复杂的多类场景中取得了显著增益,突显了其在现实场景中的实际有效性。秩敏感性分析还表明,即使采用低秩适应也能实现强性能,从而确保计算效率。通过解决FSS中编码器泛化到新类的关键限制,TaP为构建更鲁棒、高效和可泛化的分割系统铺平了道路。代码可在 https://github.com/pasqualedem/TakeAPeek 获取。

英文摘要

Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS by inducing a lightweight feature-space shift conditioned on the support set. TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, thereby ensuring computational efficiency. By addressing a critical limitation in FSS, namely the encoder's generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
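As background on the Low-Rank Adaptation mechanism that TaP builds on, here is a dependency-free Python sketch of a single LoRA-adapted linear layer: the frozen weight W is kept as is and a trainable low-rank product B·A, scaled by alpha / r, is added to its output. The matrices and the `alpha` value are toy numbers; how TaP places and trains these adapters inside the encoder is not described in the abstract and is not modeled here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x) for one input vector.

    W: d_out x d_in frozen weight, A: r x d_in, B: d_out x r (trainable).
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen encoder weight (toy identity)
A = [[1.0, 1.0]]               # rank-1 down-projection
B_init = [[0.0], [0.0]]        # B starts at zero, so the layer is unchanged
B_tuned = [[1.0], [0.0]]       # hypothetical values after fine-tuning
x = [2.0, 3.0]

before = lora_forward(W, A, B_init, x, alpha=1.0)
after = lora_forward(W, A, B_tuned, x, alpha=1.0)
```

Only A and B are updated during adaptation, which is why the overhead stays small and the original weights are left intact.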

2512.09634 2026-06-08 cs.CL 版本更新

Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale

爱沙尼亚主观性数据集的创建:评估主观性程度的一个量表

Karl Gustav Gailit, Kadri Muischnek, Kairit Sirts

发表机构 * University of Tartu(塔尔图大学)

AI总结 本文创建了爱沙尼亚语文档级主观性数据集,通过连续量表标注并分析标注一致性,初步实验使用大语言模型进行自动主观性分析,发现自动评分可行但不可完全替代人工。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) 8204-8216
Comments
9 pages, 5 figures, 3 appendixes, LREC 2026
AI中文摘要

本文介绍了爱沙尼亚语文档级主观性数据集的创建,分析了所得标注,并报告了使用大语言模型(LLM)进行自动主观性分析的初步实验。该数据集包含1000个文档——300篇新闻文章和700个随机选择的网络文本——每个文档由四位标注员在从0(完全客观)到100(完全主观)的连续量表上评分。由于标注员间相关性中等,部分文本得分位于量表两端,因此对得分差异最大的文本子集进行了重新标注,标注员间相关性有所提高。除了人工标注外,数据集还包括GPT-5作为标注自动化实验生成的分数。这些分数与人工标注相似,但出现了一些差异,表明基于LLM的自动主观性评分虽然可行,但并非人工标注的可互换替代方案,其适用性取决于预期应用。

英文摘要

This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment on automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores at opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, which improved the inter-annotator correlation. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment on annotation automation. These scores were similar to those of the human annotators; however, several differences emerged, suggesting that while LLM-based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.

2511.06080 2026-06-08 cs.CV cs.CY cs.HC 版本更新

AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired

AIDEN:面向视障人士的AI助手设计与初步研究

Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla

发表机构 * Institute for Computer Research, University of Alicante(计算机研究所,阿利坎特大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出AIDEN系统,结合YOLO实时目标检测、LLaVA场景描述与OCR,以及基于盖革计数器隐喻的连续触觉引导,避免听觉过载并保护隐私,实验表明用户满意度高。

详情
AI中文摘要

本文介绍了AIDEN,一种基于人工智能的助手,旨在增强视障人士的自主性和日常生活质量,他们通常在物体识别、文本阅读和陌生环境导航方面遇到困难。现有的解决方案如屏幕阅读器或基于音频的助手虽然便于获取信息,但常常导致听觉过载,并在开放环境中引发隐私问题。AIDEN通过一种混合架构解决了这些限制,该架构集成了用于实时目标检测的YOLO(You Only Look Once)和用于场景描述及光学字符识别(OCR)的大型语言与视觉助手(LLaVA)。该系统的一个关键创新是基于盖革计数器隐喻的连续触觉引导机制,该机制在不占用听觉通道的情况下支持物体居中,同时通过确保不存储个人数据来保护隐私。与视障参与者进行的实证评估使用技术接受模型(TAM)评估了感知易用性和接受度。结果表明用户满意度高,特别是在直观性和感知自主性方面。此外,“寻找物体”功能实现了有效的实时性能。这些发现提供了有希望的证据,表明与传统的以音频为中心的方法相比,多模态触觉-视觉反馈可以改善日常可用性和独立性,从而推动更大规模的临床验证。

英文摘要

This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the "Find an Object" feature achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.
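One plausible reading of the Geiger-counter metaphor is that haptic pulses arrive faster as the detected object approaches the center of the camera frame. The Python sketch below encodes that reading; the function name, the normalized coordinates, the linear mapping, and the 80 to 800 ms range are all hypothetical, as the abstract gives no implementation details.

```python
def pulse_interval_ms(obj_x, obj_y, min_ms=80.0, max_ms=800.0):
    """Map an object's normalized image position to a vibration pulse interval.

    obj_x and obj_y are in [0, 1], with (0.5, 0.5) the frame center.
    Centered objects give the shortest interval (fastest pulsing).
    """
    dx, dy = obj_x - 0.5, obj_y - 0.5
    max_offset = 0.5 * 2 ** 0.5                      # center-to-corner distance
    offset = min(1.0, (dx * dx + dy * dy) ** 0.5 / max_offset)
    return min_ms + offset * (max_ms - min_ms)

centered = pulse_interval_ms(0.5, 0.5)   # object in the middle of the frame
cornered = pulse_interval_ms(0.0, 0.0)   # object at the top-left corner
```

A continuous mapping of this kind lets the user home in on an object by feel alone, leaving the auditory channel free.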

2510.03381 2026-06-08 cs.LG cs.AI 版本更新

Proxy Reconstruction Pre-training for Ramp Flow Prediction at Highway Interchanges

高速公路立交匝道流量预测的代理重建预训练

Yongchao Li, Jun Chen, Zhuoxuan Li, Chao Gao, Yang Li, Chu Zhang, Changyin Dong

发表机构 * Southeast University(东南大学) Institute of Telecommunications and Information Sciences, China(中国电信与信息科学研究院)

AI总结 提出时空解耦自编码器(STDAE),通过跨模态重建预训练从主线数据恢复匝道流量,结合GWNet等模型提升预测精度,在真实数据集上超越13个基线。

详情
Journal ref
Applied Soft Computing Journal 200 (2026) 115462
Comments
Accepted at Applied Soft Computing Journal
AI中文摘要

立交桥是高速公路间车辆转换的关键节点,但缺乏实时匝道检测器导致交通预测存在盲区。为解决这一问题,我们提出时空解耦自编码器(STDAE),一种利用跨模态重建预训练的两阶段框架。在第一阶段,STDAE从主线数据重建历史匝道流量,迫使模型捕捉内在的时空关系。其解耦架构通过并行的空间和时间自编码器高效提取异质特征。在预测阶段,学习到的表示与GWNet等模型集成以提高准确性。在三个真实立交数据集上的实验表明,STDAE-GWNET始终优于十三个最先进的基线,并达到与使用历史匝道数据的模型相当的性能。这证明了其在克服检测器稀缺方面的有效性及其在不同预测流程中的即插即用潜力。

英文摘要

Interchanges are crucial nodes for vehicle transfers between highways, yet the lack of real-time ramp detectors creates blind spots in traffic prediction. To address this, we propose a Spatio-Temporal Decoupled Autoencoder (STDAE), a two-stage framework that leverages cross-modal reconstruction pretraining. In the first stage, STDAE reconstructs historical ramp flows from mainline data, forcing the model to capture intrinsic spatio-temporal relations. Its decoupled architecture with parallel spatial and temporal autoencoders efficiently extracts heterogeneous features. In the prediction stage, the learned representations are integrated with models such as GWNet to enhance accuracy. Experiments on three real-world interchange datasets show that STDAE-GWNET consistently outperforms thirteen state-of-the-art baselines and achieves performance comparable to models using historical ramp data. This demonstrates its effectiveness in overcoming detector scarcity and its plug-and-play potential for diverse forecasting pipelines.

2511.12795 2026-06-08 cs.RO 版本更新

ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

ActiveGrasp: 基于校准能量模型的信息引导主动抓取

Boshu Lei, Wen Jiang, Kostas Daniilidis

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Archimedes, Athena RC(阿基米德、阿提卡RC)

AI总结 针对密集杂乱环境中的抓取问题,提出一种校准能量模型生成抓取姿态,并基于抓取分布的信息增益主动选择视角,在有限视角下高效抓取目标物体。

详情
Comments
CVPR 2026
AI中文摘要

在密集杂乱环境中抓取对机器人是一项具有挑战性的任务。以往的方法试图通过在抓取姿态生成前主动收集多个视角来解决这个问题。然而,它们要么忽略了抓取分布对信息增益估计的重要性,要么依赖于抓取分布的投影,这忽略了SE(3)流形上抓取姿态的结构。为了应对这些挑战,我们提出了一种用于抓取姿态生成的校准能量模型,以及一种从抓取分布估计信息增益的主动视角选择方法。我们的能量模型捕捉了SE(3)流形上抓取分布的多模态特性。能量水平被校准到抓取的成功率,使得预测分布与真实分布一致。通过从基于重建环境的校准分布中估计抓取的信息增益,选择下一个最佳视角,这可以高效地驱动机器人探索目标物体的可抓取部分。在模拟环境和真实机器人设置上的实验表明,与先前最先进的模型相比,我们的模型能够在有限视角预算下成功抓取杂乱环境中的物体。我们的模拟环境可以作为未来主动抓取研究的可复现平台。当论文公开发布时,我们的源代码将公开。

英文摘要

Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on a projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from the grasp distribution. Our energy-based model captures the multimodal nature of the grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasping from the calibrated distribution conditioned on the reconstructed environment, which efficiently drives the robot to explore graspable parts of the target object. Experiments in simulated environments and on real robot setups demonstrate that, under a limited view budget, our model grasps objects in cluttered environments more successfully than previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code will be made public when the paper is released.

2511.07380 2026-06-08 cs.CL 版本更新

Mining Useful General Data for Low-Resource Domain Adaptation

挖掘低资源领域适应的有用通用数据

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

发表机构 * arXiv

AI总结 针对低资源领域数据稀缺问题,提出NTK-Selector方法,利用神经正切核从通用数据中筛选有用样本,显著提升领域适应效果。

详情
Comments
39 pages
AI中文摘要

由于领域特定数据的稀缺性,将大型语言模型(LLMs)适应到低资源领域仍然具有挑战性。虽然领域内数据有限,但存在大量与领域任务共享相似问答格式和推理模式的通用领域数据。这一观察提出了一个重要问题:能否挖掘有用的通用领域数据来改进低资源领域适应?我们的初步发现表明,即使没有仔细选择,通用领域的思维链数据也包含对领域适应有用的辅助信号。这一观察催生了一种新的领域适应范式,即不再完全依赖领域特定数据。为了系统地识别最有益的通用领域样本,我们提出了NTK-Selector,其动机源于神经正切核捕捉训练动态中对齐的能力。由于直接将NTK应用于预训练LLMs不切实际,我们引入了一种无雅可比矩阵的NTK近似,并在微调过程中经验性地展示了稳定的NTK类行为。在医学、金融、法律和心理领域的广泛实验表明,NTK-Selector始终优于仅使用领域数据的微调和现有数据选择基线。特别是,NTK-Selector在Llama3-8B-Instruct和Qwen3-8B上分别取得了+8.7和+5.1个百分点的提升,而仅使用领域数据的微调仅分别提升了+0.8和+0.9个百分点。

英文摘要

Adapting large language models (LLMs) to low-resource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question-answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation? Our initial findings show that general-domain chain-of-thought data contains useful auxiliary signals for domain adaptation, even without careful selection. This observation motivates a new paradigm for domain adaptation beyond exclusive reliance on domain-specific data. To systematically identify the most beneficial general-domain samples, we propose NTK-Selector, motivated by the Neural Tangent Kernel's ability to capture alignment in training dynamics. Since directly applying NTK to pretrained LLMs is impractical, we introduce a Jacobian-free NTK approximation and empirically demonstrate stable NTK-like behavior during fine-tuning. Extensive experiments across medical, financial, legal, and psychological domains demonstrate that NTK-Selector consistently outperforms domain-only fine-tuning and existing data selection baselines. In particular, NTK-Selector achieves gains of +8.7 and +5.1 points on Llama3-8B-Instruct and Qwen3-8B, respectively, compared to only +0.8 and +0.9 points from domain-only fine-tuning.

2511.05949 2026-06-08 cs.CV Version update

Zero-Shot Polygon Matching with Pre-trained Models for Pose Estimation and Polygon Cloud from Challenging Stereo

Zero-Shot Polygon Matching Based on Pre-trained Models for Pose Estimation and Polygon Clouds from Challenging Stereo Images

Chang Li, Xingtao Peng

Affiliations * Chang Li, Xingtao Peng

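AI summary: Proposes Z(PM)2, the first zero-shot polygon matching paradigm, which combines pre-trained models with handcrafted geometric constraints and uses bidirectional pyramid matching and local-holistic bipartite graph optimization to handle disparity discontinuity, scale variation, and related problems, achieving leading performance in pose estimation and 3D representation.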

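Details
AI Chinese abstract (English translation)

Although stereo matching is mature for 0D point and 1D line primitives, establishing correspondences between 2D polygons remains largely unexplored because of challenges such as disparity discontinuity, scale variation, training dependency, and poor generalization, which limits downstream tasks such as pose estimation and 3D reconstruction. To address these problems, we propose, for the first time, a zero-shot polygon matching paradigm based on pre-trained models (Z(PM)2), which combines learned features with handcrafted geometric constraints through plug-and-play modules and extends matching from 0D/1D primitives to 2D polygons. The pipeline comprises three core stages: first, the detector uses the pre-trained Segment Anything Model to vectorize segmentation masks into graph-structured polygons that fuse geometry and texture; second, the global matcher uses a bidirectional pyramid and multiple geometric constraints to handle viewpoint changes; third, the local matcher uses local-holistic bipartite graph optimization to resolve disparity discontinuity and topological inconsistency. In addition, we develop polygon-matching-guided pose estimation, which uses the correspondences to obtain well-distributed, low-redundancy homologous points, and we introduce the polygon cloud concept together with an optimal surface generation method, producing structurally complete and semantically rich 3D representations that go beyond point clouds and line clouds. Because no directly comparable polygon matching method for stereo images exists, we chose the state-of-the-art methods closest to this task as baselines. Extensive experiments on five challenging datasets (ISPRS, KITTI, ScanNet, SceneFlow, DTU) show that Z(PM)2 achieves a matching area score of 68.60%, about 32% higher than MESA, and ranks first in area-level pose estimation, with competitive speed and strong zero-shot generalization without any training.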

English abstract

While stereo matching has achieved maturity for 0D point and 1D line primitives, establishing correspondences for 2D polygons remains largely unexplored due to challenges including disparity discontinuity, scale variation, training dependency, and poor generalization, which limits downstream tasks such as pose estimation and 3D reconstruction. To address these issues, we are the first to propose a Zero-shot Polygon Matching paradigm with Pre-trained Models (i.e., Z(PM)2), which combines learned features and handcrafted geometric constraints through plug-and-play modules, extending matching from 0D/1D primitives to 2D polygons. The pipeline comprises three core stages: first, the detector leverages the pre-trained Segment Anything Model to vectorize segmentation masks into graph-structured polygons that integrate geometry and texture; second, the global matcher uses a bidirectional pyramid and multi-geometric constraints to handle viewpoint variation; third, the local matcher leverages local-holistic bipartite graph optimization to resolve disparity discontinuity and topological inconsistency. Moreover, we develop polygon-matching-guided pose estimation, which uses the correspondences to obtain well-distributed, low-redundancy homologous points, and we pioneer the polygon cloud concept with an optimal surface generation method, producing structurally complete and semantically rich 3D representations beyond point and line clouds. Since no polygon matching methods for stereo imagery are available for direct comparison, we selected the state-of-the-art (SoTA) methods closest to this task as baselines. Extensive experiments on five challenging datasets (ISPRS, KITTI, ScanNet, SceneFlow, DTU) show that Z(PM)2 achieves a 68.60% matching area score, outperforming MESA by approximately 32% and ranking first in area-level pose estimation, with competitive speed and strong zero-shot generalization without any training requirement.
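To make polygon-level correspondence concrete, here is a minimal sketch of bidirectional (mutual nearest-neighbour) matching between two sets of polygons. It is a deliberately simplified stand-in, not the Z(PM)2 pipeline: there is no pyramid, no learned features, and no bipartite graph optimization. The descriptor (vertex mean plus square root of the shoelace area) and all function names are assumptions chosen for illustration.

```python
def shoelace_area(poly):
    """Area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def descriptor(poly):
    """Hypothetical descriptor: vertex mean (x, y) and sqrt of the area."""
    cx = sum(p[0] for p in poly) / len(poly)
    cy = sum(p[1] for p in poly) / len(poly)
    return (cx, cy, shoelace_area(poly) ** 0.5)

def cost(a, b):
    """Euclidean distance between two descriptors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def mutual_matches(left, right):
    """Keep pair (i, j) only if j is i's best match and i is j's best match."""
    dl = [descriptor(p) for p in left]
    dr = [descriptor(p) for p in right]
    best_lr = [min(range(len(dr)), key=lambda j: cost(d, dr[j])) for d in dl]
    best_rl = [min(range(len(dl)), key=lambda i: cost(d, dl[i])) for d in dr]
    return [(i, j) for i, j in enumerate(best_lr) if best_rl[j] == i]

# Left view: a square and a triangle. Right view: the same shapes shifted
# one unit to the left (a toy disparity) and listed in the opposite order.
left = [
    [(0, 0), (2, 0), (2, 2), (0, 2)],
    [(10, 0), (14, 0), (10, 3)],
]
right = [
    [(9, 0), (13, 0), (9, 3)],
    [(-1, 0), (1, 0), (1, 2), (-1, 2)],
]
matches = mutual_matches(left, right)
print(matches)
```

The mutual check discards one-sided matches, which is the simplest form of a bidirectional consistency test. A descriptor this crude would fail under the scale variation and disparity discontinuity the abstract mentions, which is the gap the paper's global and local matchers are meant to close.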

2510.09041 2026-06-08 cs.LG cs.AI Version update

Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

Robust Control for Autonomous Driving: An Intelligent General-Sum Constrained Adversarial Reinforcement Learning Method

Junchao Fan, Qi Wei, Ruichen Zhang, Yang Lu, Jianhua Wang, Xiaolin Chang, Bo Ai

Affiliations * Beijing Key Laboratory of Security and Privacy in Intelligent Transportation; Beijing Jiaotong University; College of Computing and Data Science; Nanyang Technological University; School of Computer Science and Technology; Taiyuan University of Technology; School of Electronics and Information Engineering

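AI summary: To address the vulnerability of deep reinforcement learning to adversarial attacks in autonomous driving, proposes Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), which trains a strategic targeted adversary and a robust driving agent interactively and improves policy stability under constrained optimization; experiments show a success rate at least 27.9% higher than existing methods.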

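Details
AI Chinese abstract (English translation)

Deep reinforcement learning (DRL) has achieved remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a key obstacle to real-world deployment. Although existing robust methods have had some success, they still face three key problems: (i) they are trained against myopic adversarial attacks, which limits their ability to handle more strategic threats; (ii) they struggle to trigger truly safety-critical events (e.g., collisions) and instead often lead to minor consequences; and (iii) lacking robust constraints, they can cause unstable learning and policy drift during training. To address these problems, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving method comprising a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to exploit the temporal decision-making capability of DRL to carry out strategically coordinated multi-step attacks. It also focuses explicitly on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns through interaction with the adversary to develop an autonomous driving policy that is robust to adversarial attacks. To ensure stable learning in adversarial environments and mitigate attack-induced policy drift, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over existing state-of-the-art methods, demonstrating superior robustness to adversarial attacks and improving the safety and reliability of DRL-based autonomous driving.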

English abstract

Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved some success, they still suffer from three key issues: (i) they are trained against myopic adversarial attacks, which limits their ability to respond to more strategic threats; (ii) their attacks struggle to cause truly safety-critical events (e.g., collisions) and instead often produce only minor consequences; and (iii) they can introduce learning instability and policy drift during training because they lack robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.
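The abstract does not say how the constrained formulation is solved. One common way to handle such a constraint is a Lagrangian relaxation with dual ascent, sketched below on a one-dimensional toy problem. Everything here is an assumption for illustration: the scalar "policy" theta, the quadratic stand-ins for reward and policy drift, the budget, and the step sizes are hypothetical and are not taken from IGCARL.

```python
def reward(theta):
    """Toy task objective, maximal at theta = 3 (stand-in for driving return)."""
    return -(theta - 3.0) ** 2

def drift(theta):
    """Toy drift measure: squared distance from a reference policy at 0."""
    return theta ** 2

def solve(budget=1.0, lr_theta=0.05, lr_lam=0.05, steps=2000):
    """Maximize reward(theta) subject to drift(theta) <= budget."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on the Lagrangian
        # L(theta, lam) = reward(theta) - lam * (drift(theta) - budget).
        grad_theta = -2.0 * (theta - 3.0) - lam * 2.0 * theta
        theta += lr_theta * grad_theta
        # Dual step: raise lam while the constraint is violated and
        # lower it otherwise, never letting it go below zero.
        lam = max(0.0, lam + lr_lam * (drift(theta) - budget))
    return theta, lam

theta_star, lam_star = solve()
print(round(theta_star, 4), round(lam_star, 4))
```

Without the constraint the optimum would be theta = 3. With a drift budget of 1 the iterates settle at the constraint boundary theta = 1, and the multiplier settles at the value (2 in this toy problem) that balances the reward gradient against the drift gradient. The multiplier acts as an adaptive penalty weight, which is the usual motivation for a constrained formulation over a fixed penalty.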