arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10461 2026-06-10 cs.LG cs.AI cs.CL new submission

ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

ERAlign: 文本属性图上GNN与LLM的基于能量的表示对齐

Xianlin Zeng, Fan Xia, Xiangyu Chen

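AI summary: Proposes the ERAlign framework, which uses energy-based models to align GNN and LLM representations and achieves distribution consistency by optimizing Energy Discrepancy, reaching state-of-the-art performance on eight datasets.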

Comments
Accepted to ICML 2026
AI中文摘要

文本属性图(TAGs)将文本节点属性与图结构相结合,以描述丰富的关联语义。最近整合图神经网络(GNNs)和大语言模型(LLMs)的努力在TAGs学习上显示出前景,但实现良好对齐的表示仍然具有挑战性。先前的研究主要依赖于执行粗粒度匹配的启发式方法。它们缺乏足够的约束,忽略了分布对齐,导致表示漂移和泛化能力有限。基于能量模型(EBMs),我们提出了一种基于能量的表示对齐(ERAlign)框架,该框架将GNN编码的图结构和LLM导出的文本嵌入投影到共享潜在空间,以实现分布一致性。具体来说,层间对齐通过距离度量量化,并通过EBM目标进行优化。通过降低能量值,我们的框架为下游任务产生良好对齐的表示。在训练过程中,我们引入能量差异(ED)以避免与难以处理的归一化相关的高采样成本。ED还具有更高的训练效率和减少能量景观失真的理论保证。在八个TAG数据集上的实证评估表明,ERAlign在不同监督水平和跨任务迁移场景下均获得了最先进的性能。

英文摘要

Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation drift and limited generalization. Building on Energy-based Models (EBMs), we propose an Energy-based Representation Alignment (ERAlign) framework that projects GNN-encoded graph structure and LLM-derived text embeddings in a shared latent space to achieve distribution consistency. Concretely, layer-wise alignment is quantified by a distance metric and optimized via an EBM objective. By decreasing energy values, our framework yields well-aligned representations for downstream tasks. During training, we introduce Energy Discrepancy (ED) to avoid high sampling costs associated with intractable normalization. ED also carries theoretical guarantees of higher training efficiency and reduced energy landscape distortion. Empirical evaluations on eight TAG datasets demonstrate that ERAlign obtains state-of-the-art performance across varying levels of supervision and cross-task transfer scenarios.

2606.10459 2026-06-10 cs.SI cs.CL new submission

Leveraging Social Media Data for COVID-19 Studies

利用社交媒体数据进行COVID-19研究

Nur Hafieza Ismail, Nur Shazwani Kamarudin, Nurol Husna Che Rose

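AI summary: Examines the role of social media during the COVID-19 pandemic, categorizes the data used, introduces machine learning, feature engineering, natural language processing, and survey methods, and points out directions for future research.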

Comments
8 pages, 1 figure
AI中文摘要

如今,社交网络已成为广泛偏好的信息来源。特别是在2019冠状病毒病(COVID-19)大流行期间,社交媒体已成为获取与COVID-19相关最新新闻和信息的最常用平台之一。社交媒体之所以受欢迎,是因为它们为注册用户提供免费访问,并允许他们发布、传播信息以及回复他人的帖子。全球有近46亿社交媒体用户,因此这些平台上共享的大量信息可能影响人们如何看待和应对当前面临的大流行,这并不令人惊讶。通过合理使用,社交媒体可以成为传播可靠新闻和提高患者、临床医生及社会公众意识的有益数字工具。具体而言,本章描述了用户披露中表达的语言、视觉和情感指标。因此,本章详细探讨和讨论了COVID-19大流行期间社交媒体平台使用的相关研究。本章还对所使用的社交媒体数据进行了分类,介绍了不同的部署机器学习、特征工程、自然语言处理和调查方法,并概述了未来研究的方向。

英文摘要

Social media networks have become widely preferred sources of information. During the coronavirus disease 2019 (COVID-19) pandemic in particular, social media has been one of the most used platforms for getting the latest news and information related to COVID-19. Social media platforms are popular because they offer free access to registered users and allow them to post, disseminate information, and respond to others' posts. With almost 4.6 billion social media users worldwide, it is not surprising that the large amount of information shared through these platforms could affect how people perceive and cope with the pandemic that we are currently facing. Used appropriately, social media can be a beneficial digital tool for spreading reliable news and raising public awareness among patients, clinicians, and society. This chapter explores and discusses in detail the related studies of social media platform usage during the COVID-19 pandemic, and describes the linguistic, visual, and emotional indicators expressed in user disclosures. It also categorizes the social media data used, introduces the machine learning, feature engineering, natural language processing, and survey methods deployed, and outlines directions for future research.

2606.10450 2026-06-10 cs.CV cs.LG new submission

Few-step Generative Models as Lossy Compression

少步生成模型作为有损压缩

Fuma Kimishima, Jinjia Zhou

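AI summary: Studies few-step generative models (Rectified Flow, CTM, MeanFlow) as lossy codecs within the reverse channel coding framework; parameterization equivalences and local Gaussian approximations enable encoding and decoding without retraining, reducing codec time and improving realism at low bit rates on low-resolution benchmarks.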

AI中文摘要

DiffC 提供了一种重用预训练扩散模型进行有损压缩的原则性方法,但其编码和解码过程仍然缓慢,因为它们需要许多离散化的前向和反向步骤。我们研究少步生成模型——Rectified Flow、一致性轨迹模型(CTM)和 MeanFlow——是否可以在相同的反向信道编码(RCC)框架中作为编解码器使用。主要挑战在于 RCC 需要后验和共享分布参数,而这些模型并未显式参数化中间条件分布。对于 Rectified Flow 和 MeanFlow,我们利用速度参数化与扩散式去噪参数化之间的等价性来推导 RCC 所需的量。对于从 EDM 蒸馏得到的 CTM,我们采用 EDM 噪声参数化以及中间状态下发送方和共享分布的局部高斯近似。这产生了一个概念验证的概率公式,使得无需重新训练即可使用预训练的少步生成模型进行压缩。在低分辨率基准上,由此产生的编解码器减少了编码和解码时间,并在低比特率范围内提高了真实性。

英文摘要

DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. We study whether few-step generative models -- Rectified Flow, Consistency Trajectory Models (CTM), and MeanFlow -- can be cast as codecs within the same reverse channel coding (RCC) framework. The main challenge is that RCC requires posterior and shared distribution parameters, whereas these models do not explicitly parameterize intermediate conditional distributions. For Rectified Flow and MeanFlow, we use the equivalence between velocity parameterization and diffusion-style denoising parameterization to derive the quantities required by RCC. For CTM, which is distilled from EDM, we adopt the EDM noise parameterization together with local Gaussian approximations of the sender and shared distributions at intermediate states. This yields a proof-of-concept probabilistic formulation that enables compression with pre-trained few-step generative models without retraining. On low-resolution benchmarks, the resulting codecs reduce encoding and decoding time and improve realism in the low-bit-rate regime.
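
Editor's sketch: the equivalence between velocity and denoising parameterizations mentioned in this abstract can be illustrated in a few lines. The sketch assumes the common rectified-flow convention x_t = (1 - t) x_0 + t ε with velocity v = ε - x_0; the abstract does not say which convention the paper uses, and all function names here are hypothetical.

```python
def interpolate(x0, eps, t):
    """Noisy state x_t on the straight path from data x0 (t=0) to noise eps (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def velocity(x0, eps):
    """Velocity along that path: d x_t / dt = eps - x0 (assumed convention)."""
    return [b - a for a, b in zip(x0, eps)]


def denoised_from_velocity(xt, v, t):
    """Clean-sample estimate implied by a velocity: x_t - t * v = x0."""
    return [x - t * u for x, u in zip(xt, v)]


def noise_from_velocity(xt, v, t):
    """Noise estimate implied by a velocity: x_t + (1 - t) * v = eps."""
    return [x + (1.0 - t) * u for x, u in zip(xt, v)]
```

Under this convention a velocity prediction at any t can be converted into the denoised-sample and noise estimates that a diffusion-style codec works with, which is the kind of conversion the abstract alludes to.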

2606.10439 2026-06-10 cs.SD cs.CL eess.AS new submission

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

利用混合专家和动态下采样增强基于多语言大模型的语音识别

Guodong Lin, Ziqi Chen, Yuxiang Fu, Ke Li, Wei-Qiang Zhang

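AI summary: Proposes a projector-based LLM-ASR framework that uses a Mixture of Experts architecture to improve cross-lingual adaptability and a Continuous Integrate-and-Fire mechanism for dynamic downsampling and modality alignment; experiments show it substantially surpasses strong baselines.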

Journal ref
ICASSP (2026),18807-18811
Comments
Accepted by ICASSP 2026
AI中文摘要

大语言模型的快速发展为自动语音识别开辟了新前沿,使其有效集成成为一个关键且具有挑战性的研究方向。为此,本文提出了一种基于投影器的LLM-ASR框架,针对多语言泛化和模态对齐的关键挑战。我们的方法结合了混合专家架构以改善跨语言适应性,以及连续整合-触发机制用于动态下采样和模态对齐。实验结果表明,这些组件的组合带来了显著的性能提升,超越了强基线模型。所提出的方法朝着构建更准确、更鲁棒、更泛化的基于LLM的ASR系统迈出了一步。

英文摘要

The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
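
Editor's sketch: the Continuous Integrate-and-Fire idea behind dynamic downsampling can be shown with a minimal pure-Python version. It assumes per-frame weights in [0, 1] and a firing threshold of 1.0; how the paper predicts the weights and feeds the result to the LLM is not described in the abstract, and `cif_downsample` is a hypothetical name.

```python
def cif_downsample(frames, weights, threshold=1.0):
    """Accumulate weighted frames and emit one vector each time the weight sum hits the threshold.

    frames: list of equal-length float vectors (e.g. encoder outputs).
    weights: one value in [0, 1] per frame, so at most one firing per frame.
    """
    dim = len(frames[0])
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    fired = []
    for frame, w in zip(frames, weights):
        if acc_weight + w < threshold:
            # Keep integrating: no boundary reached yet.
            acc_weight += w
            acc_vec = [a + w * f for a, f in zip(acc_vec, frame)]
        else:
            # Fire: spend just enough of this frame's weight to reach the threshold.
            used = threshold - acc_weight
            fired.append([a + used * f for a, f in zip(acc_vec, frame)])
            # The leftover weight starts the next integration window.
            rest = w - used
            acc_weight = rest
            acc_vec = [rest * f for f in frame]
    return fired
```

The number of emitted vectors tracks the sum of the weights rather than the number of input frames, so the output length adapts to the content of the utterance.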

2606.10412 2026-06-10 cs.AI new submission

A Unified Multi-Modal Framework for Intelligent Financial Systems: Integrating Reinforcement Learning, High-Frequency Trading, and Game-Theoretic Approaches with Cross-Modal Sentiment Analysis

面向智能金融系统的统一多模态框架:整合强化学习、高频交易和博弈论方法与跨模态情感分析

Fanrong Liu, Zhang Yuwei, Mingni Luo

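AI summary: Proposes a unified framework integrating PPO, high-frequency prediction, in-context learning, game theory, and cross-modal sentiment analysis, improving performance by more than 20% on average across multiple financial tasks.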

AI中文摘要

金融科技的快速发展要求能够同时处理多领域多样化挑战的复杂人工智能系统。本文提出了一个开创性的统一框架,无缝整合了用于机器人顾问系统的近端策略优化、用于高频交易的先进时间序列预测模型、用于动态投资顾问的上下文学习机制、用于竞争性银行场景的博弈论方法以及用于跨模态金融情感分析的统一嵌入。我们的综合框架解决了现有文献中这些技术孤立发展、未能利用其协同潜力的关键空白。通过在多个金融数据集和现实场景中的广泛实验,我们证明了集成方法相比专门的单领域系统实现了更优的性能。具体而言,我们的框架在投资组合优化指标上提升了23.7%,将高频交易的预测误差降低了31.2%,将投资推荐准确率提高了18.9%,通过纳什均衡收敛速度增加27.4%优化了竞争性银行策略,并通过跨模态融合将情感分析准确率提高了15.6%。我们的工作理论基础为集成优化问题建立了收敛保证,而实证结果验证了其在多样化金融机构中的实际适用性。这项研究不仅推进了金融AI的最新水平,还为开发能够适应现代金融市场复杂互联本质的综合智能系统提供了蓝图。

英文摘要

The rapid evolution of financial technology demands sophisticated artificial intelligence systems capable of handling diverse challenges across multiple domains simultaneously. This paper presents a groundbreaking unified framework that seamlessly integrates Proximal Policy Optimization for robo-advisory systems, advanced time-series prediction models for high-frequency trading, in-context learning mechanisms for dynamic investment advisory, game-theoretic approaches for competitive banking scenarios, and unified embeddings for cross-modal financial sentiment analysis. Our comprehensive framework addresses the critical gap in existing literature where these technologies have been developed in isolation, failing to leverage their synergistic potential. Through extensive experimentation across multiple financial datasets and real-world scenarios, we demonstrate that our integrated approach achieves superior performance compared to specialized single-domain systems. Specifically, our framework shows a 23.7% improvement in portfolio optimization metrics, reduces prediction error in high-frequency trading by 31.2%, enhances investment recommendation accuracy by 18.9%, optimizes competitive banking strategies with a 27.4% increase in Nash equilibrium convergence speed, and improves sentiment analysis accuracy by 15.6% through cross-modal fusion. The theoretical foundation of our work establishes convergence guarantees for the integrated optimization problem, while our empirical results validate the practical applicability across diverse financial institutions. This research not only advances the state-of-the-art in financial AI but also provides a blueprint for developing comprehensive intelligent systems that can adapt to the complex, interconnected nature of modern financial markets.

2606.10410 2026-06-10 cs.LG eess.SP q-bio.QM new submission

A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection

生理信号中的综合推理时增强框架:应用于基于PPG的房颤检测

Davood Fattahi, Runze Yan, Saurabh Kataria, Zhaoliang Chen, Xiao Hu

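AI summary: Proposes a unified inference-time augmentation framework with 13 augmentation methods and hyperparameters tuned by Bayesian optimization; on PPG-based AF detection it markedly improves AUROC and AUPRC and lowers the false positive rate.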

Comments
22 pages, 11 figures, 4 tables. Under review at Physiological Measurement
AI中文摘要

目标:在真实部署中,生理信号的准确分类面临传感器噪声、运动伪影以及训练数据与部署数据之间分布偏移的挑战。推理时增强(ITA)在推理过程中应用增强而非重新训练,提供了一种简单、模型无关的机制来提高鲁棒性。然而,ITA在生理信号中的应用范围仍然狭窄,依赖于有限的增强方法和固定的未优化参数。本文提出一个统一的ITA框架以解决这一差距。方法:该框架包含13种增强方法,涵盖时域、幅值域、频域和伪影注入变换,并通过贝叶斯优化优化超参数。我们使用GPT-PPG和ResNet在五个数据集(包含400多名患者和约9,800小时记录)上评估基于30秒PPG信号的房颤(AF)检测。主要结果:标准ITA持续改善了AUROC(GPT-PPG最高提升8.5%,ResNet最高提升0.7%)和AUPRC(GPT-PPG最高提升10.6%,ResNet最高提升0.8%)。选择性ITA进一步将非AF数据集上的平均FPR降低了高达4.4%(GPT-PPG)和1.3%(ResNet)。意义:这些发现确立了ITA作为一种实用的、模型无关的方法,用于在无法重新训练的部署环境中提高基于PPG的房颤分类可靠性,并具有更广泛的生理信号分析适用性。

英文摘要

Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and ${\sim}$9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
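
Editor's sketch: inference-time augmentation in its simplest form runs a frozen model on several transformed copies of the input and aggregates the outputs. The version below assumes plain averaging and two toy augmentations; the paper's 13 methods, its aggregation rule, and its selective variant are not specified in the abstract, and all names are hypothetical.

```python
def time_shift(signal, k):
    """Circularly shift the samples by k positions (a time-domain augmentation)."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def amplitude_scale(signal, factor):
    """Multiply every sample by a constant (an amplitude-domain augmentation)."""
    return [factor * x for x in signal]


def ita_predict(model, signal, augmentations):
    """Average the model output over the original signal and its augmented views.

    model: any callable mapping a signal to a probability; it is never retrained.
    augmentations: list of callables, each mapping a signal to a new signal.
    """
    views = [signal] + [aug(signal) for aug in augmentations]
    probs = [model(view) for view in views]
    return sum(probs) / len(probs)
```

A call such as `ita_predict(model, ppg_window, [lambda s: time_shift(s, 5), lambda s: amplitude_scale(s, 0.9)])` leaves the model untouched, which is why the approach suits deployments where retraining is not feasible.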

2606.10393 2026-06-10 cs.LG cs.CE new submission

Validation-Stage Combinatorial Fusion Analysis for Imbalanced Credit-Card Fraud Detection

面向不平衡信用卡欺诈检测的验证阶段组合融合分析

Xiao Han, Chenyu Wu

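AI summary: To address class imbalance in credit-card fraud detection, proposes using Combinatorial Fusion Analysis (CFA) at the validation stage to select complementary model subsets and assign diversity-aware weights, reaching an AUC-ROC of 0.9405 on the IEEE-CIS benchmark.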

AI中文摘要

信用卡欺诈检测因欺诈交易稀少、成本高且分布不均而困难。强梯度提升树模型在结构化交易数据上已表现良好,因此另一种融合方法的价值并不明显。本文研究组合融合分析(CFA)——通过搜索模型子集和排名得分融合规则——是否能在IEEE-CIS欺诈检测基准上增加价值。使用无泄漏的60/20/20训练/验证/测试协议,我们评估了由七个基分类器构建的480种融合配置。最佳测试集结果来自随机森林、XGBoost和LightGBM的多样性加权得分融合(DEF WtScore),AUC-ROC = 0.9405,AUPRC = 0.6699,F1 = 0.6373。来自1000次重抽样的Bootstrap置信区间显示,对于所有三个指标,相对于最强单一模型的增益均排除零。CFA在AUC-ROC上与软投票持平,提高了AUPRC和F1,并在该设置下优于堆叠。CTGAN增强实验给出了负面结果:合成欺诈样本降低了单个模型和CFA的性能。总体而言,CFA在此处最有用的不是作为组合所有分类器的方法,而是作为验证阶段的方法,用于选择小的、互补的子集并分配多样性感知的权重。

英文摘要

Credit-card fraud detection is difficult because fraudulent transactions are rare, costly, and unevenly distributed. Strong gradient-boosted tree models already perform well on structured transaction data, so the value of another fusion method is not obvious. This paper examines whether Combinatorial Fusion Analysis (CFA), which searches over model subsets and rank-score fusion rules, can still add value on the IEEE-CIS Fraud Detection benchmark. Using a leakage-free 60/20/20 train/validation/test protocol, we evaluate 480 fusion configurations built from seven base classifiers. The best test-set result comes from diversity-weighted score fusion of Random Forest, XGBoost, and LightGBM (DEF WtScore), with AUC-ROC = 0.9405, AUPRC = 0.6699, and F1 = 0.6373. Bootstrap confidence intervals from 1,000 resamples show that the gains over the strongest single model exclude zero for all three metrics. CFA matches soft voting on AUC-ROC, improves AUPRC and F1, and outperforms stacking in this setting. A CTGAN augmentation experiment gives a negative result: synthetic fraud samples degrade both individual models and CFA. Overall, CFA is most useful here not as a way to combine every classifier, but as a validation-stage method for choosing a small, complementary subset and assigning diversity-aware weights.
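
Editor's sketch: the two fusion families the abstract refers to, score fusion and rank fusion, can be illustrated as follows. The weights are passed in by the caller because the paper's diversity-based weighting formula is not given in the abstract; min-max normalization and the function names are assumptions.

```python
def min_max(scores):
    """Rescale one model's scores to [0, 1] so that models are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ranks(scores):
    """Rank 1 is the highest (most suspicious) score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    out = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        out[i] = position
    return out


def weighted_score_fusion(score_lists, weights):
    """Weighted average of normalized scores across models (higher means more suspicious)."""
    normalized = [min_max(s) for s in score_lists]
    total = sum(weights)
    n = len(score_lists[0])
    return [sum(w * col[i] for w, col in zip(weights, normalized)) / total for i in range(n)]


def rank_fusion(score_lists):
    """Average rank across models (lower means more suspicious)."""
    all_ranks = [ranks(s) for s in score_lists]
    n = len(score_lists[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]
```

In a validation-stage search one would evaluate such fusions for many model subsets on the validation split and carry only the selected configuration to the test split.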

2606.10392 2026-06-10 cs.AI new submission

Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

使用LoRA和NEFTune对DeepSeek-R1-8B模型进行指令微调

Wu Yuerong, Mingni Luo

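AI summary: Fine-tunes DeepSeek-R1-8B with LoRA and NEFTune for financial named-entity recognition, reaching a micro-F1 of 0.912 on seven entity types and outperforming several baseline models.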

AI中文摘要

金融命名实体识别(NER)对于将非结构化的财务报告和新闻转化为结构化知识图谱至关重要。然而,通用大语言模型(LLMs)常常错误分类金融实体或忽略领域特定模式。本文研究了使用DeepSeek-R1-8B(一个最近开源的大语言模型)结合低秩适应(LoRA)和噪声嵌入微调(NEFTune)进行金融NER。我们语料库中的1693个样本中每个带注释的句子都被转换为指令-输入-输出三元组。我们将轻量级LoRA矩阵插入Transformer层,并应用NEFTune通过在训练期间向嵌入向量添加均匀噪声来提高泛化能力。实验表明,LoRA适应的DeepSeek-R1-8B在七种实体类型(公司、日期、地点、货币、人物、产品和数量)上达到了0.901的微F1分数,而添加NEFTune进一步将微F1分数提升至0.912,优于Llama3-8B、Qwen3-8B、Baichuan2-7B、T5和BERT-Base基线。

英文摘要

Financial named-entity recognition (NER) is essential for translating unstructured financial reports and news into structured knowledge graphs. However, general-purpose large language models (LLMs) often misclassify financial entities or ignore domain-specific patterns. This paper investigates the use of DeepSeek-R1-8B, a recent open-source large language model, combined with Low-Rank Adaptation (LoRA) and Noisy Embedding Fine-Tuning (NEFTune) for financial NER. Each annotated sentence in our corpus of 1693 samples is converted into an instruction-input-output triple. We insert lightweight LoRA matrices into the Transformer layers and apply NEFTune to improve generalisation by adding uniform noise to embedding vectors during training. Experiments show that the LoRA-adapted DeepSeek-R1-8B achieves a micro-F1 of 0.901 on seven entity types (Company, Date, Location, Money, Person, Product and Quantity), and adding NEFTune further boosts the micro-F1 to 0.912, outperforming Llama3-8B, Qwen3-8B, Baichuan2-7B, T5 and BERT-Base baselines.
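
Editor's sketch: the two techniques named in this abstract reduce to small formulas. NEFTune adds uniform noise to the token embeddings during training, scaled by alpha / sqrt(L * d) for sequence length L and embedding width d; LoRA adds a low-rank update (lora_alpha / r) * B A to a frozen weight W. The pure-Python helpers below are hypothetical illustrations, and the hyperparameters used in the paper are not stated in the abstract.

```python
import math
import random


def neftune_noise(embeddings, alpha, rng):
    """Add Uniform(-1, 1) noise scaled by alpha / sqrt(seq_len * dim) to each embedding entry."""
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha):
    """Frozen projection W x plus the scaled low-rank update (lora_alpha / r) * B (A x).

    A has shape (r, d_in) and B has shape (d_out, r); only A and B would be trained.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (lora_alpha / r) * u for b, u in zip(base, update)]
```

With B initialized to zeros the adapted layer starts out identical to the frozen one, and the noise is applied only during training, not at inference.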

2606.10382 2026-06-10 cs.RO new submission

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

UMI-Bench 1.0:基于UMI数据的桌面机器人操作开放可复现真实世界基准

Shi Jin, Yuntian Wang, Yuhui Duan, Di Wu, Gaoqi Dong, Xiaohang Liu, Xiaotong Li, Hongfei Jia, Zehao Zhang, Tianyu Wang, Zhongjie Jia, Yuanqi Yao, Chenjia Bai, Zhaxizhuoma, Siao Liu, Nieqing Cao, Jin Wang, Chao Yu, Yan Ding

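AI summary: Presents UMI-Bench 1.0, described as the first real-robot benchmark designed specifically for UMI-style manipulation policies; a unified protocol covers data collection, scene reset, policy execution, result logging, and task-factor analysis, providing a reproducible evaluation platform.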

AI中文摘要

真实机器人评估对于理解学习到的操作策略能否在精心策划的演示之外可靠运行至关重要。这一需求对于通用操作接口(UMI)风格策略尤为迫切,其性能取决于腕部视角观测、动作表示、数据收集和物理部署之间的耦合。现有的真实世界基准已取得重要进展,但它们并非围绕这种UMI数据到部署的设置而设计。我们提出UMI-Bench 1.0,一个本地优先的真实机器人基准,用于标准化评估UMI风格的操作策略。据我们所知,这是首个专门用于基于UMI的操作模型真实世界评估的基准。UMI-Bench将数据收集、场景重置、策略执行、结果记录和任务因素分析统一在一个协议中。通过使整个评估过程可复现和可审计,UMI-Bench为衡量UMI训练策略如何泛化到真实物理操作提供了一个实用的测试平台。

英文摘要

Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

2606.10378 2026-06-10 cs.CV new submission

FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation

FSS-Net:用于颈动脉超声分割的频率-空间协同网络与小波注意力

Jiawei Liu, Zhijiang Wan, Junhua Hu, Rongli Zhang, Zhongbiao Xu, Yankun Cao, Yuan Chen, Jin Hong

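AI summary: Proposes the Frequency-Spatial Synergy Network (FSS-Net), which integrates wavelet transform, multi-domain attention, and edge enhancement, reaching a Dice score of 96.46% on carotid ultrasound datasets and effectively segmenting carotid arteries and identifying plaque.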

AI中文摘要

超声成像中颈动脉的精确分割对于中风风险评估至关重要。然而,散斑噪声、低对比度和模糊边界仍然是主要挑战。在本文中,我们提出了一种频率-空间协同网络(FSS-Net),以实现噪声鲁棒且高精度的颈动脉分割。该网络将小波变换、多域注意力和边缘增强集成到一个统一的编码器-解码器架构中。具体来说,设计了一个通道-空间-小波注意力(CSWA)模块,以抑制频率域中的噪声并净化语义特征。引入了一个小波增强瓶颈(WEB)模块,以高效捕获长距离全局依赖关系。此外,一个拉普拉斯引导的自适应边缘融合(LAEF)模块补偿高频细节并保持边界连续性。在颈动脉超声数据集上的大量实验表明,FSS-Net在低信噪比条件下达到了96.46%的Dice分数(DSC)和强鲁棒性,优于几种最先进的方法。该方法实现了超声成像中颈动脉的精确分割,有效识别颈动脉粥样硬化斑块,并通过其他任务(如乳腺癌分割)验证,表明其在超声图像中识别异常组织肿块具有良好的临床应用潜力。

英文摘要

Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates for high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low SNR conditions, outperforming several state-of-the-art methods. The method achieves accurate segmentation of the carotid artery in ultrasound imaging and effectively identifies carotid atherosclerotic plaque. It is further validated on other tasks, such as breast cancer segmentation, suggesting good clinical application potential for identifying abnormal tissue masses in ultrasound images.

2606.10372 2026-06-10 cs.CV new submission

ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment

ClinReadNet: 一种受临床阅读启发的低剂量腹部CT图像质量评估网络

Xianye Xiao, Yulong Zou, Yujie Luo, Taihui Yu, Cun-Jing Zheng, Yuan-ming Geng, Shuihua Wang, Yudong Zhang, Jin Hong

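AI summary: Proposes the ClinReadNet framework, which mimics radiologists' reading habits by combining a Sobel ordinal quality network with a (shifted) window multi-scale temperature multi-head self-attention module and a hierarchical ranked probability score loss, achieving SOTA performance on the LDCTIQAG2023 dataset.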

AI中文摘要

在腹部CT成像中,开发一种模拟医生阅读习惯的低剂量无参考图像质量评估(No-reference IQA)模型具有重要的实际价值。本文提出了一种新颖的基于深度学习的框架ClinReadNet,其设计与放射科医生的临床阅读逻辑一致:首先,引入Sobel序数质量网络(SOQN)模块,该模块能同时关注与图像质量高度相关的边缘细节和整个图像的质量分布模式,准确匹配“兼顾局部细节与整体上下文”的临床阅片判断习惯;其次,该框架集成了(移位)窗口多尺度温度多头自注意力((S)W-MTMSA)模块,进一步复制了放射科医生从整体扫描到局部聚焦的阅片过程,并通过多锐度注意力精确锁定感兴趣区域;第三,设计了分层排序概率分数(HRPS)损失函数,该函数结合了粗分类和细分类的双重逻辑,同时关注分级标签之间的距离信息,有效提升了图像质量评估的性能。在LDCTIQAG2023数据集上进行的实验表明,所提方法达到了当前最先进(SOTA)性能:皮尔逊线性相关系数(PLCC)、斯皮尔曼秩相关系数(SROCC)和肯德尔秩相关系数(KROCC)的值分别达到0.9507、0.9554和0.8629,其绝对值之和(Score)为2.7690,优于现有方法。

英文摘要

In abdominal CT imaging, developing a low-dose, no-reference image quality assessment (No-reference IQA) model that mimics doctors' reading habits for evaluating CT image quality has significant practical value. This paper proposes a novel deep learning-based framework, ClinReadNet, whose design aligns with the clinical reading logic of radiologists: first, it introduces the Sobel ordinal quality network (SOQN) module, which can simultaneously focus on edge details highly relevant to image quality and the quality distribution pattern of the entire image, accurately matching the clinical image-reading judgment habit of "considering both local details and overall context"; second, the framework integrates the (shifted) window multi-scale temperature multi-head self-attention ((S)W-MTMSA) module, which further replicates the radiologists' image-reading process of shifting from overall scanning to local focusing, and accurately locks in regions of interest through multi-sharpness attention; third, it designs the hierarchical ranked probability score (HRPS) loss function, which combines the dual logics of coarse classification and fine classification, while paying attention to the distance information between grading labels, effectively improving the performance of image quality assessment. Experiments conducted on the LDCTIQAG2023 dataset show that the proposed method achieves the current state-of-the-art (SOTA) performance: the values of Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC) reach 0.9507, 0.9554, and 0.8629 respectively, with the sum of their absolute values (Score) being 2.7690, outperforming existing methods.
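
Editor's sketch: the HRPS loss builds on the ranked probability score, which penalizes ordinal predictions according to how far the predicted mass lies from the true grade. The function below shows only the standard, non-hierarchical RPS; the paper's coarse-to-fine hierarchy is not specified in the abstract, and the function name is hypothetical.

```python
def ranked_probability_score(probs, true_index):
    """Sum of squared differences between predicted and true cumulative distributions.

    probs: predicted probability for each ordered quality grade (sums to 1).
    true_index: position of the correct grade in that ordering.
    """
    cum_pred = 0.0
    total = 0.0
    for k, p in enumerate(probs):
        cum_pred += p
        cum_true = 1.0 if k >= true_index else 0.0
        total += (cum_pred - cum_true) ** 2
    return total
```

Unlike cross-entropy, this score is smaller when a wrong prediction lands on a neighboring grade than on a distant one, which matches the abstract's remark about using the distance between grading labels. The reported aggregate Score is simply the sum of the three correlation coefficients: 0.9507 + 0.9554 + 0.8629 = 2.7690.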

2606.10369 2026-06-10 cs.CL cs.LG new submission

PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

PADD: 面向非路由器教师指导MoE学生学习的路径对齐解压缩蒸馏

Xinyue Peng, Yi Qian, Jiaojiao Lin, Wenjian Shao, Yanming Liu

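AI summary: Proposes the Path-Aligned Decompression Distillation (PADD) framework, which distills knowledge from a dense teacher into a mixture-of-experts (MoE) student through four stages organized in two phases while learning a high-quality routing policy, substantially outperforming baselines on mathematical reasoning tasks.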

Comments
published in ICML 2026
AI中文摘要

随着大型语言模型(LLMs)持续扩展,在固定计算预算下增长模型容量变得越来越具有挑战性。我们提出路径对齐解压缩蒸馏(PADD),这是一个将知识从无显式路由的密集教师蒸馏到混合专家(MoE)学生中,同时学习高质量路由策略的框架。PADD将知识蒸馏组织为两个阶段的四个阶段:初始化阶段(阶段I)通过教师神经元聚类和学生专家预热在学生专家中构建多样功能,以及训练阶段(阶段II–IV)将在线自适应蒸馏、路径细化策略优化和奖励增强负载平衡集成在单一训练流程中。在数学推理基准上的实验表明,在相同推理成本下,PADD相比强基线取得了显著提升,且MoE学生能够匹配或超越其密集教师。实验还展示了有效的教师到学生知识蒸馏和稳定的路由行为。

英文摘要

As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II--IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

2606.10364 2026-06-10 cs.CV new submission

Benchmarking stereo reconstruction for 3D printable Martian terrain models

用于3D打印火星地形模型的立体重建基准测试

Josephine Wang

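AI summary: Given that Mars imagery is low-texture, irregular, and partially observed, evaluates a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports printable meshes; finds that benchmark accuracy does not transfer directly to Martian terrain reconstruction and that geometry completion trades local fidelity against global connectivity.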

Comments
9 pages, 7 figures, CVPR End-to-End 3D Workshop 2026
AI中文摘要

从火星车图像重建可打印的3D模型具有挑战性,因为火星地形纹理低、不规则且部分被观测。我们评估了一个流程,该流程从NASA好奇号图像估计立体深度,补全几何,并导出水密OBJ网格。在Middlebury数据集上,RAFT-Stereo优于半全局块匹配(SGBM),将视差MAE从3.22像素降低到0.73像素,并将有效预测覆盖率从76.3%提高到100.0%。然而,在好奇号图像上,RAFT更密集的视差显示出较弱的边缘对齐和更高的光度重投影误差,表明基准精度不能直接迁移到火星地形重建。几何补全展示了局部保真度与全局连通性之间的权衡。我们发现,alpha形状保留了准确但碎片化的结构,泊松重建产生更连贯的网格但增加了无支撑表面,而确定性扩散填充基线介于两者之间但对立体质量敏感。总体而言,标准立体和补全方法可以产生火星地形的可打印近似,但可靠的重建需要更强的领域特定验证。

英文摘要

Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT's denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.
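
Editor's sketch: the two Middlebury numbers quoted in this abstract, disparity MAE and valid prediction coverage, can be computed as below. The sketch assumes invalid pixels are marked `None` and that MAE is taken over valid predictions only; the paper's exact masking protocol is not given in the abstract, and the function name is hypothetical.

```python
def disparity_metrics(pred, truth):
    """Return (mean absolute error over valid pixels, fraction of pixels with a valid prediction).

    pred: predicted disparities in pixels, with None where the matcher gave no estimate.
    truth: ground-truth disparities in pixels, same length as pred.
    """
    errors = [abs(p - t) for p, t in zip(pred, truth) if p is not None]
    coverage = len(errors) / len(pred)
    mae = sum(errors) / len(errors) if errors else float("nan")
    return mae, coverage
```

Reporting both values together matters because a sparse matcher can look accurate simply by declining to predict on hard pixels, while a dense matcher is scored on everything.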

2606.10363 2026-06-10 cs.RO new submission

HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation

HiMem-WAM: 用于机器人操作的分层记忆门控世界动作模型

Xiaoquan Sun, Ruijian Zhang, Chen Cao, Yihan Sun, Jiahui Chen, Zetian Xu, Bo Chen, Haijier Chen, Zhen Yang, Jiarun Zhu, Yijun Hong, JingZhe Xu, Jingrui Pang, Mingqi Yuan, Jiayu Chen

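AI summary: Proposes HiMem-WAM, a hierarchical memory-gated world action model that uses a hierarchical latent action framework and boundary-triggered memory updates to improve task-relevant memory and robustness in long-horizon robotic manipulation.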

AI中文摘要

世界动作模型(WAM)已成为具身智能的一种新的强大范式,学习与动作相关的视觉动态,显著增强了泛化性和鲁棒性。然而,现有的WAM在长时域机器人操作中仍难以处理任务相关记忆。为了解决这个问题,我们提出了HiMem-WAM,一种分层记忆门控WAM,它集成了以运动为中心的潜在动作、高级技能潜在变量和边界触发的记忆更新。具体来说,我们开发了一个分层潜在动作框架,共同学习低级运动和高级技能潜在变量,提供结构化的时间抽象。同时,边界感知记忆门在预测的技能转换处写入紧凑的任务状态,无需在测试时生成未来视频或光流估计即可实现因果推理。在LIBERO、LIBERO-PLUS、RMBench和真实世界任务上的评估表明,HiMem-WAM的分层潜在变量提高了部署扰动下的鲁棒性,而记忆模块显著有益于依赖记忆的长时域操作。

英文摘要

World Action Models (WAMs) have emerged as a new powerful paradigm for embodied intelligence, learning action-relevant visual dynamics that significantly enhance generalization and robustness. However, existing WAMs still struggle with task-relevant memory in long-horizon robotic manipulation. To address this, we present HiMem-WAM, a Hierarchical Memory-Gated WAM that integrates motion-centric latent actions, high-level skill latents, and boundary-triggered memory updates. Specifically, we develop a hierarchical latent action framework that jointly learns low-level motion and high-level skill latents, providing structured temporal abstraction. Meanwhile, a boundary-aware memory gate writes compact task states at predicted skill transitions, enabling causal inference without test-time generation of future video or optical flow estimation. Evaluated on LIBERO, LIBERO-PLUS, RMBench and real-world tasks, HiMem-WAM shows that hierarchical latents improve robustness under deployment perturbations, and the memory module substantially benefits memory-dependent long-horizon manipulation.

2606.10359 2026-06-10 cs.AI new submission

ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

ReflectiChain: 面向供应链韧性的LLM驱动世界模型中的认知基础

Jia Luo

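AI summary: Proposes the ReflectiChain framework, which separates epistemic from aleatoric uncertainty through a generative supply chain world model and double-loop learning, improving Rationale Consistency Score by 33.0% on a semiconductor benchmark and maintaining 82.3% operability under adversarial shocks.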

AI中文摘要

供应链中的AI代理面临一个基本的认知鸿沟:大语言模型(LLMs)解释策略但缺乏物理基础,而强化学习(RL)优化流程但对非结构化约束语义上视而不见。我们引入REFLECTICHAIN,通过生成式供应链世界模型(SC-WM)——将异构供应网络编码到具有物理守恒的6维图-潜在空间中——以及双环学习(将认知不确定性(KL信任域约束的策略适应)与偶然不确定性(随机潜在展开)分离)来弥合这一鸿沟。在Semi-Sim(一个具有SIR风险传播、6种扰动类型和10种策略约束模板的10节点半导体基准)上,REFLECTICHAIN将推理一致性得分提高了33.0%(p < 0.0001, d = 2.78),在对抗性冲击下保持了82.3%的可操作性,并表现出反脆弱行为(在适度压力下增益+40.2%)。我们识别了三种操作性的认知机制——不确定性分离、知识边界检测和经验贝叶斯策略更新——并讨论了五个局限性类别。

英文摘要

AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce REFLECTICHAIN, bridging this gap through a Generative Supply Chain World Model (SC-WM) - encoding heterogeneous supply networks into a 6-dim graph-latent space with physical conservation - and Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, REFLECTICHAIN improves Rationale Consistency Score by 33.0% (p < 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and exhibits anti-fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms - uncertainty separation, knowledge-boundary detection, and empirical Bayesian policy updating - and discuss five limitation categories.

2606.10358 2026-06-10 cs.LG cs.AI new submission

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

KG-SoftMAP: 基于软知识图谱先验的稀疏离散数据贝叶斯网络结构学习

Guoliang Xu, James E. Corter

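AI summary: To address the difficulty of learning Bayesian network structure from sparse discrete data, proposes KG-SoftMAP, which encodes a weighted directed knowledge graph as a soft prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; it markedly improves structure recovery on synthetic benchmarks and is evaluated on prediction, calibration, and KG-consistency on real data.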

Comments
33 pages including appendices, 1 figure
AI中文摘要

从稀疏离散数据中学习贝叶斯网络(BN)结构是困难的:当每个实例仅记录少数变量时,大多数变量对缺乏可靠评分所需的联合观测,且纯数据方法恢复的结构很少。不完美的领域知识,可表示为加权有向知识图谱(KG),通常是可用的。我们提出KG-SoftMAP,它将这样的KG编码为软性的、置信度加权的、可被数据覆盖的边先验,并最大化结合BDeu评分与logit形式先验的MAP目标;KG可由专家整理或由LLM提取。在受控的合成基准(唯一具有真实DAG的设置)上,KG-SoftMAP在$\rho=0.05$时恢复部分有向结构(DF1从$0.14$到$0.29$,而基线接近零),当$\rho\geq0.2$时恢复更多(DF1从$0.46$到$0.96$),前提是配有一个信息丰富但不完美的KG;恢复性能随KG质量下降而优雅地退化。在无真实DAG的真实稀疏教育数据上,我们仅评估面向部署的指标:预测、校准和KG一致性。学习到的BN最好被解读为诊断模型:在SAF上,它落后于逻辑回归$0.03$的F1_FAIL,同时提供KG一致的边、校准的联合概率以及从任意观测概念子集的推理;当不存在有意义的KG时,判别式逻辑回归更可取。

英文摘要

Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. Imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a soft, confidence-weighted, data-overridable edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On controlled synthetic benchmarks, the only setting with ground-truth DAGs, KG-SoftMAP recovers partial directed structure at $\rho=0.05$ (DF1 $0.14$ to $0.29$, versus near-zero baselines) and substantially more once $\rho\geq0.2$ (DF1 $0.46$ to $0.96$), when paired with an informative but imperfect KG; recovery degrades gracefully as KG quality drops. On real sparse educational data, which has no ground-truth DAG, we evaluate deployment-facing measures only: prediction, calibration, and KG-consistency. The learned BN is best read as a diagnostic model: on SAF it trails logistic regression by $0.03$ F1_FAIL while providing KG-consistent edges, calibrated joint probabilities, and inference from arbitrary observed concept subsets; when no meaningful KG exists, discriminative logistic regression is preferable.
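
Editor's sketch: one plausible reading of a "logit-form" soft edge prior is that each edge included in the graph adds the log-odds of its KG confidence to the data score. This is an assumption made for illustration, not the paper's stated formula; `map_objective`, `strength`, and the neutral default confidence of 0.5 are all hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def map_objective(edges, data_score, kg_confidence, strength=1.0, default=0.5):
    """Data score (e.g. a BDeu log score) plus a soft log-prior over the chosen edges.

    edges: set of (parent, child) pairs in the candidate graph.
    kg_confidence: dict mapping (parent, child) to a confidence in (0, 1).
    Edges missing from the KG get the neutral confidence 0.5, whose logit is zero.
    """
    prior = sum(logit(kg_confidence.get(edge, default)) for edge in edges)
    return data_score + strength * prior
```

Because the prior is additive and finite, a sufficiently large gap in the data score can outweigh it, which is one way to obtain the data-overridable behavior the abstract describes.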

2606.10350 2026-06-10 cs.CV New submission

Multi-Angular Reflectance Anisotropy Observed from UAV Multispectral Imagery

Zhenqiang Qin, Chenguang Dai, Min Wang, Xian Li

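AI summary: Proposes a geometry-aware multi-angular observation extraction workflow that quantifies observation-geometry effects from a BRDF perspective. Camera parameters are refined via SFM and homogeneous regions are reprojected onto raw sub-images to jointly extract multi-band reflectance and observation geometry parameters; red-edge and near-infrared reflectance is found to vary by 119-137%.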

Abstract

UAV multispectral imagery naturally contains multi-angular observations due to low flight altitude and wide field-of-view imaging, which may introduce geometry-driven radiometric variability. This study proposes a geometry-aware multi-angular observation extraction workflow to quantify observation-geometry effects from a BRDF perspective. Specifically, camera intrinsics and extrinsics are refined via structure-from-motion (SFM), and homogeneous regions annotated on an orthomosaic are reprojected onto multiple raw sub-images acquired from different viewpoints. This enables joint extraction of multi-band reflectance and observation geometry parameters for the same ground targets under varying viewing directions. The extracted observations are further analyzed using band-wise polar visualization in the (VZA, RAA) domain. Results on a grassland target show clear reflectance anisotropy across ten bands, with red-edge and near-infrared bands exhibiting 119-137% variability between maximum and minimum reflectance, indicating non-negligible observation-geometry effects on radiometric consistency.
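
For readers unfamiliar with the (VZA, RAA) domain, the following Python sketch shows one common way to derive the view zenith angle and relative azimuth angle for a single camera-target pair. It assumes local east-north-up coordinates in metres and a known sun azimuth; the function names and the (max - min) / min reading of "variability" are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def view_angles(camera_xyz, target_xyz, sun_azimuth_deg):
    """Return (VZA, RAA) in degrees for one camera-target pair.

    Coordinates are local east-north-up metres (an assumption).
    VZA is the angle between the target-to-camera ray and the vertical.
    RAA is the absolute azimuth difference between that ray and the sun,
    folded into [0, 180].
    """
    dx = camera_xyz[0] - target_xyz[0]  # east offset
    dy = camera_xyz[1] - target_xyz[1]  # north offset
    dz = camera_xyz[2] - target_xyz[2]  # height above target
    horizontal = math.hypot(dx, dy)
    vza = math.degrees(math.atan2(horizontal, dz))
    view_azimuth = math.degrees(math.atan2(dx, dy)) % 360.0  # clockwise from north
    diff = abs(view_azimuth - sun_azimuth_deg) % 360.0
    raa = 360.0 - diff if diff > 180.0 else diff
    return vza, raa

def percent_variability(reflectances):
    """(max - min) / min * 100, one possible reading of 'variability'."""
    lo, hi = min(reflectances), max(reflectances)
    return (hi - lo) / lo * 100.0
```

A camera directly above the target gives a VZA of 0, and a camera 10 m east and 10 m above it gives a VZA of 45 degrees. Conventions for the sign and zero point of RAA differ between BRDF tools, so the folding used here is only one option.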

2606.10348 2026-06-10 cs.RO New submission

Rethinking Embodied Navigation via Relational Inductive Bias

Weitao An, Chenghao Xu, Xu Yang, Cheng Deng

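AI summary: Proposes DB-Nav, a framework that reshapes the search space with dual relational biases (an Activation Bias and an Inhibition Bias) and modulates frontier exploration through a Relational Activation-Inhibition Exploration Graph, significantly improving object-navigation success rate and path efficiency.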

Abstract

Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust: which semantic cues are unreliable. Open-vocabulary perception is prone to systematic misleading evidence: false positives, outdated static priors, and repeated failed exploration due to lack of embodied verification, all of which contaminate mapping and decision-making. Such errors are rooted in structured object relations in real-world scenes. To address this, we propose DB-Nav, a framework that reshapes the search space via dual relational biases. It factorizes target-centric relations into an Activation Bias (which propagates contextual evidence) and an Inhibition Bias (which suppresses unreliable regions via perceptual confusion and action-level falsification). These biases are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values using online observations and failed accesses. Experiments on ObjectNav benchmarks show that DB-Nav significantly outperforms existing methods in success rate (SR) and Success weighted by Path Length (SPL), offering a lightweight, interpretable, and robust navigation framework without costly online VLM reasoning.

2606.10347 2026-06-10 cs.LG cs.LO New submission

Beyond Explaining Predictions: Logic-Based Explanations for Confidence in Machine Learning Models

Vinícius Peixoto Chagas, Carlos Henrique Leitão Cavalcante, Thiago Alves Rocha

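AI summary: Proposes confidence-aware abductive explanations, introduces the Minimum Confidence Threshold to quantify the confidence guarantee an explanation provides, and designs an algorithm that generates minimal explanations meeting a user-specified confidence threshold, raising the guarantee with only a modest increase in explanation length.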

Abstract

Machine learning is increasingly used in critical domains, where both predictions and their associated confidence levels influence important decisions. To enhance transparency in such scenarios, it is important to understand why a model is confident or uncertain about its predictions. Recent logic-based approaches provide abductive explanations, minimal subsets of features sufficient to preserve the predicted class, with correctness guarantees. However, these methods focus solely on classification behavior and may produce explanations that cover instances with low predictive confidence. In this work, we introduce the concept of Minimum Confidence Threshold (MCT), which quantifies the weakest confidence guarantee provided by an abductive explanation. Building upon this concept, we propose confidence-aware abductive explanations, which preserve not only the predicted class but also a user-specified confidence guarantee. We formulate MCT computation as an optimization problem and introduce an algorithm for generating minimal explanations that satisfy a desired confidence threshold. We evaluate the proposed framework on boosted trees for binary classification, although the approach is applicable to other machine learning models that provide confidence scores. Experimental results show that traditional abductive explanations often provide substantially weaker confidence guarantees than the confidence associated with the explained instance itself. In contrast, confidence-aware explanations consistently improve the minimum confidence guaranteed by an explanation while requiring only a modest increase in explanation length. These properties make the proposed approach particularly suitable for applications where both predictive correctness and confidence are essential for trustworthy decision making.
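
The idea of a Minimum Confidence Threshold can be illustrated with a brute-force Python sketch on a toy model with three binary features. This is not the paper's optimization-based algorithm: the toy probability function, the exhaustive enumeration, and the deletion-based search for a subset-minimal explanation are all assumptions chosen to keep the example small.

```python
from itertools import product

def toy_model(x):
    """Hypothetical stand-in for a classifier: P(class 1) for binary features."""
    return 0.3 + 0.3 * x[0] + 0.3 * x[1] + 0.05 * x[2]

def mct(model, instance, fixed):
    """Minimum confidence in the predicted class over all inputs that agree
    with `instance` on the features in `fixed`. Returns None when some
    completion flips the class, i.e. `fixed` is not a sufficient explanation."""
    predicted = model(instance) > 0.5
    free = [i for i in range(len(instance)) if i not in fixed]
    worst = 1.0
    for values in product([0, 1], repeat=len(free)):
        x = list(instance)
        for i, v in zip(free, values):
            x[i] = v
        p = model(x)
        if (p > 0.5) != predicted:
            return None
        worst = min(worst, p if predicted else 1.0 - p)
    return worst

def minimal_explanation(model, instance, threshold):
    """Deletion-based search: drop a feature whenever the remaining set still
    preserves the class with guaranteed confidence >= threshold."""
    kept = set(range(len(instance)))
    for i in range(len(instance)):
        candidate = kept - {i}
        guarantee = mct(model, instance, candidate)
        if guarantee is not None and guarantee >= threshold:
            kept = candidate
    return kept

instance = (1, 1, 1)
plain = minimal_explanation(toy_model, instance, threshold=0.5)
confident = minimal_explanation(toy_model, instance, threshold=0.85)
```

On this toy model the instance itself has confidence 0.95. The class-only explanation keeps a single feature and guarantees a confidence of only about 0.6, while requiring at least 0.85 keeps two features and guarantees about 0.9, mirroring the trade-off between guarantee strength and explanation length described in the abstract.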

2606.10346 2026-06-10 cs.AI New submission

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu

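AI summary: Proposes DiRL, a framework that uses direction-aware rewards to distinguish reasoning-driven from memorization-driven exploration and integrates direction-weighted gradient features into GRPO, significantly improving mathematical and general reasoning performance.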

Comments
12 pages, 6 figures

Abstract

Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.
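
To show where reward shaping sits relative to GRPO, the Python sketch below standardizes rewards within a rollout group, as GRPO-style methods do, after adding a bonus based on a per-rollout alignment score. The additive bonus, the `beta` weight, and the alignment values are hypothetical placeholders; the abstract does not specify how DiRL computes or applies its direction-aware signal.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each rollout's reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def shape_rewards(task_rewards, alignment, beta=0.1):
    """Hypothetical shaping: add a bonus proportional to a per-rollout alignment
    score in [-1, 1] (positive = reasoning-aligned, negative = memorization-aligned)."""
    return [r + beta * a for r, a in zip(task_rewards, alignment)]

task_rewards = [1.0, 1.0, 0.0, 0.0]   # two correct and two incorrect rollouts
alignment = [0.8, -0.6, 0.5, -0.9]    # made-up direction alignment scores
advantages = group_relative_advantages(shape_rewards(task_rewards, alignment))
```

With this shaping, two rollouts that earn the same task reward receive different advantages depending on their alignment score. Implementations differ on whether the group standard deviation is the population or the sample estimate; the population form is used here.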

2606.10340 2026-06-10 cs.RO New submission

OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

Siqiao Huang, Kun-Ying Lee, Dongming Qiao, Guanqi He, Zhenyu Wang, Yitang Li, Shaoting Zhu, Hang Zhao

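AI summary: Proposes OMG, which combines a carefully curated data pipeline with a diffusion model to enable omni-modal whole-body control conditioned on language, audio, and reference motions, showing state-of-the-art performance and scalability.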

Comments
Project Page: https://tsinghua-mars-lab.github.io/OMG/

Abstract

Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies with heavy reward engineering, or motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning with diverse conditioning modalities, atop a reactive motion tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering and labeling pipeline, as well as a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller exhibiting state-of-the-art performance, model scaling behavior and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.

2606.10338 2026-06-10 cs.CL cs.AI New submission

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

Jingyi Xie, Yijun Lin, Yinjiang Xiong, Zhikun Zhang, Sai Li

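AI summary: Addresses the forget-retain routing mismatch in MoE models, which leaves forget-critical experts under-regularized, by proposing TRACE. It detects forget-critical experts from offline activation statistics and reweights retain losses to calibrate retain-side activation frequency. Experiments on WMDP and MUSE-BOOKS show a 9% relative utility improvement over the strongest baseline at comparable forgetting quality.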

Abstract

Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget-retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose TRACE, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.

2606.10334 2026-06-10 cs.AI New submission

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Haoyu Dong

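AI summary: Proposes Visual-SDPO, a framework that uses rendered visual feedback as privileged context and improves the quality of code-generated visual artifacts through self-distillation and Visual-Grounded Code Credit Weighting, with significant gains on chart, UI, and slide generation tasks.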

Abstract

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

2606.10333 2026-06-10 cs.LG cs.CR New submission

Privacy-Preserving Credit Risk Prediction with Alternative Data

Hongzhe Zhang, Jiarong Xu, Jing He, Xiao Fang

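AI summary: To address the privacy leakage caused by sharing alternative data in credit risk prediction, proposes PrivacyCredit, which satisfies consumer-privacy, model-confidentiality, and lossless-performance constraints while matching the predictive performance of a model trained on the plaintext combination of traditional and alternative data.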

Abstract

Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data, collectively referred to as traditional data. Recent studies have demonstrated that alternative data, such as borrowers' mobile phone communication data, enable lenders to acquire fuller and more accurate profiles of borrowers' creditworthiness, thereby improving credit risk prediction performance. Nevertheless, alternative data are held by external entities independent of financial institutions. Directly sharing alternative data with financial institutions infringes on consumer privacy, yet existing credit risk prediction studies largely overlook this issue. To address this gap, we define a new problem, namely privacy-preserving credit risk prediction with alternative data, which simultaneously considers three practical constraints: the privacy-preserving constraint that protects consumer privacy, the model-confidentiality constraint that learns and stores the model centrally at the financial institution, and the lossless constraint that maintains the performance of the learned model. To solve this problem, we develop PrivacyCredit, a novel privacy-preserving machine learning method. We then theoretically demonstrate the privacy-preserving, model-confidential, and lossless properties of PrivacyCredit. Through extensive experiments using a real-world credit dataset linked with alternative data, we demonstrate the predictive value of securely incorporating alternative data into credit risk prediction and show that PrivacyCredit achieves the same predictive performance as the model learned from the insecure plaintext combination of traditional and alternative data. We further evaluate its model-confidentiality property and computational efficiency.

2606.10329 2026-06-10 cs.CV cs.AI New submission

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

Yunlong Liu, Zekai Zhang

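AI summary: To address change detection under the short imaging intervals that follow earthquakes, builds the Turkey earthquake change detection dataset (TUE-CD) and proposes a multi-scale feature interaction network (MSI-Net) whose joint cross-attention and multi-scale offset calibration modules mitigate side-looking problems and improve change detection accuracy.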

Abstract

As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment because it can infer destroyed regions from multi-temporal remote sensing images. Furthermore, CD with a short imaging interval better serves the needs of emergency rescue after earthquakes. However, the capability of current methods built on deep neural networks is limited because datasets with short imaging intervals are lacking. To support immediate post-disaster relief, we create a CD dataset, the Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle differs between temporal images, which leads to side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features and for mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD datasets indicate that the proposed MSI-Net provides better results than the state-of-the-art CD methods considered.

2606.10328 2026-06-10 cs.CV cs.AI New submission

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

Yunlong Liu, Zekai Zhang

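AI summary: Proposes a content-guided spatial-spectral integration network (CSI-Net) that fuses global spatial details with spectral difference information through spatial reasoning, spectral difference, and content-guided integration modules, suppressing differences in unchanged areas and outperforming state-of-the-art methods on three datasets.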

Abstract

Integrating spatial and spectral information improves change detection performance. However, existing methods cannot efficiently suppress the influence of spatial and spectral differences in unchanged areas. To address this issue, we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Owing to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while suppressing spectral differences. Experimental results on the LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net performs better than state-of-the-art methods and is applicable to different scenarios.

2606.10327 2026-06-10 cs.CL cs.LG New submission

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

Ali Keramati, Mark Warschauer

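AI summary: Proposes task-aware sequential fine-tuning of LLaMA-3.1-8B, training in the order of essay discourse structure. On the PERSUADE 2.0 corpus it reaches 65% F1 on evidence and 87% F1 on conclusion, surpassing independent training and, on conclusion, a 70B baseline, showing that curriculum design can improve automated essay scoring.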

Abstract

Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE 2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLMs), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.
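
The three curricula differ only in how training examples are ordered and grouped, which the Python sketch below makes explicit. The element order follows the Sequential curriculum named in the abstract; the function names, the toy data, and the fixed shuffle seed are assumptions for illustration, and the actual fine-tuning step (LoRA on a quantized model) is deliberately left out.

```python
import random

# Element order taken from the Sequential curriculum described in the abstract.
ELEMENTS = ["lead", "position", "claim", "evidence", "conclusion"]

def sequential_schedule(examples_by_element):
    """One model sees all lead examples, then position, and so on."""
    return [(name, ex) for name in ELEMENTS for ex in examples_by_element[name]]

def independent_schedules(examples_by_element):
    """One separate schedule (and hence one model) per element."""
    return {name: [(name, ex) for ex in examples_by_element[name]] for name in ELEMENTS}

def randomized_schedule(examples_by_element, seed=0):
    """All elements mixed and shuffled into a single multi-task stream."""
    stream = sequential_schedule(examples_by_element)
    rng = random.Random(seed)
    rng.shuffle(stream)
    return stream

toy = {name: [f"{name}-{i}" for i in range(2)] for name in ELEMENTS}
```

Each schedule would then be fed to the same fine-tuning routine, so any difference in results can be attributed to ordering rather than to the data or the model.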

2606.10316 2026-06-10 cs.CL New submission

TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Qingyang Mao, Yitong Zhou, Qi Liu

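AI summary: Proposes TabClaw, an open-source interactive AI agent that improves the transparency and personalization of spreadsheet manipulation and table reasoning through editable execution plans, a streaming ReAct loop, parallel multi-table reasoning, and user memory extraction.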

Comments
5 pages, 2 figures

Abstract

Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi-table comparison, and repeat similar workflows without adapting to a user's preferences. This paper presents TabClaw, an open-source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural-language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct-style tool-using analysis loop, dispatches specialist agents for parallel multi-table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one-off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool-use patterns, supports package-style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data-analysis tasks. Our code is available.

2606.10315 2026-06-10 cs.CL cs.AI New submission

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

Sawyer Zhang, Alexander Wang, Sophie Lei

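AI summary: Studies the recall of an LLM judge on real defects in a deployed food-and-beverage ordering agent and finds that it catches only 22% of systematic problems, mainly because the scoring rubric lacks behavioural dimensions such as state tracking and because the routing mechanism misclassifies defects.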

Comments
13 pages, 1 figure, 5 tables

Abstract

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
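
The Rogan-Gladen correction mentioned above is a standard estimator of true prevalence from an apparent positive rate, and a short Python sketch makes the degenerate case visible. The sensitivity of 0.22 echoes the 2-of-9 recall reported in the abstract; the perfect specificity and the 5% flag rate are assumptions chosen only for illustration.

```python
def rogan_gladen(apparent_rate, sensitivity, specificity=1.0):
    """Estimate true prevalence from an apparent (test-positive) rate.

    true = (apparent + specificity - 1) / (sensitivity + specificity - 1),
    clipped to [0, 1]. Requires sensitivity + specificity > 1.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))

# Sensitivity 0.22 echoes the 2-of-9 recall in the abstract; perfect
# specificity and the 5% flag rate are assumptions for illustration.
corrected = rogan_gladen(0.05, sensitivity=0.22)
undercount = corrected / 0.05
degenerate = rogan_gladen(0.0, sensitivity=0.22)
```

Under these assumptions the corrected rate is roughly 4.5 times the flagged rate, which falls inside the 3-6x range quoted in the abstract. When the judge flags nothing, the estimate is zero whatever the sensitivity, so the correction cannot recover the true rate.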

2606.10314 2026-06-10 cs.AI New submission

Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints

Yueyang Liu, Joon-Seok Kim, Andreas Züfle

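AI summary: Proposes an end-to-end generative framework that combines LLM-injected semantic anomalies with map-constrained routing reconstruction to synthesize realistic trajectory anomaly datasets with annotated ground truth.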

Abstract

Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindered by a pervasive lack of ground-truth datasets. Despite the availability of several real-world and simulated human trajectory collections, these datasets exclusively capture normal mobility patterns and lack annotated anomalies. This specific scarcity is fundamentally driven by the inherent statistical rarity of anomalous events, precluding the feasibility of conventional observational methods. Compounding this challenge, the systematic acquisition of large-scale mobility data is strictly bottlenecked by prohibitive costs and stringent privacy regulations. To overcome these fundamental limitations and establish a reliable human trajectory anomalies dataset with annotated ground truth, we introduce a novel, end-to-end generative framework designed to synthesize realistic trajectory anomalies at scale. Our architecture bridges the gap between purely synthetic mobility data and complex real-world physical constraints by operating directly on baseline simulated trajectories. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies such as irregular out-of-distribution check-ins and skipped routine visits. To ensure rigorous spatial validity, the system leverages map-constrained routing reconstruction to recalculate the physical transitions between these LLM agent-modified staypoints. Moreover, to narrow the simulation-to-reality gap, we augment the resulting trajectories with a context-aware spatial noise model, parameterized by environmental and location-specific variables, to accurately emulate heterogeneous GPS sensor degradation.