arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10410 2026-06-10 cs.LG eess.SP q-bio.QM new submission

A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection

生理信号中的综合推理时增强框架:应用于基于PPG的房颤检测

Davood Fattahi, Runze Yan, Saurabh Kataria, Zhaoliang Chen, Xiao Hu

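AI summary: Proposes a unified inference-time augmentation framework comprising 13 augmentation methods, with hyperparameters tuned by Bayesian optimization; on PPG-based AF detection it significantly improves AUROC and AUPRC and lowers the false-positive rate.

Comments
22 pages, 11 figures, 4 tables. Under review at Physiological Measurement
AI-generated Chinese abstract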

目标:在真实部署中,生理信号的准确分类面临传感器噪声、运动伪影以及训练数据与部署数据之间分布偏移的挑战。推理时增强(ITA)在推理过程中应用增强而非重新训练,提供了一种简单、模型无关的机制来提高鲁棒性。然而,ITA在生理信号中的应用范围仍然狭窄,依赖于有限的增强方法和固定的未优化参数。本文提出一个统一的ITA框架以解决这一差距。方法:该框架包含13种增强方法,涵盖时域、幅值域、频域和伪影注入变换,并通过贝叶斯优化优化超参数。我们使用GPT-PPG和ResNet在五个数据集(包含400多名患者和约9,800小时记录)上评估基于30秒PPG信号的房颤(AF)检测。主要结果:标准ITA持续改善了AUROC(GPT-PPG最高提升8.5%,ResNet最高提升0.7%)和AUPRC(GPT-PPG最高提升10.6%,ResNet最高提升0.8%)。选择性ITA进一步将非AF数据集上的平均FPR降低了高达4.4%(GPT-PPG)和1.3%(ResNet)。意义:这些发现确立了ITA作为一种实用的、模型无关的方法,用于在无法重新训练的部署环境中提高基于PPG的房颤分类可靠性,并具有更广泛的生理信号分析适用性。

English abstract

Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and approximately 9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
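
As a rough illustration of the inference-time augmentation idea described in this abstract, the sketch below averages a classifier's output over the original window and a few augmented copies. It is a minimal sketch under stated assumptions: the three augmentations, their parameters, and `toy_model` are hypothetical stand-ins, not the paper's 13 methods, its Bayesian-optimized settings, or its GPT-PPG/ResNet models.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Signal = List[float]
Augmentation = Callable[[Sequence[float]], Signal]


def scale_amplitude(factor: float) -> Augmentation:
    """Amplitude-domain augmentation: multiply every sample by a constant."""
    return lambda x: [factor * v for v in x]


def circular_shift(k: int) -> Augmentation:
    """Time-domain augmentation: rotate the window by k samples."""
    def apply(x: Sequence[float]) -> Signal:
        s = k % len(x)
        return list(x[-s:]) + list(x[:-s]) if s else list(x)
    return apply


def gaussian_noise(sigma: float, seed: int = 0) -> Augmentation:
    """Noise-style augmentation: add zero-mean Gaussian noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return lambda x: [v + rng.gauss(0.0, sigma) for v in x]


def ita_predict(model: Callable[[Signal], float],
                x: Sequence[float],
                augmentations: Sequence[Augmentation]) -> float:
    """Average the predicted probability over the original window and its augmented copies."""
    views = [list(x)] + [aug(x) for aug in augmentations]
    return mean(model(v) for v in views)


def toy_model(x: Signal) -> float:
    """Hypothetical stand-in classifier: maps mean sample-to-sample change into [0, 1)."""
    rough = mean(abs(b - a) for a, b in zip(x, x[1:]))
    return rough / (1.0 + rough)


window = [0.0, 1.0, 0.0, -1.0] * 4  # toy signal, not real PPG data
p_af = ita_predict(toy_model, window,
                   [scale_amplitude(0.9), circular_shift(3), gaussian_noise(0.05)])
```

Plain averaging is only one possible aggregation rule; the abstract does not say how the framework combines predictions or how its selective variant chooses when to augment.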

2606.10398 2026-06-10 cs.IR cs.CL cs.HC cs.SI new submission

Selection, Not Salience: The Shape and Limits of Personalization in Social Highlighting

选择而非显著性:社交高亮中个性化的形态与局限

Kazuki Nakayashiki, Keisuke Watanabe

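AI summary: Using a social highlighter and a co-readership identity control, the study finds that personalization acts mainly at the document-selection layer (about +0.13) rather than at the sentence-salience layer, and that the effect is driven mostly by topic preference.

Comments
9 pages, 1 figure, 3 tables
AI-generated Chinese abstract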

个性化读者所见内容是否值得,其边界在哪里?利用社交网页高亮器和共读身份控制(同一文档被多个用户高亮,固定文档和主题,询问个人历史是否比另一个读者的历史更好地预测其标记),我们绘制了跨阅读层次的个性化形态与局限。在文档层次,我们给出了干净、无泄漏、身份控制的测量,而先前的下一文档评估只能给出上界:个人历史能识别共读邻域中哪些文档属于该用户,自身与其他的差距为+0.169(相对于社区负例)和+0.119(相对于主题匹配的难负例),两者均高度显著;基于内容的实验表明该信号并非纯粹由标题驱动,而主要是主题性的。这与我们先前工作中跨度级的选择信号(+0.14)相当:选择信号在不同层次上幅度相近(+0.12至+0.17),其中大部分是稳定的主题偏好。在句子层次,两阶段个性化自动高亮(非个性化模型提出候选,个性化模型重新排序)并未优于其非个性化基线:两个现成的零样本大语言模型(包括前沿模型)预测高亮位置的效果不如首句基线,且即使在最高召回率的候选池中,个性化重排序也被显著性顺序击败,因此零结果并非仅仅是第一阶段的天花板效应。可测量的个性化主要出现在选择层:适度(约+0.13)、以主题为主,在显著性层没有可靠增益。我们还发现了一个控制负例偏差,该偏差在审计前将我们的文档差距膨胀到虚假的+0.227。超越共享显著性层可能更适合通过聚合个体而非加强个性化来实现。

English abstract

Does personalizing what a reader sees pay off, and where does it stop? Using a social web highlighter and a co-readership identity control (the same document highlighted by many users, which holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does), we map the shape and limits of personalization across reading altitudes. At the document altitude we give the clean, leakage-free, identity-controlled measurement that prior next-document evaluations could only upper-bound: a person's history identifies which documents in a co-reading neighborhood are theirs, with an own-versus-other gap of +0.169 against community negatives and +0.119 against topic-matched hard negatives (both highly significant); a content-based arm suggests the signal is not purely title-driven but is largely thematic. This is comparable to the span-level selection signal (+0.14) from our prior work: the selection signal is of comparable magnitude across altitudes (+0.12 to +0.17), most of it stable topic preference. At the sentence altitude, a two-stage personalized auto-highlight (an impersonal model proposes candidates, a personal model re-ranks them) does not improve on its impersonal baseline: two off-the-shelf zero-shot LLMs, including a frontier model, predict highlight locations worse than a lead baseline, and personal re-ranking is beaten by the salience order even on the highest-recall candidate pool, so the null is not merely a Stage-1 ceiling artifact. Measurable personalization appears primarily at the selection layer: modest (~+0.13), topic-dominated, with no reliable gain at the salience layer. We also surface a control-in-negatives bias that inflated our document gap to a spurious +0.227 until audited. Going beyond the shared salience layer may be better approached by aggregating individuals than by personalizing them harder.

2606.10393 2026-06-10 cs.LG cs.CE new submission

Validation-Stage Combinatorial Fusion Analysis for Imbalanced Credit-Card Fraud Detection

面向不平衡信用卡欺诈检测的验证阶段组合融合分析

Xiao Han, Chenyu Wu

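AI summary: To address class imbalance in credit-card fraud detection, the paper applies Combinatorial Fusion Analysis (CFA) at the validation stage to select a complementary model subset and assign diversity weights, reaching an AUC-ROC of 0.9405 on the IEEE-CIS benchmark.

AI-generated Chinese abstract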

信用卡欺诈检测因欺诈交易稀少、成本高且分布不均而困难。强梯度提升树模型在结构化交易数据上已表现良好,因此另一种融合方法的价值并不明显。本文研究组合融合分析(CFA)——通过搜索模型子集和排名得分融合规则——是否能在IEEE-CIS欺诈检测基准上增加价值。使用无泄漏的60/20/20训练/验证/测试协议,我们评估了由七个基分类器构建的480种融合配置。最佳测试集结果来自随机森林、XGBoost和LightGBM的多样性加权得分融合(DEF WtScore),AUC-ROC = 0.9405,AUPRC = 0.6699,F1 = 0.6373。来自1000次重抽样的Bootstrap置信区间显示,对于所有三个指标,相对于最强单一模型的增益均排除零。CFA在AUC-ROC上与软投票持平,提高了AUPRC和F1,并在该设置下优于堆叠。CTGAN增强实验给出了负面结果:合成欺诈样本降低了单个模型和CFA的性能。总体而言,CFA在此处最有用的不是作为组合所有分类器的方法,而是作为验证阶段的方法,用于选择小的、互补的子集并分配多样性感知的权重。

English abstract

Credit-card fraud detection is difficult because fraudulent transactions are rare, costly, and unevenly distributed. Strong gradient-boosted tree models already perform well on structured transaction data, so the value of another fusion method is not obvious. This paper examines whether Combinatorial Fusion Analysis (CFA), which searches over model subsets and rank-score fusion rules, can still add value on the IEEE-CIS Fraud Detection benchmark. Using a leakage-free 60/20/20 train/validation/test protocol, we evaluate 480 fusion configurations built from seven base classifiers. The best test-set result comes from diversity-weighted score fusion of Random Forest, XGBoost, and LightGBM (DEF WtScore), with AUC-ROC = 0.9405, AUPRC = 0.6699, and F1 = 0.6373. Bootstrap confidence intervals from 1,000 resamples show that the gains over the strongest single model exclude zero for all three metrics. CFA matches soft voting on AUC-ROC, improves AUPRC and F1, and outperforms stacking in this setting. A CTGAN augmentation experiment gives a negative result: synthetic fraud samples degrade both individual models and CFA. Overall, CFA is most useful here not as a way to combine every classifier, but as a validation-stage method for choosing a small, complementary subset and assigning diversity-aware weights.
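
To make the idea of diversity-weighted score fusion more concrete, here is a small sketch. It assumes one possible weighting rule: each model's weight is its average distance to the other models' rank-score functions (normalized scores sorted in descending order). The abstract does not define how DEF WtScore computes its weights, so every function name and the weighting rule itself are illustrative assumptions.

```python
from math import sqrt
from typing import Dict, List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant inputs map to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank_score_function(scores: List[float]) -> List[float]:
    """Normalized scores sorted in descending order: entry i is the score at rank i."""
    return sorted(min_max(scores), reverse=True)


def diversity(a: List[float], b: List[float]) -> float:
    """Assumed diversity measure: RMS gap between two rank-score functions."""
    fa, fb = rank_score_function(a), rank_score_function(b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa))


def diversity_weights(models: Dict[str, List[float]]) -> Dict[str, float]:
    """Weight each model by its mean diversity to the others (needs at least two models)."""
    names = list(models)
    raw = {
        n: sum(diversity(models[n], models[m]) for m in names if m != n) / (len(names) - 1)
        for n in names
    }
    total = sum(raw.values())
    if total == 0:  # all models behave identically: fall back to equal weights
        return {n: 1.0 / len(names) for n in names}
    return {n: v / total for n, v in raw.items()}


def weighted_score_fusion(models: Dict[str, List[float]]) -> List[float]:
    """Fuse per-transaction scores as a diversity-weighted average of normalized scores."""
    weights = diversity_weights(models)
    normed = {n: min_max(s) for n, s in models.items()}
    n_items = len(next(iter(normed.values())))
    return [sum(weights[n] * normed[n][i] for n in normed) for i in range(n_items)]
```

In a validation-stage setup like the one the abstract describes, weights of this kind would be computed on validation scores and then held fixed for the test set.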

2606.10392 2026-06-10 cs.AI new submission

Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

使用LoRA和NEFTune对DeepSeek-R1-8B模型进行指令微调

Wu Yuerong, Mingni Luo

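AI summary: Fine-tunes DeepSeek-R1-8B with LoRA and NEFTune for financial named-entity recognition, reaching a micro-F1 of 0.912 across seven entity types and outperforming several baselines.

AI-generated Chinese abstract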

金融命名实体识别(NER)对于将非结构化的财务报告和新闻转化为结构化知识图谱至关重要。然而,通用大语言模型(LLMs)常常错误分类金融实体或忽略领域特定模式。本文研究了使用DeepSeek-R1-8B(一个最近开源的大语言模型)结合低秩适应(LoRA)和噪声嵌入微调(NEFTune)进行金融NER。我们语料库中的1693个样本中每个带注释的句子都被转换为指令-输入-输出三元组。我们将轻量级LoRA矩阵插入Transformer层,并应用NEFTune通过在训练期间向嵌入向量添加均匀噪声来提高泛化能力。实验表明,LoRA适应的DeepSeek-R1-8B在七种实体类型(公司、日期、地点、货币、人物、产品和数量)上达到了0.901的微F1分数,而添加NEFTune进一步将微F1分数提升至0.912,优于Llama3-8B、Qwen3-8B、Baichuan2-7B、T5和BERT-Base基线。

English abstract

Financial named-entity recognition (NER) is essential for translating unstructured financial reports and news into structured knowledge graphs. However, general-purpose large language models (LLMs) often misclassify financial entities or ignore domain-specific patterns. This paper investigates the use of DeepSeek-R1-8B, a recent open-source large language model, combined with Low-Rank Adaptation (LoRA) and Noisy Embedding Fine-Tuning (NEFTune) for financial NER. Each annotated sentence in our corpus of 1693 samples is converted into an instruction-input-output triple. We insert lightweight LoRA matrices into the Transformer layers and apply NEFTune to improve generalisation by adding uniform noise to embedding vectors during training. Experiments show that the LoRA-adapted DeepSeek-R1-8B achieves a micro-F1 of 0.901 on seven entity types (Company, Date, Location, Money, Person, Product and Quantity), and adding NEFTune further boosts the micro-F1 to 0.912, outperforming Llama3-8B, Qwen3-8B, Baichuan2-7B, T5 and BERT-Base baselines.
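
The noise-injection step mentioned in this abstract can be sketched in a few lines. The abstract states only that uniform noise is added to embedding vectors during training. The scaling rule `alpha / sqrt(seq_len * dim)` follows the original NEFTune formulation, and the `alpha` value used below is a hypothetical setting, not one reported by the authors.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def neftune_noise(embeddings: Matrix, alpha: float,
                  rng: random.Random, training: bool = True) -> Matrix:
    """Add scaled uniform noise to a (seq_len x dim) embedding matrix during training only."""
    if not training:
        # At evaluation time the embeddings pass through unchanged.
        return [row[:] for row in embeddings]
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[v + scale * rng.uniform(-1.0, 1.0) for v in row] for row in embeddings]


rng = random.Random(0)
token_embeddings = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.5, -0.5, 0.2]]  # toy values
noisy = neftune_noise(token_embeddings, alpha=5.0, rng=rng)
```

In a real fine-tuning run this perturbation would be applied to the embedding layer's output on every training forward pass, while the LoRA matrices receive the gradient updates.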

2606.10388 2026-06-10 cs.IR cs.AI new submission

SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

SkillResolve-Bench:衡量和解决智能体技能检索中的同能力歧义

Jiandong Ding

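AI summary: Targeting the execution risk posed by different skills within the same capability family in agent skill libraries, the paper introduces the SkillResolve-Bench benchmark and the SkillResolve method, which uses candidate-family resolution and representative selection to cut harmful-skill exposure to 0 while keeping recall high.

Comments
Preprint
AI-generated Chinese abstract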

智能体技能库正成为可路由的软件资产:检索到的技能可以为智能体提供指令、脚本、资源绑定和执行假设。这使得技能检索不仅仅是广泛的相关性匹配。检索器可以找到正确的能力族,却暴露出错误的同能力代表。我们将这种失败研究为同能力执行风险检索。每个查询将一个有用的技能与一个特定于查询的有风险兄弟技能配对,该兄弟技能共享能力族,但可能导致执行指向过时资源、缺失前提或错误程序。我们引入了SkillResolve-Bench 1.0,这是一个针对该场景的可审计基准,包含661个有用/有风险对、源角色和准入证据、线索/泄漏检查、查询不相交划分,以及一个包含6,660个公共SkillRet候选的7,982个候选池。该基准报告有用性排名以及有害兄弟率(HSR@K),即前K个中暴露有风险兄弟的比例。我们还提供了SkillResolve,一种参考方法,它解析活跃候选族,从易混淆的库负样本和契约配置文件线索中评分查询条件效用,并在最终前K列表之前从每个族中选择一个代表。在已发布族关系下,SkillResolve达到Recall@3 0.766和NDCG@3 0.699,同时保持HSR@3=0。与SkillRouter相比,Recall@3提升0.112,NDCG@3提升0.165,同时将HSR@3从0.693降至0。如果没有代表性选择,在相同评分器下HSR@3升至0.236,这表明族内代表性选择是将能力检索转化为更安全过程暴露的机制。

English abstract

Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.
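
The interplay between representative selection and HSR@K described in this abstract can be sketched as follows. The sketch assumes HSR@K is the fraction of queries whose top-K list contains the risky sibling, and that representative selection keeps the single highest-scoring candidate per family. The skill names, scores, and family labels are invented for illustration.

```python
from typing import Dict, List, Tuple


def top_k(scored: List[Tuple[str, float]], k: int) -> List[str]:
    """Plain top-K by score, with no family handling."""
    return [s for s, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:k]]


def select_representatives(scored: List[Tuple[str, float]],
                           family_of: Dict[str, str], k: int) -> List[str]:
    """Keep the best-scoring candidate of each family, then take the top-K of those."""
    best: Dict[str, Tuple[str, float]] = {}
    for skill, score in scored:
        fam = family_of[skill]
        if fam not in best or score > best[fam][1]:
            best[fam] = (skill, score)
    return top_k(list(best.values()), k)


def recall_at_k(rankings: Dict[str, List[str]], helpful: Dict[str, str], k: int) -> float:
    """Fraction of queries whose top-K contains the helpful skill."""
    return sum(helpful[q] in r[:k] for q, r in rankings.items()) / len(rankings)


def hsr_at_k(rankings: Dict[str, List[str]], risky: Dict[str, str], k: int) -> float:
    """Fraction of queries whose top-K exposes the risky sibling."""
    return sum(risky[q] in r[:k] for q, r in rankings.items()) / len(rankings)


# Hypothetical query: two skills share the "deploy" family, one of them stale.
scored = [("deploy_v2", 0.9), ("deploy_v1_stale", 0.8), ("lint", 0.5), ("run_tests", 0.4)]
family_of = {"deploy_v2": "deploy", "deploy_v1_stale": "deploy",
             "lint": "lint", "run_tests": "test"}
helpful = {"q1": "deploy_v2"}
risky = {"q1": "deploy_v1_stale"}

plain = {"q1": top_k(scored, 3)}
resolved = {"q1": select_representatives(scored, family_of, 3)}
```

In this toy case both rankings recall the helpful skill, but only the plain top-3 list exposes the stale sibling, which mirrors the mechanism the abstract attributes to within-family representative choice.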

2606.10382 2026-06-10 cs.RO new submission

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

UMI-Bench 1.0:基于UMI数据的桌面机器人操作开放可复现真实世界基准

Shi Jin, Yuntian Wang, Yuhui Duan, Di Wu, Gaoqi Dong, Xiaohang Liu, Xiaotong Li, Hongfei Jia, Zehao Zhang, Tianyu Wang, Zhongjie Jia, Yuanqi Yao, Chenjia Bai, Zhaxizhuoma, Siao Liu, Nieqing Cao, Jin Wang, Chao Yu, Yan Ding

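AI summary: Presents UMI-Bench 1.0, described as the first real-robot benchmark designed specifically for UMI-style manipulation policies; a unified protocol covers data collection, scene reset, policy execution, result logging, and task-factor analysis, providing a reproducible evaluation platform.

AI-generated Chinese abstract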

真实机器人评估对于理解学习到的操作策略能否在精心策划的演示之外可靠运行至关重要。这一需求对于通用操作接口(UMI)风格策略尤为迫切,其性能取决于腕部视角观测、动作表示、数据收集和物理部署之间的耦合。现有的真实世界基准已取得重要进展,但它们并非围绕这种UMI数据到部署的设置而设计。我们提出UMI-Bench 1.0,一个本地优先的真实机器人基准,用于标准化评估UMI风格的操作策略。据我们所知,这是首个专门用于基于UMI的操作模型真实世界评估的基准。UMI-Bench将数据收集、场景重置、策略执行、结果记录和任务因素分析统一在一个协议中。通过使整个评估过程可复现和可审计,UMI-Bench为衡量UMI训练策略如何泛化到真实物理操作提供了一个实用的测试平台。

English abstract

Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

2606.10378 2026-06-10 cs.CV new submission

FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation

FSS-Net:用于颈动脉超声分割的频率-空间协同网络与小波注意力

Jiawei Liu, Zhijiang Wan, Junhua Hu, Rongli Zhang, Zhongbiao Xu, Yankun Cao, Yuan Chen, Jin Hong

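AI summary: Proposes the Frequency-Spatial Synergy Network (FSS-Net), which integrates wavelet transform, multi-domain attention, and edge enhancement; it achieves a 96.46% Dice score on carotid ultrasound datasets, segmenting carotid arteries and identifying plaque effectively.

AI-generated Chinese abstract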

超声成像中颈动脉的精确分割对于中风风险评估至关重要。然而,散斑噪声、低对比度和模糊边界仍然是主要挑战。在本文中,我们提出了一种频率-空间协同网络(FSS-Net),以实现噪声鲁棒且高精度的颈动脉分割。该网络将小波变换、多域注意力和边缘增强集成到一个统一的编码器-解码器架构中。具体来说,设计了一个通道-空间-小波注意力(CSWA)模块,以抑制频率域中的噪声并净化语义特征。引入了一个小波增强瓶颈(WEB)模块,以高效捕获长距离全局依赖关系。此外,一个拉普拉斯引导的自适应边缘融合(LAEF)模块补偿高频细节并保持边界连续性。在颈动脉超声数据集上的大量实验表明,FSS-Net在低信噪比条件下达到了96.46%的Dice分数(DSC)和强鲁棒性,优于几种最先进的方法。该方法实现了超声成像中颈动脉的精确分割,有效识别颈动脉粥样硬化斑块,并通过其他任务(如乳腺癌分割)验证,表明其在超声图像中识别异常组织肿块具有良好的临床应用潜力。

English abstract

Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates for high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low-SNR conditions, outperforming several state-of-the-art methods. The method accurately segments carotid arteries in ultrasound images, effectively identifies carotid atherosclerotic plaque, and is further validated on other tasks (such as breast cancer segmentation), suggesting good clinical potential for identifying abnormal tissue masses in ultrasound images.
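
For readers unfamiliar with the Dice score (DSC) reported in this abstract, the standard definition for binary masks is twice the overlap divided by the total foreground size. A minimal sketch on flattened 0/1 masks, with toy values that are not taken from the paper:

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention, not a rule).
        return 1.0
    return 2.0 * intersection / total


predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
dsc = dice_score(predicted_mask, reference_mask)  # 2 * 1 / (2 + 1)
```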

2606.10372 2026-06-10 cs.CV new submission

ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment

ClinReadNet: 一种受临床阅读启发的低剂量腹部CT图像质量评估网络

Xianye Xiao, Yulong Zou, Yujie Luo, Taihui Yu, Cun-Jing Zheng, Yuan-ming Geng, Shuihua Wang, Yudong Zhang, Jin Hong

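AI summary: Proposes ClinReadNet, a framework that mimics radiologists' reading habits by combining a Sobel ordinal quality network with a (shifted) window multi-scale temperature multi-head self-attention module and a hierarchical ranked probability score loss, achieving SOTA performance on the LDCTIQAG2023 dataset.

AI-generated Chinese abstract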

在腹部CT成像中,开发一种模拟医生阅读习惯的低剂量无参考图像质量评估(No-reference IQA)模型具有重要的实际价值。本文提出了一种新颖的基于深度学习的框架ClinReadNet,其设计与放射科医生的临床阅读逻辑一致:首先,引入Sobel序数质量网络(SOQN)模块,该模块能同时关注与图像质量高度相关的边缘细节和整个图像的质量分布模式,准确匹配“兼顾局部细节与整体上下文”的临床阅片判断习惯;其次,该框架集成了(移位)窗口多尺度温度多头自注意力((S)W-MTMSA)模块,进一步复制了放射科医生从整体扫描到局部聚焦的阅片过程,并通过多锐度注意力精确锁定感兴趣区域;第三,设计了分层排序概率分数(HRPS)损失函数,该函数结合了粗分类和细分类的双重逻辑,同时关注分级标签之间的距离信息,有效提升了图像质量评估的性能。在LDCTIQAG2023数据集上进行的实验表明,所提方法达到了当前最先进(SOTA)性能:皮尔逊线性相关系数(PLCC)、斯皮尔曼秩相关系数(SROCC)和肯德尔秩相关系数(KROCC)的值分别达到0.9507、0.9554和0.8629,其绝对值之和(Score)为2.7690,优于现有方法。

English abstract

In abdominal CT imaging, developing a low-dose, no-reference image quality assessment (No-reference IQA) model that mimics doctors' reading habits for evaluating CT image quality has significant practical value. This paper proposes a novel deep learning-based framework, ClinReadNet, whose design aligns with the clinical reading logic of radiologists: first, it introduces the Sobel ordinal quality network (SOQN) module, which can simultaneously focus on edge details highly relevant to image quality and the quality distribution pattern of the entire image, accurately matching the clinical image-reading judgment habit of "considering both local details and overall context"; second, the framework integrates the (shifted) window multi-scale temperature multi-head self-attention ((S)W-MTMSA) module, which further replicates the radiologists' image-reading process of shifting from overall scanning to local focusing, and accurately locks in regions of interest through multi-sharpness attention; third, it designs the hierarchical ranked probability score (HRPS) loss function, which combines the dual logics of coarse classification and fine classification, while paying attention to the distance information between grading labels, effectively improving the performance of image quality assessment. Experiments conducted on the LDCTIQAG2023 dataset show that the proposed method achieves the current state-of-the-art (SOTA) performance: the values of Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC) reach 0.9507, 0.9554, and 0.8629 respectively, with the sum of their absolute values (Score) being 2.7690, outperforming existing methods.
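
Two quantities in this abstract lend themselves to a short worked example. The first function below is the standard (unnormalized) ranked probability score for ordinal labels, shown only to illustrate how such a loss penalizes predictions by their distance from the true grade. It is not the paper's hierarchical HRPS, whose exact form the abstract does not give. The second function reproduces the reported overall Score as the sum of the absolute correlation values.

```python
from itertools import accumulate
from typing import Sequence


def ranked_probability_score(probs: Sequence[float], true_class: int) -> float:
    """Sum of squared gaps between predicted and true cumulative distributions."""
    cum_pred = list(accumulate(probs))
    cum_true = [1.0 if k >= true_class else 0.0 for k in range(len(probs))]
    return sum((p - t) ** 2 for p, t in zip(cum_pred, cum_true))


def combined_score(plcc: float, srocc: float, krocc: float) -> float:
    """Overall Score as described in the abstract: |PLCC| + |SROCC| + |KROCC|."""
    return abs(plcc) + abs(srocc) + abs(krocc)


# With three quality grades and true grade 0, a miss by two grades costs more than a miss by one.
near_miss = ranked_probability_score([0.0, 1.0, 0.0], true_class=0)
far_miss = ranked_probability_score([0.0, 0.0, 1.0], true_class=0)

reported = combined_score(0.9507, 0.9554, 0.8629)  # values quoted in the abstract
```

The second call confirms the arithmetic in the abstract: 0.9507 + 0.9554 + 0.8629 = 2.7690.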

2606.10369 2026-06-10 cs.CL cs.LG new submission

PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

PADD: 面向非路由器教师指导MoE学生学习的路径对齐解压缩蒸馏

Xinyue Peng, Yi Qian, Jiaojiao Lin, Wenjian Shao, Yanming Liu

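AI summary: Proposes Path-Aligned Decompression Distillation (PADD), a four-stage, two-phase pipeline that distills a dense teacher into a mixture-of-experts (MoE) student while learning a high-quality routing policy, significantly outperforming baselines on mathematical reasoning tasks.

Comments
published in ICML 2026
AI-generated Chinese abstract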

随着大型语言模型(LLMs)持续扩展,在固定计算预算下增长模型容量变得越来越具有挑战性。我们提出路径对齐解压缩蒸馏(PADD),这是一个将知识从无显式路由的密集教师蒸馏到混合专家(MoE)学生中,同时学习高质量路由策略的框架。PADD将知识蒸馏组织为两个阶段的四个阶段:初始化阶段(阶段I)通过教师神经元聚类和学生专家预热在学生专家中构建多样功能,以及训练阶段(阶段II–IV)将在线自适应蒸馏、路径细化策略优化和奖励增强负载平衡集成在单一训练流程中。在数学推理基准上的实验表明,在相同推理成本下,PADD相比强基线取得了显著提升,且MoE学生能够匹配或超越其密集教师。实验还展示了有效的教师到学生知识蒸馏和稳定的路由行为。

English abstract

As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II--IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

2606.10364 2026-06-10 cs.CV new submission

Benchmarking stereo reconstruction for 3D printable Martian terrain models

用于3D打印火星地形模型的立体重建基准测试

Josephine Wang

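AI summary: Given that Mars imagery is low-texture, irregular, and partially observed, the paper evaluates a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports printable meshes; it finds that benchmark accuracy does not transfer directly to Martian terrain reconstruction and that geometry completion trades local fidelity against global connectivity.

Comments
9 pages, 7 figures, CVPR End-to-End 3D Workshop 2026
AI-generated Chinese abstract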

从火星车图像重建可打印的3D模型具有挑战性,因为火星地形纹理低、不规则且部分被观测。我们评估了一个流程,该流程从NASA好奇号图像估计立体深度,补全几何,并导出水密OBJ网格。在Middlebury数据集上,RAFT-Stereo优于半全局块匹配(SGBM),将视差MAE从3.22像素降低到0.73像素,并将有效预测覆盖率从76.3%提高到100.0%。然而,在好奇号图像上,RAFT更密集的视差显示出较弱的边缘对齐和更高的光度重投影误差,表明基准精度不能直接迁移到火星地形重建。几何补全展示了局部保真度与全局连通性之间的权衡。我们发现,alpha形状保留了准确但碎片化的结构,泊松重建产生更连贯的网格但增加了无支撑表面,而确定性扩散填充基线介于两者之间但对立体质量敏感。总体而言,标准立体和补全方法可以产生火星地形的可打印近似,但可靠的重建需要更强的领域特定验证。

English abstract

Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT's denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.
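
The two Middlebury metrics quoted in this abstract, disparity MAE and valid prediction coverage, can be computed along the following lines. The sketch assumes that invalid pixels are marked with `None`, that coverage is the share of ground-truth pixels that received a prediction, and that MAE is taken only over pixels valid in both maps. The paper may define these differently, and the values below are toy numbers.

```python
from typing import List, Optional, Tuple

Disparity = List[Optional[float]]


def disparity_metrics(pred: Disparity, gt: Disparity) -> Tuple[float, float]:
    """Return (MAE in pixels, coverage) for flattened disparity maps."""
    evaluable = [(p, g) for p, g in zip(pred, gt) if g is not None]
    if not evaluable:
        raise ValueError("ground truth has no valid pixels")
    valid = [(p, g) for p, g in evaluable if p is not None]
    coverage = len(valid) / len(evaluable)
    if not valid:
        return float("nan"), coverage
    mae = sum(abs(p - g) for p, g in valid) / len(valid)
    return mae, coverage


# A sparse matcher leaves one hole (None); the last pixel has no ground truth and is ignored.
predicted = [10.0, None, 12.0, 8.0]
ground_truth = [11.0, 9.0, 12.0, None]
mae_px, coverage = disparity_metrics(predicted, ground_truth)
```

Reporting both numbers together matters because a method can lower its MAE simply by declining to predict in hard regions, which is part of why the abstract pairs the two.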

2606.10363 2026-06-10 cs.RO new submission

HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation

HiMem-WAM: 用于机器人操作的分层记忆门控世界动作模型

Xiaoquan Sun, Ruijian Zhang, Chen Cao, Yihan Sun, Jiahui Chen, Zetian Xu, Bo Chen, Haijier Chen, Zhen Yang, Jiarun Zhu, Yijun Hong, JingZhe Xu, Jingrui Pang, Mingqi Yuan, Jiayu Chen

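AI summary: Proposes HiMem-WAM, a hierarchical memory-gated world action model that uses a hierarchical latent-action framework and boundary-triggered memory updates to improve task-relevant memory, generalization, and robustness in long-horizon robotic manipulation.

AI-generated Chinese abstract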

世界动作模型(WAM)已成为具身智能的一种新的强大范式,学习与动作相关的视觉动态,显著增强了泛化性和鲁棒性。然而,现有的WAM在长时域机器人操作中仍难以处理任务相关记忆。为了解决这个问题,我们提出了HiMem-WAM,一种分层记忆门控WAM,它集成了以运动为中心的潜在动作、高级技能潜在变量和边界触发的记忆更新。具体来说,我们开发了一个分层潜在动作框架,共同学习低级运动和高级技能潜在变量,提供结构化的时间抽象。同时,边界感知记忆门在预测的技能转换处写入紧凑的任务状态,无需在测试时生成未来视频或光流估计即可实现因果推理。在LIBERO、LIBERO-PLUS、RMBench和真实世界任务上的评估表明,HiMem-WAM的分层潜在变量提高了部署扰动下的鲁棒性,而记忆模块显著有益于依赖记忆的长时域操作。

English abstract

World Action Models (WAMs) have emerged as a new powerful paradigm for embodied intelligence, learning action-relevant visual dynamics that significantly enhance generalization and robustness. However, existing WAMs still struggle with task-relevant memory in long-horizon robotic manipulation. To address this, we present HiMem-WAM, a Hierarchical Memory-Gated WAM that integrates motion-centric latent actions, high-level skill latents, and boundary-triggered memory updates. Specifically, we develop a hierarchical latent action framework that jointly learns low-level motion and high-level skill latents, providing structured temporal abstraction. Meanwhile, a boundary-aware memory gate writes compact task states at predicted skill transitions, enabling causal inference without test-time generation of future video or optical flow estimation. Evaluated on LIBERO, LIBERO-PLUS, RMBench and real-world tasks, HiMem-WAM shows that hierarchical latents improve robustness under deployment perturbations, and the memory module substantially benefits memory-dependent long-horizon manipulation.

2606.10359 2026-06-10 cs.AI new submission

ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

ReflectiChain: 面向供应链韧性的LLM驱动世界模型中的认知基础

Jia Luo

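AI summary: Proposes the ReflectiChain framework, which uses a generative supply chain world model and double-loop learning to separate epistemic from aleatoric uncertainty; on a semiconductor benchmark it improves rationale consistency by 33.0% and maintains 82.3% operability under adversarial shocks.

AI-generated Chinese abstract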

供应链中的AI代理面临一个基本的认知鸿沟:大语言模型(LLMs)解释策略但缺乏物理基础,而强化学习(RL)优化流程但对非结构化约束语义上视而不见。我们引入REFLECTICHAIN,通过生成式供应链世界模型(SC-WM)——将异构供应网络编码到具有物理守恒的6维图-潜在空间中——以及双环学习(将认知不确定性(KL信任域约束的策略适应)与偶然不确定性(随机潜在展开)分离)来弥合这一鸿沟。在Semi-Sim(一个具有SIR风险传播、6种扰动类型和10种策略约束模板的10节点半导体基准)上,REFLECTICHAIN将推理一致性得分提高了33.0%(p < 0.0001, d = 2.78),在对抗性冲击下保持了82.3%的可操作性,并表现出反脆弱行为(在适度压力下增益+40.2%)。我们识别了三种操作性的认知机制——不确定性分离、知识边界检测和经验贝叶斯策略更新——并讨论了五个局限性类别。

English abstract

AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce REFLECTICHAIN, bridging this gap through a Generative Supply Chain World Model (SC-WM) - encoding heterogeneous supply networks into a 6-dim graph-latent space with physical conservation - and Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, REFLECTICHAIN improves Rationale Consistency Score by 33.0% (p < 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and exhibits anti-fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms - uncertainty separation, knowledge-boundary detection, and empirical Bayesian policy updating - and discuss five limitation categories.

2606.10358 2026-06-10 cs.LG cs.AI new submission

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

KG-SoftMAP: 基于软知识图谱先验的稀疏离散数据贝叶斯网络结构学习

Guoliang Xu, James E. Corter

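AI summary: To address the difficulty of learning Bayesian network structure from sparse discrete data, the paper proposes KG-SoftMAP, which encodes a weighted directed knowledge graph as a soft prior and maximizes a MAP objective combining the BDeu score with a logit-form prior. It improves structure recovery on synthetic benchmarks; on real data, which has no ground-truth DAG, it is evaluated on prediction, calibration, and KG-consistency.

Comments
33 pages including appendices, 1 figure
AI-generated Chinese abstract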

从稀疏离散数据中学习贝叶斯网络(BN)结构是困难的:当每个实例仅记录少数变量时,大多数变量对缺乏可靠评分所需的联合观测,且纯数据方法恢复的结构很少。不完美的领域知识,可表示为加权有向知识图谱(KG),通常是可用的。我们提出KG-SoftMAP,它将这样的KG编码为软性的、置信度加权的、可被数据覆盖的边先验,并最大化结合BDeu评分与logit形式先验的MAP目标;KG可由专家整理或由LLM提取。在受控的合成基准(唯一具有真实DAG的设置)上,KG-SoftMAP在$\rho=0.05$时恢复部分有向结构(DF1从$0.14$到$0.29$,而基线接近零),当$\rho\geq0.2$时恢复更多(DF1从$0.46$到$0.96$),前提是配有一个信息丰富但不完美的KG;恢复性能随KG质量下降而优雅地退化。在无真实DAG的真实稀疏教育数据上,我们仅评估面向部署的指标:预测、校准和KG一致性。学习到的BN最好被解读为诊断模型:在SAF上,它落后于逻辑回归$0.03$的F1_FAIL,同时提供KG一致的边、校准的联合概率以及从任意观测概念子集的推理;当不存在有意义的KG时,判别式逻辑回归更可取。

English abstract

Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. Imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a soft, confidence-weighted, data-overridable edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On controlled synthetic benchmarks, the only setting with ground-truth DAGs, KG-SoftMAP recovers partial directed structure at $ρ=0.05$ (DF1 $0.14$ to $0.29$, versus near-zero baselines) and substantially more once $ρ\geq0.2$ (DF1 $0.46$ to $0.96$), when paired with an informative but imperfect KG; recovery degrades gracefully as KG quality drops. On real sparse educational data, which has no ground-truth DAG, we evaluate deployment-facing measures only: prediction, calibration, and KG-consistency. The learned BN is best read as a diagnostic model: on SAF it trails logistic regression by $0.03$ F1_FAIL while providing KG-consistent edges, calibrated joint probabilities, and inference from arbitrary observed concept subsets; when no meaningful KG exists, discriminative logistic regression is preferable.
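
One way to picture a soft, data-overridable edge prior of the kind this abstract describes is sketched below. It assumes that each edge present in a candidate graph adds the logit of its KG confidence to the data score, that edges missing from the KG are neutral (confidence 0.5), and that a hypothetical `prior_weight` scales the prior. The abstract does not spell out the actual objective, and the BDeu score is treated here as a number computed elsewhere.

```python
import math
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def logit(c: float) -> float:
    """Log-odds of a confidence value strictly between 0 and 1."""
    return math.log(c / (1.0 - c))


def map_score(data_score: float, edges: Iterable[Edge],
              kg_confidence: Dict[Edge, float], prior_weight: float = 1.0) -> float:
    """Assumed MAP objective: log data score plus a weighted sum of edge logits."""
    prior = sum(logit(kg_confidence.get(e, 0.5)) for e in edges)
    return data_score + prior_weight * prior


# Hypothetical KG: it favours A -> B and disfavours the reverse direction.
kg = {("A", "B"): 0.9, ("B", "A"): 0.1}

with_kg_edge = map_score(-100.0, [("A", "B")], kg)
against_kg_edge = map_score(-100.0, [("B", "A")], kg)
# Strong enough data evidence can still override the prior.
data_overrides = map_score(-90.0, [("B", "A")], kg)
```

Because the prior enters additively in log space, a KG edge shifts the score by a bounded amount rather than forcing the edge in or out, which is one reading of the "soft" and "data-overridable" behaviour the abstract describes.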

2606.10357 2026-06-10 cs.IR cs.AI new submission

Atomic Intent Reasoning: Bringing LLM Semantics to Industrial Cross-Domain Recommendations

原子意图推理:将LLM语义引入工业跨域推荐

Zhuohang Jiang, Yuxin Chen, Shijie Wang, Haohao Qu, Zhou Jindong, Wenqi Fan, Li Qing, Dongxu Liang, Jun Wang

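AI summary: Proposes the AIR framework, which combines offline LLM inference with efficient online retrieval and composition to deliver industrial-scale cross-domain recommendation, raising GMV by 3.446% in Kuaishou E-commerce.

Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea
AI-generated Chinese abstract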

跨域推荐是内容到电子商务平台的核心问题。其目标是利用用户与内容的交互来推断电子商务端的潜在购买意图,从而提高转化率和商业价值。然而,在真实的工业场景中,跨域推荐面临多重挑战:不同领域之间存在显著的语义鸿沟,用户跨域行为序列通常规模庞大且噪声丰富。尽管大型语言模型(LLM)具有强大的语义理解和推理能力,但其毫秒级的推理延迟使得直接应用于在线推荐系统变得困难。为了解决这些问题,本文介绍了AIR(原子意图推理),一个为工业级部署设计的LLM驱动的跨域推荐框架。通过将LLM推理迁移到离线阶段,并在在线操作期间通过高效检索和组合动态构建用户意图表示,它在保持语义一致性的同时实现了约400倍的推理加速。在多个公共数据集上的实验结果表明,我们的方法在跨域推荐任务中达到了最先进的性能。此外,在快手电商真实业务场景中进行的大规模在线A/B测试显示,我们的方法在多个核心业务指标上取得了稳定且显著的提升,包括GMV增长+3.446%,充分验证了其在工业级推荐系统中的有效性和实用价值。

English abstract

Cross-domain recommendation is a core problem in content-to-e-commerce platforms. Its objective is to leverage user interactions with content to infer potential purchasing intent on the e-commerce side, thereby enhancing conversion rates and commercial value. However, in real industrial scenarios, cross-domain recommendation faces multiple challenges: significant semantic gaps exist between different domains, and user cross-domain behavior sequences are often massive in scale and rich in noise. Although large language models (LLMs) possess powerful semantic understanding and reasoning capabilities, their millisecond-level inference latency makes direct application in online recommendation systems difficult. To address these issues, this paper introduces AIR (Atomic Intent Reasoning), an LLM-driven cross-domain recommendation framework designed for industrial-grade deployment. By migrating LLM inference to the offline phase and dynamically constructing user intent representations through efficient retrieval and composition during online operations, it achieves approximately 400× inference acceleration while maintaining semantic consistency. Experimental results across multiple public datasets demonstrate that our method achieves state-of-the-art performance in cross-domain recommendation tasks. Furthermore, large-scale online A/B testing conducted in Kuaishou E-commerce's real-world business scenarios shows that our approach delivers stable and significant improvements across multiple core business metrics, including a +3.446% increase in GMV, fully validating its effectiveness and practical value in industrial-scale recommendation systems.

2606.10350 2026-06-10 cs.CV new submission

Multi-Angular Reflectance Anisotropy Observed from UAV Multispectral Imagery

无人机多光谱影像观测的多角度反射率各向异性

Zhenqiang Qin, Chenguang Dai, Min Wang, Xian Li

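AI summary: Proposes a geometry-aware workflow for extracting multi-angular observations that quantifies observation-geometry effects from a BRDF perspective: camera parameters are refined with SFM, homogeneous regions are reprojected, and multi-band reflectance is extracted jointly with observation-geometry parameters; red-edge and near-infrared reflectance varies by 119-137%.

AI-generated Chinese abstract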

由于低空飞行和宽视场成像,无人机多光谱影像自然包含多角度观测,这可能引入几何驱动的辐射变异性。本研究提出一种几何感知的多角度观测提取流程,从BRDF角度量化观测几何效应。具体地,通过运动恢复结构(SFM)精化相机内参和外参,并将正射影像上标注的同质区域重投影到从不同视角获取的多个原始子图像上。这使得能够在不同观测方向下联合提取同一地面目标的多波段反射率和观测几何参数。进一步利用(VZA,RAA)域中的波段极坐标可视化分析提取的观测值。草地目标的结果显示,十个波段均存在明显的反射率各向异性,其中红边和近红外波段的最大与最小反射率变化达119-137%,表明观测几何效应对辐射一致性有不可忽视的影响。

English abstract

UAV multispectral imagery naturally contains multi-angular observations due to low flight altitude and wide field-of-view imaging, which may introduce geometry-driven radiometric variability. This study proposes a geometry-aware multi-angular observation extraction workflow to quantify observation-geometry effects from a BRDF perspective. Specifically, camera intrinsics and extrinsics are refined via structure-from-motion (SFM), and homogeneous regions annotated on an orthomosaic are reprojected onto multiple raw sub-images acquired from different viewpoints. This enables joint extraction of multi-band reflectance and observation geometry parameters for the same ground targets under varying viewing directions. The extracted observations are further analyzed using band-wise polar visualization in the (VZA, RAA) domain. Results on a grassland target show clear reflectance anisotropy across ten bands, with red-edge and near-infrared bands exhibiting 119-137% variability between maximum and minimum reflectance, indicating non-negligible observation-geometry effects on radiometric consistency.
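
The (VZA, RAA) coordinates used in this abstract can be derived from a camera position and a ground target once both are expressed in a common frame. The sketch below assumes a local east-north-up frame in metres, azimuths measured clockwise from north, and RAA folded into [0, 180] degrees. Conventions for RAA differ between studies, and the abstract does not state which one the authors use.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # (east, north, up) in metres


def view_angles(camera: Point, target: Point, sun_azimuth_deg: float) -> Tuple[float, float]:
    """Return (view zenith angle, relative azimuth angle) in degrees."""
    d_east = camera[0] - target[0]
    d_north = camera[1] - target[1]
    d_up = camera[2] - target[2]
    if d_up <= 0:
        raise ValueError("camera must be above the target")
    # Zenith angle of the target-to-camera direction.
    vza = math.degrees(math.atan2(math.hypot(d_east, d_north), d_up))
    # Azimuth of the target-to-camera direction, clockwise from north.
    vaa = math.degrees(math.atan2(d_east, d_north)) % 360.0
    diff = abs(vaa - sun_azimuth_deg) % 360.0
    raa = 360.0 - diff if diff > 180.0 else diff
    return vza, raa


# Camera 100 m east of and 100 m above the target, sun due south (azimuth 180 degrees).
vza, raa = view_angles((100.0, 0.0, 100.0), (0.0, 0.0, 0.0), sun_azimuth_deg=180.0)
```

Computing one such pair for every raw sub-image in which a region appears gives the sample points for a band-wise polar plot of reflectance.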

2606.10348 2026-06-10 cs.RO 新提交

Rethinking Embodied Navigation via Relational Inductive Bias

通过关系归纳偏差重新思考具身导航

Weitao An, Chenghao Xu, Xu Yang, Cheng Deng

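AI summary: Proposes DB-Nav, a framework that reshapes the search space with dual relational biases, an activation bias and an inhibition bias, and modulates frontier exploration through a relational activation-inhibition exploration graph, significantly improving object-navigation success rate and path efficiency.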

详情
AI中文摘要

目标导航要求智能体通过视觉观察在未知环境中定位目标。现有方法通常依赖开放词汇检测器或视觉语言模型(VLM)来回答在哪里搜索,但往往忽略了什么不可信——哪些语义线索不可靠。开放词汇感知容易产生系统性误导证据:误报、过时的静态先验以及由于缺乏具身验证而导致的重复失败探索,这会污染地图构建和决策制定。此类错误根植于真实场景中的结构化对象关系。为解决此问题,我们提出DB-Nav,一个通过双关系偏置重塑搜索空间的框架。它将目标中心关系分解为激活偏置(传播上下文证据)和抑制偏置(通过感知混淆和动作级证伪抑制不可靠区域)。这些偏置统一到一个关系激活-抑制探索图中,该图利用在线观察和失败访问来调节前沿探索值。在ObjectNav基准上的实验表明,DB-Nav在成功率(SR)和路径长度加权成功率(SPL)上显著优于现有方法,提供了一个轻量级、可解释且鲁棒的导航框架,无需昂贵的在线VLM推理。

英文摘要

Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust, that is, which semantic cues are unreliable. Open-vocabulary perception is prone to systematic misleading evidence: false positives, outdated static priors, and repeated failed exploration due to lack of embodied verification, which contaminates mapping and decision-making. Such errors are rooted in structured object relations in real-world scenes. To address this, we propose DB-Nav, a framework that reshapes the search space via dual relational biases. It factorizes target-centric relations into an Activation Bias (propagates contextual evidence) and an Inhibition Bias (suppresses unreliable regions via perceptual confusion and action-level falsification). These biases are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values using online observations and failed accesses. Experiments on ObjectNav benchmarks show that DB-Nav significantly outperforms existing methods in success rate (SR) and Success weighted by Path Length (SPL), offering a lightweight, interpretable, and robust navigation framework without costly online VLM reasoning.

2606.10347 2026-06-10 cs.LG cs.LO 新提交

Beyond Explaining Predictions: Logic-Based Explanations for Confidence in Machine Learning Models

超越预测解释:基于逻辑的机器学习模型置信度解释

Vinícius Peixoto Chagas, Carlos Henrique Leitão Cavalcante, Thiago Alves Rocha

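AI summary: Proposes confidence-aware abductive explanations, uses a minimum confidence threshold to quantify the confidence guarantee of an explanation, and designs an algorithm that generates minimal explanations meeting a user-specified confidence threshold, raising the confidence guarantee with only a modest increase in explanation length.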

详情
AI中文摘要

机器学习越来越多地应用于关键领域,在这些领域中,预测及其相关的置信水平都会影响重要决策。为了增强此类场景的透明度,理解模型为何对其预测有信心或不确定非常重要。最近的基于逻辑的方法提供了反绎解释,即足以保持预测类别的最小特征子集,并具有正确性保证。然而,这些方法仅关注分类行为,可能产生覆盖低预测置信度实例的解释。在这项工作中,我们引入了最小置信度阈值(MCT)的概念,它量化了反绎解释提供的最弱置信度保证。基于这一概念,我们提出了置信度感知的反绎解释,它不仅保持预测类别,还保持用户指定的置信度保证。我们将MCT计算表述为一个优化问题,并引入了一种算法,用于生成满足所需置信度阈值的最小解释。我们在用于二分类的提升树上评估了所提出的框架,尽管该方法也适用于其他提供置信度分数的机器学习模型。实验结果表明,传统的反绎解释通常提供比被解释实例本身相关的置信度弱得多的置信度保证。相比之下,置信度感知的解释持续提高了解释所保证的最小置信度,同时仅需要适度增加解释长度。这些特性使得所提出的方法特别适用于预测正确性和置信度对于可信决策都至关重要的应用。

英文摘要

Machine learning is increasingly used in critical domains, where both predictions and their associated confidence levels influence important decisions. To enhance transparency in such scenarios, it is important to understand why a model is confident or uncertain about its predictions. Recent logic-based approaches provide abductive explanations, minimal subsets of features sufficient to preserve the predicted class, with correctness guarantees. However, these methods focus solely on classification behavior and may produce explanations that cover instances with low predictive confidence. In this work, we introduce the concept of Minimum Confidence Threshold (MCT), which quantifies the weakest confidence guarantee provided by an abductive explanation. Building upon this concept, we propose confidence-aware abductive explanations, which preserve not only the predicted class but also a user-specified confidence guarantee. We formulate MCT computation as an optimization problem and introduce an algorithm for generating minimal explanations that satisfy a desired confidence threshold. We evaluate the proposed framework on boosted trees for binary classification, although the approach is applicable to other machine learning models that provide confidence scores. Experimental results show that traditional abductive explanations often provide substantially weaker confidence guarantees than the confidence associated with the explained instance itself. In contrast, confidence-aware explanations consistently improve the minimum confidence guaranteed by an explanation while requiring only a modest increase in explanation length. These properties make the proposed approach particularly suitable for applications where both predictive correctness and confidence are essential for trustworthy decision making.
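
The MCT idea can be illustrated on a toy model. The sketch below assumes small discrete feature domains so the minimum can be found by enumeration; the paper formulates MCT computation as an optimization problem over boosted trees, which this brute-force version does not reproduce. `toy_confidence`, `minimum_confidence_threshold`, and `confidence_aware_explanation` are hypothetical names introduced for illustration.

```python
from itertools import product


def toy_confidence(x):
    """Hypothetical confidence in class 1 for three binary features."""
    return 0.5 + 0.3 * x[0] + 0.1 * x[1] + 0.05 * x[2]


def minimum_confidence_threshold(confidence, instance, fixed, domains):
    """Smallest confidence over all inputs that agree with `instance` on `fixed`."""
    free = [i for i in range(len(instance)) if i not in fixed]
    worst = float("inf")
    for values in product(*(domains[i] for i in free)):
        x = list(instance)
        for i, v in zip(free, values):
            x[i] = v
        worst = min(worst, confidence(x))
    return worst


def confidence_aware_explanation(confidence, instance, domains, tau):
    """Deletion-based search: drop a feature whenever the guarantee stays >= tau."""
    fixed = set(range(len(instance)))
    for i in range(len(instance)):
        trial = fixed - {i}
        if minimum_confidence_threshold(confidence, instance, trial, domains) >= tau:
            fixed = trial
    return fixed


instance = [1, 1, 1]
domains = [(0, 1), (0, 1), (0, 1)]
explanation = confidence_aware_explanation(toy_confidence, instance, domains, tau=0.82)
```

Because fixing more features can only raise the worst-case confidence, a single deletion pass yields a subset-minimal explanation in this toy setting. Here the instance itself has confidence 0.95, while fixing only feature 0 guarantees about 0.8, which shows the gap between instance confidence and explanation guarantee that the abstract refers to.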

2606.10346 2026-06-10 cs.AI 新提交

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

推理还是记忆?LLM强化学习中的方向感知多样性探索

Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu

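AI summary: Proposes DiRL, a framework that uses direction-aware rewards to distinguish reasoning-driven from memorization-driven exploration and integrates direction-weighted gradient features into GRPO, significantly improving mathematical and general reasoning performance.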

详情
Comments
12 pages, 6 figures
AI中文摘要

强化学习已成为激发大型语言模型推理能力的关键范式,其中探索对于发现有效解轨迹至关重要。现有的探索方法通常鼓励语义或梯度空间中的多样性,而不区分驱动这种多样性的因素。一条轨迹可能因为遵循新的推理过程而显得新颖,也可能因为变化了记忆模式和捷径。对这两种情况给予同等奖励可能会将探索导向记忆而非真正的推理改进。在本文中,我们提出DiRL,一种方向感知强化学习框架,将探索锚定到策略内部的推理-记忆方向。具体地,DiRL从模型表示中提取该方向,构建方向加权梯度特征以表征轨迹更新,并塑造奖励以放大推理对齐的探索,同时抑制记忆对齐的变化。DiRL无缝集成到标准的组相对策略优化(GRPO)中。在数学和通用推理基准上的大量实验证明了DiRL的有效性,显示出相对于各种现有探索方法的显著改进。

英文摘要

Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.
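
A rough sketch of how a direction-aligned bonus could enter group-relative advantages follows. The group normalization (reward minus group mean, divided by group standard deviation) is the usual GRPO form; the cosine-similarity bonus, the coefficient `beta`, and the per-rollout feature vectors are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def shaped_group_advantages(rewards, features, direction, beta=0.1, eps=1e-8):
    """Group-normalized advantages after adding an assumed alignment bonus.

    rewards:   task reward for each rollout in one group
    features:  hypothetical per-rollout feature vectors
    direction: hypothetical reasoning-vs-memorization direction
    """
    shaped = [r + beta * cosine(f, direction) for r, f in zip(rewards, features)]
    mean = sum(shaped) / len(shaped)
    std = math.sqrt(sum((s - mean) ** 2 for s in shaped) / len(shaped))
    return [(s - mean) / (std + eps) for s in shaped]
```

With equal task rewards, a rollout whose features point along the direction receives a higher advantage than one pointing against it, which is the qualitative behavior the abstract describes.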

2606.10340 2026-06-10 cs.RO 新提交

OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

OMG: 面向通用人形机器人的全模态运动生成

Siqiao Huang, Kun-Ying Lee, Dongming Qiao, Guanqi He, Zhenyu Wang, Yitang Li, Shaoting Zhu, Hang Zhao

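AI summary: Proposes OMG, which combines a carefully curated data pipeline with a diffusion model to enable omni-modal whole-body control conditioned on language, audio, and reference motions, demonstrating state-of-the-art performance and scalability.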

详情
Comments
Project Page: https://tsinghua-mars-lab.github.io/OMG/
AI中文摘要

近年来,人形机器人全身控制取得了显著进展,但现有方法仍局限于需要大量奖励工程的少数技能策略,或难以扩展到新输入模态的运动跟踪器。我们认为,通用人形机器人的关键在于构建一个可扩展的大脑——一个能够处理多种条件模态的模块,位于反应式运动跟踪小脑之上,模仿生物运动系统的层次结构。实现这一愿景面临两个挑战:获取大量高质量数据以实现通用控制,以及使生成器具备处理组合式、可扩展的多模态输入的能力。我们提出了OMG,通过精心策划的数据整理、过滤和标注流程,以及基于扩散的运动生成骨干网络(可条件于语言、音频和人类参考运动),解决了这些挑战。大量实验验证了OMG作为全模态全身控制器的性能,展示了最先进的结果、模型扩展行为以及对新分布和模态的高效适应,标志着向人形机器人基础模型迈出了具体一步。

英文摘要

Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies with heavy reward engineering, or motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning with diverse conditioning modalities, atop a reactive motion tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general-purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering, and labeling pipeline, as well as a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller exhibiting state-of-the-art performance, model scaling behavior, and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.

2606.10338 2026-06-10 cs.CL cs.AI 新提交

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

路由感知的专家校准用于混合专家语言模型中的机器遗忘

Jingyi Xie, Yijun Lin, Yinjiang Xiong, Zhikun Zhang, Sai Li

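AI summary: Addresses the under-regularization of forget-critical experts caused by the routing mismatch between forget and retain data in MoE models. Proposes TRACE, which detects forget-critical experts from offline activation statistics and reweights the retain loss to calibrate retain-side activation frequency; experiments on WMDP and MUSE-BOOKS show an improved forget-utility trade-off, with a 9% relative utility gain.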

详情
AI中文摘要

机器遗忘对于大型语言模型越来越重要,然而混合专家(MoE)架构中的遗忘仍未得到充分探索。与密集模型不同,MoE架构在每一层使用路由器将每个令牌分配给稀疏的专家子集。在这项工作中,我们观察到遗忘数据往往不成比例地激活一小部分专家,而这些专家可能从保留数据中接收到更弱的激活。这种遗忘-保留路由不匹配可能导致遗忘关键专家在遗忘过程中正则化不足。为了解决这个问题,我们提出了\textbf{TRACE},即针对MoE遗忘的目标路由感知专家校准。TRACE首先从离线激活统计中检测遗忘关键专家,然后通过重新加权令牌级保留损失来校准保留正则化,使得每个选定专家的保留侧激活频率更好地匹配其遗忘侧对应频率。在多个MoE LLM上的WMDP和MUSE-BOOKS实验表明,TRACE一致地改善了遗忘-效用权衡,在相当的遗忘质量下,相对于最强基线实现了9%的相对效用提升,并在MUSE-BOOKS的四个指标中的三个上取得了最佳性能。

英文摘要

Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget-retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose TRACE, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.
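
One way to picture the reweighting step is sketched below. It assumes that each retain token carries the list of experts it was routed to, and that a token's weight is the largest forget-to-retain frequency ratio among the selected experts it touches; this rule, the normalization to mean 1, and all names are illustrative assumptions, not the TRACE implementation.

```python
def retain_token_weights(token_experts, forget_freq, retain_freq, selected, eps=1e-8):
    """Per-token weights that upweight tokens routed to under-covered experts.

    token_experts: for each retain token, the expert ids it is routed to
    forget_freq / retain_freq: expert id -> activation frequency (assumed given)
    selected: set of expert ids treated as forget-critical
    """
    weights = []
    for experts in token_experts:
        ratios = [
            forget_freq[e] / (retain_freq[e] + eps)
            for e in experts
            if e in selected
        ]
        weights.append(max(ratios) if ratios else 1.0)
    total = sum(weights)
    # Normalize so the weights average to 1 and the loss scale is preserved.
    return [w * len(weights) / total for w in weights]


def weighted_retain_loss(token_losses, weights):
    """Mean of token-level retain losses under the given weights."""
    return sum(l * w for l, w in zip(token_losses, weights)) / len(token_losses)
```

Tokens that pass through a selected expert with a high forget-side frequency and a low retain-side frequency contribute more to the retain loss, which is the calibration effect described in the abstract.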

2606.10334 2026-06-10 cs.AI 新提交

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

通过视觉反馈的自蒸馏策略优化:连接代码与视觉工件

Haoyu Dong

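AI summary: Proposes Visual-SDPO, a framework that uses rendered visual feedback as privileged context and improves the quality of code-generated visual artifacts through self-distillation and visual-grounded code credit weighting, with significant gains on chart, UI, and slide generation tasks.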

详情
AI中文摘要

代码生成大语言模型(LLMs)通过编写由不可微渲染器执行的程序,越来越多地生成图表、网页和幻灯片等视觉工件,在观察渲染结果之前就确定了代码。因此,原本可执行的代码常常产生具有视觉显著缺陷的工件,包括元素重叠、文本裁剪、对齐破坏、对比度低和溢出。我们研究针对代码生成视觉工件的视觉反馈自蒸馏。我们提出Visual-SDPO,一种自蒸馏策略优化框架,将渲染的视觉反馈视为权重共享教师的特权上下文,并将该反馈蒸馏到编码学生中。为了使监督具有空间针对性而非均匀性,我们引入视觉引导的代码信用加权,将每个检测到的缺陷追溯到影响该元素的代码语句,并放大这些语句上的蒸馏信号。序列级GRPO(组相对策略优化)项通过奖励可执行、视觉质量高的 rollout 来补充密集的 token 级目标,而失败的执行通过自蒸馏路径仍然可学习,通过将执行错误作为特权上下文传递给教师。我们使用统一的 Qwen3-VL-8B-Instruct 骨干网络,在图表、网页/UI和幻灯片生成任务上实例化 Visual-SDPO。在图表到代码、UI到代码和幻灯片生成基准(ChartMimic、Design2Code和AeSlides)上,Visual-SDPO 在主要指标上比零样本基线提升超过10个绝对点,比GRPO提升至少2.4个点,且训练步骤更少,无额外推理成本。

英文摘要

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

2606.10333 2026-06-10 cs.LG cs.CR 新提交

Privacy-Preserving Credit Risk Prediction with Alternative Data

基于替代数据的隐私保护信用风险预测

Hongzhe Zhang, Jiarong Xu, Jing He, Xiao Fang

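AI summary: Addresses the privacy leakage caused by sharing alternative data for credit risk prediction. Proposes PrivacyCredit, which, under consumer-privacy, model-confidentiality, and lossless-performance constraints, matches the predictive performance of the conventional plaintext combination of data.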

详情
AI中文摘要

信用风险预测是消费信贷行业中的一个关键问题。传统上,金融机构使用借款人的人口统计、财务和信用历史数据(统称为传统数据)构建信用风险预测模型。最近的研究表明,替代数据(如借款人的手机通信数据)使贷款人能够获得更全面、更准确的借款人信用状况画像,从而提高信用风险预测性能。然而,替代数据由独立于金融机构的外部实体持有。直接与金融机构共享替代数据会侵犯消费者隐私,但现有的信用风险预测研究大多忽略了这一问题。为填补这一空白,我们定义了一个新问题,即基于替代数据的隐私保护信用风险预测,该问题同时考虑三个实际约束:保护消费者隐私的隐私保护约束、在金融机构集中学习和存储模型的模型保密性约束,以及保持学习模型性能的无损约束。为解决该问题,我们开发了PrivacyCredit,一种新颖的隐私保护机器学习方法。然后,我们从理论上证明了PrivacyCredit的隐私保护、模型保密和无损特性。通过使用与替代数据关联的真实信用数据集进行大量实验,我们证明了安全地将替代数据纳入信用风险预测的预测价值,并表明PrivacyCredit实现了与从传统数据和替代数据的不安全明文组合中学习的模型相同的预测性能。我们进一步评估了其模型保密性和计算效率。

英文摘要

Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data, collectively referred to as traditional data. Recent studies have demonstrated that alternative data, such as borrowers' mobile phone communication data, enable lenders to acquire fuller and more accurate profiles of borrowers' creditworthiness, thereby improving credit risk prediction performance. Nevertheless, alternative data are held by external entities independent of financial institutions. Directly sharing alternative data with financial institutions infringes on consumer privacy, yet existing credit risk prediction studies largely overlook this issue. To address this gap, we define a new problem, namely privacy-preserving credit risk prediction with alternative data, which simultaneously considers three practical constraints: the privacy-preserving constraint that protects consumer privacy, the model-confidentiality constraint that learns and stores the model centrally at the financial institution, and the lossless constraint that maintains the performance of the learned model. To solve this problem, we develop PrivacyCredit, a novel privacy-preserving machine learning method. We then theoretically demonstrate the privacy-preserving, model-confidential, and lossless properties of PrivacyCredit. Through extensive experiments using a real-world credit dataset linked with alternative data, we demonstrate the predictive value of securely incorporating alternative data into credit risk prediction and show that PrivacyCredit achieves the same predictive performance as the model learned from the insecure plaintext combination of traditional and alternative data. We further evaluate its model-confidentiality property and computational efficiency.

2606.10329 2026-06-10 cs.CV cs.AI 新提交

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

地震中的建筑变化检测:一种多尺度交互网络和一个变化检测数据集

Yunlong Liu, Zekai Zhang

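AI summary: Addresses the difficulty of change detection with short imaging intervals after earthquakes. Builds the Turkey earthquake change detection dataset (TUE-CD) and proposes a multi-scale feature interaction network (MSI-Net) whose joint cross-attention and multi-scale offset calibration modules mitigate side-looking problems and improve change detection accuracy.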

详情
AI中文摘要

作为最具破坏性的自然灾害之一,近年来地震袭击了世界许多国家,造成了严重的经济损失。变化检测(CD)可应用于震后损伤评估,因为它能从多时相遥感图像中推断出被破坏的变化区域。此外,短成像间隔的变化检测将更好地满足地震后紧急救援的需求。然而,由于缺乏短成像间隔的数据集,当前基于深度神经网络的方法的能力受到限制。为了满足灾后即时救援的需求,我们创建了一个变化检测数据集——土耳其地震变化检测数据集(TUE-CD),用于评估地震后短期内的建筑损坏情况。由于后事件图像的采集间隔短,不同时相图像的成像角度不同,导致了一些侧视问题。为了应对这些挑战,我们提出了一种多尺度特征交互网络(MSI-Net),用于双时相特征之间的高效交互,并减轻侧视问题的影响。具体来说,所提出的MSI-Net由联合交叉注意力(JCA)模块、多尺度偏移校准(MOC)模块和特征集成(FeI)模块组成。JCA模块统一了通道交叉注意力和空间联合注意力,以实现充分的特征交互。MOC模块进一步估计偏移量,以将双时相图像与多尺度特征对齐。最后,通过FeI模块融合校准后的特征和多尺度特征,用于变化区域的预测。在WHU-CD、CLCD和构建的TUE-CD数据集上的实验表明,所提出的MSI-Net比考虑的最先进的变化检测方法提供了更好的结果。

英文摘要

As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment because it can infer destroyed regions from multi-temporal remote sensing images. Furthermore, CD with a short imaging interval better satisfies the needs of emergency rescue after earthquakes. However, the capability of current methods built on deep neural networks is limited because datasets with short imaging intervals are lacking. To support immediate post-disaster relief, we create a CD dataset, the Turkey earthquake CD dataset (TUE-CD), for evaluating building damage shortly after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle differs between temporal images, which leads to side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features and for mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD datasets indicate that the proposed MSI-Net provides better results than the state-of-the-art CD methods considered.

2606.10328 2026-06-10 cs.CV cs.AI 新提交

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

内容诱导的空间-光谱聚合网络用于遥感图像变化检测

Yunlong Liu, Zekai Zhang

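AI summary: Proposes a content-guided spatial-spectral integration network (CSI-Net) that fuses global spatial details with spectral difference information through spatial reasoning, spectral difference, and content-guided integration modules, suppressing differences in unchanged areas and achieving the best performance on three datasets.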

详情
AI中文摘要

空间和光谱信息的整合有利于提高变化检测性能。然而,现有方法无法有效抑制未变化区域中空间和光谱差异的影响。为了解决这些问题,本文提出了一种内容引导的空间-光谱集成网络(CSI-Net),用于融合全局空间细节和光谱差异信息。具体而言,所提出的CSI-Net由空间推理(SR)模块、光谱差异(SD)模块和内容引导集成(CGI)模块组成。在SR模块中,通过级联图卷积块学习空间信息以进行全局建模。SD模块负责提取光谱特征,通过计算特征的均值和方差来减少未变化区域中光谱差异的影响。此外,为了有效集成空间-光谱特征,我们设计了CGI模块以进一步利用它们的互补信息。在该模块中,引入高层内容信息作为引导,以实现适当的交互。由于高效的空间-光谱融合,所提出的CSI-Net能够更好地学习变化特征,同时实现对光谱差异的抑制。在LEVIR-CD、WHU-CD和CLCD数据集上的实验结果表明,与最先进方法相比,所提出的CSI-Net产生了更好的性能,并且适用于不同场景。

英文摘要

The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while suppressing spectral differences. Experimental results on the LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net performs better than state-of-the-art methods and is applicable to different scenarios.

2606.10327 2026-06-10 cs.CL cs.LG 新提交

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

顺序重要:LLaMA的序列微调用于连贯的自动作文评分

Ali Keramati, Mark Warschauer

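AI summary: Proposes task-aware sequential fine-tuning of LLaMA-3.1-8B, trained in the order of essay discourse structure. On the PERSUADE 2.0 corpus it reaches 65% F1 on evidence and 87% F1 on conclusion, surpassing independent training and outperforming a 70B baseline on conclusion, and shows that curriculum design can improve automated essay scoring.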

详情
AI中文摘要

自动作文评分(AES)系统必须判断相互依赖的话语元素(如引言、立场、证据、结论),但大多数方法孤立地处理这些元素,损害了连贯性和泛化能力。我们研究了对LLaMA-3.1-8B进行任务感知的微调,用于AES,使用参数高效的LoRA和4位量化,并比较了三种训练课程:(i)序列式(依次在引言、立场、主张、证据、结论上微调),(ii)独立式(任务特定模型),以及(iii)随机式(打乱的多任务)。在PERSUADE 2.0语料上的实验表明,建模任务依赖性很重要:序列微调取得了最强的整体结果,包括证据的F1分数65%和结论的87%,以及相应的准确率63%和85%,超越了独立训练,并且在结论上优于通用LLaMA-70B基线,尽管后者容量大得多。随机训练提高了立场评分(57% F1),但在其他地方一致性较差。这些发现表明:(1)与话语结构对齐的课程设计可以实质性地改善AES,以及(2)小型、任务优化的模型可以与显著更大的大型语言模型(LLM)竞争,为可扩展、成本效益高的评估提供了实用途径。我们发布模板和实现细节,以促进复现和未来在教育NLP中课程设计的工作。

英文摘要

Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE 2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLMs), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.

2606.10316 2026-06-10 cs.CL 新提交

TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

TabClaw: 一个用于电子表格操作和表格推理的交互式自进化智能体

Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Qingyang Mao, Yitong Zhou, Qi Liu

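AI summary: Proposes TabClaw, an open-source interactive AI agent that improves the transparency and personalization of spreadsheet manipulation and table reasoning through editable execution plans, a streaming ReAct loop, parallel multi-table reasoning, and user memory extraction.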

详情
Comments
5 pages, 2 figures
AI中文摘要

电子表格和表格是结构化数据分析中广泛使用的表示形式,但有效分析仍需大量人工和领域专业知识。近期的大语言模型智能体可以自动化部分过程,但它们通常对中间决策提供有限的透明度,依赖隐含假设,难以处理多表比较,并且重复类似工作流而不适应用户偏好。本文提出TabClaw,一个用于电子表格操作和表格推理的开源交互式AI智能体。用户上传CSV或Excel文件并发出自然语言请求;TabClaw澄清模糊意图,展示可编辑执行计划,流式传输ReAct风格的工具使用分析循环,派遣专家智能体进行并行多表推理,并通过显式一致性和不确定性标记综合发现。除一次性分析外,TabClaw记录完成的工作流,提取持久用户记忆,从重复工具使用模式中提炼可复用技能,支持包式技能导入,并从负面反馈中升级技能。在电子表格操作和表格推理基准上的实验表明,TabClaw在提高可执行任务完成度和推理性能的同时,保持了可检查的用户工作流。本文展示了TabClaw如何将电子表格和表格转化为可检查的分析工作流,同时逐步个性化以适应重复的数据分析任务。我们的代码已公开。

英文摘要

Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi-table comparison, and repeat similar workflows without adapting to a user's preferences. This paper presents TabClaw, an open-source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural-language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct-style tool-using analysis loop, dispatches specialist agents for parallel multi-table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one-off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool-use patterns, supports package-style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data-analysis tasks. Our code is available.

2606.10315 2026-06-10 cs.CL cs.AI 新提交

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

捕捉五分之一:LLM作为评判员在生产环境多轮交易代理中的盲点

Sawyer Zhang, Alexander Wang, Sophie Lei

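AI summary: Studies the recall of real defects by the LLM judge of a deployed food-and-beverage ordering agent. The judge catches only 22% of systematic problems, mainly because the scoring rubric lacks behavioural dimensions such as state tracking and because the routing mechanism misclassifies defects.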

详情
Comments
13 pages, 1 figure, 5 tables
AI中文摘要

LLM作为评判员是评估对话代理的默认工具,但其可靠性几乎总是报告为与人类评分的一致性,而非真实缺陷的召回率。我们研究了一个已部署的多轮餐饮订购代理,并通过详尽的人工转录审查作为基准,衡量其内置LLM评判员捕获了多少真实质量问题。在三个批次中,评判员发现的系统性问题远低于人类确认的四分之一——在一个批次中,9种模式中只有2种(22%),而在另一个批次中,其操作门控标记了100轮中的0轮,而人类确认了23个不同缺陷和7个新的跨轮模式。我们的盲点分类表明,失败是有结构的,而非随机的:评判员能捕获轮次局部问题(虚构统计数据、错误语言),但遗漏了跨轮状态问题(确认门锁死、购物车幻觉、升级锁死、过时引用)。机制在于:评分标准仅暴露三个粗略轴(意图、品牌声音、个性化),且没有针对行为维度(状态跟踪、护栏、恢复)的类别,而大多数缺陷集中于此。失败在于路由而非感知:114轮中,113轮原始评判员注释描述了确认门或购物车状态缺陷,但被评分为“品牌声音”,且无一到达操作失败——门控连接到挂起和硬断言,而非评分标准——因此0%是路由和接线失败,而非失明。对流行率估计的影响是显著的:当表观缺陷率为零时,Rogan-Gladen校正退化——无信号可恢复真实率——而当门控报告非零率时,相同估计器在我们测量的灵敏度下暗示3-6倍的低估。对于生产环境多轮代理,自动评判是回归底线,而非人工审查的替代品。

英文摘要

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
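
The Rogan-Gladen estimator mentioned above corrects an apparent rate for imperfect sensitivity and specificity: true rate = (apparent + specificity - 1) / (sensitivity + specificity - 1). The sketch below is a minimal version of that standard formula; the inputs in the usage lines (sensitivity 0.22 taken from the 2-of-9 recall figure, perfect specificity, and a 5% apparent rate) are illustrative assumptions, not the paper's measured values.

```python
def rogan_gladen(apparent, sensitivity, specificity):
    """Estimate true prevalence from an apparent (judge-reported) rate."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; estimator undefined")
    estimate = (apparent + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


# Assumed inputs for illustration only.
zero_signal = rogan_gladen(0.0, sensitivity=0.22, specificity=1.0)
nonzero_signal = rogan_gladen(0.05, sensitivity=0.22, specificity=1.0)
```

With perfect specificity the estimate reduces to apparent / sensitivity, so an apparent rate of zero stays at zero whatever the sensitivity, while a nonzero apparent rate is scaled up by 1 / 0.22, roughly 4.5 times under these assumed inputs.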

2606.10314 2026-06-10 cs.AI 新提交

Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints

基于大语言模型驱动行为与运动约束的移动异常生成

Yueyang Liu, Joon-Seok Kim, Andreas Züfle

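AI summary: Proposes an end-to-end generative framework that combines LLM-injected semantic anomalies with map-constrained routing reconstruction to synthesize a realistic trajectory anomaly dataset with annotations.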

详情
AI中文摘要

尽管人类轨迹异常研究对于推进空间数据挖掘至关重要,但实证研究因缺乏真实标注数据集而严重受阻。现有真实和模拟轨迹数据集仅包含正常移动模式,缺乏异常标注。这种稀缺性源于异常事件的统计稀有性,使得传统观测方法不可行。此外,大规模移动数据的系统获取受高昂成本和严格隐私法规限制。为克服这些限制并建立可靠的带标注真实轨迹异常数据集,我们提出一种新颖的端到端生成框架,用于大规模合成逼真的轨迹异常。该架构直接在基线模拟轨迹上操作,弥合纯合成移动数据与复杂真实物理约束之间的差距。我们利用大语言模型(LLM)代理系统性地注入语义上有意义的异常行为,例如不规则分布外签到和跳过常规访问。为确保空间有效性,系统利用地图约束路由重建重新计算LLM代理修改停留点之间的物理转移。此外,为缩小模拟与现实的差距,我们通过上下文感知的空间噪声模型增强生成轨迹,该模型由环境和位置特定变量参数化,以准确模拟异构GPS传感器退化。

英文摘要

Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindered by a pervasive lack of ground-truth datasets. Despite the availability of several real-world and simulated human trajectory collections, these datasets exclusively capture normal mobility patterns and lack annotated anomalies. This specific scarcity is fundamentally driven by the inherent statistical rarity of anomalous events, precluding the feasibility of conventional observational methods. Compounding this challenge, the systematic acquisition of large-scale mobility data is strictly bottlenecked by prohibitive costs and stringent privacy regulations. To overcome these fundamental limitations and establish a reliable human trajectory anomalies dataset with annotated ground truth, we introduce a novel, end-to-end generative framework designed to synthesize realistic trajectory anomalies at scale. Our architecture bridges the gap between purely synthetic mobility data and complex real-world physical constraints by operating directly on baseline simulated trajectories. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies such as irregular out-of-distribution check-ins and skipped routine visits. To ensure rigorous spatial validity, the system leverages map-constrained routing reconstruction to recalculate the physical transitions between these LLM agent-modified staypoints. Moreover, to narrow the simulation-to-reality gap, we augment the resulting trajectories with a context-aware spatial noise model, parameterized by environmental and location-specific variables, to accurately emulate heterogeneous GPS sensor degradation.

2606.10309 2026-06-10 cs.CV 新提交

Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection

剖析与剪枝:增强AI生成图像检测的鲁棒性

Dahye Kim, Jaehyun Choi, Hyun Seok Seong, Seongho Kim, Donghun Lee, Sungwon Yi, Jang-Ho Choi

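AI summary: Addresses the prediction bias of AI-generated image detectors toward the real class. Proposes DEAR, which uses inpainted images to identify and prune interfering features, improving robustness to unseen generators and post-processing.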

详情
Comments
25 pages, 9 figures, 9 tables, Accepted to ICML 2026; includes appendix
AI中文摘要

虽然现有的AI生成图像检测器报告了高性能,但我们发现这主要是由一种关键的预测不对称性驱动的:对真实类别的偏见严重限制了其对生成内容的敏感性,尤其是在压缩和调整大小等标准后处理操作下。我们假设这源于模型对虚假特征的依赖,这些干扰信号掩盖了真正的生成伪影。为了解决这个问题,我们提出了DEAR(剖析与剪枝),它利用修复图像来识别和剪除这些干扰成分。具体来说,我们发现与修复区域或非修复区域强烈对齐的特征对后处理的鲁棒性较差。通过测量通道激活与修复掩码之间的对齐程度,DEAR移除两端的特征,仅保留那些捕捉真实生成伪影的特征。实验结果表明,我们的方法显著增强了对未见过的生成器和后处理的鲁棒性,有效缓解了预测不对称性。我们的代码可在该 https URL 获取。

英文摘要

While existing AI-generated image detectors report high performance, we identify that this is largely driven by a critical prediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model's reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned to either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry. Our code is available at https://github.com/dahyedahye/dear.
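
The pruning rule can be pictured with a small sketch. It assumes that alignment is measured as the Pearson correlation between a channel's flattened activation map and the binary inpaint mask, and that channels beyond a fixed absolute threshold are dropped; the alignment measure, the threshold, and the function names are assumptions for illustration, and the released code may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def channels_to_keep(activations, mask, threshold=0.8):
    """Indices of channels not strongly aligned with either region.

    activations: one flattened activation map per channel
    mask: flattened inpaint mask (1 = inpainted pixel, 0 = original pixel)
    """
    return [
        c for c, act in enumerate(activations)
        if abs(pearson(act, mask)) <= threshold
    ]
```

A correlation near +1 marks a channel that fires on inpainted regions and a correlation near -1 marks one that fires on untouched regions; both extremes are removed, matching the abstract's description of pruning at both ends.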

2606.10307 2026-06-10 cs.CL 新提交

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

早期令牌置信度预测多智能体LLM辩论中的推理质量

Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

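AI summary: Uses token-level log-probabilities at decoding time as a confidence signal to predict reasoning quality in multi-agent LLM debate, and finds early-token confidence to be the strongest predictor.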

详情
Comments
15 pages, 8 figures, 4 tables; ACL Proceedings
AI中文摘要

评估多智能体LLM系统中的推理质量具有挑战性,尤其是对于没有参考答案的开放任务。我们研究了内在置信度信号(解码时的令牌级对数概率)是否能预测由LLM作为评判者评估的推理质量。使用基于辩论的论文评分框架,我们在两个ASAP论文集上比较了置信度代理与基于评分标准的评判者分数。我们发现,早期令牌置信度,特别是在生成的前几个令牌内,始终是推理质量的最强预测因子,优于全序列统计量。对数概率轨迹分析表明,生成的起始阶段是最异质的,因此信息量最大。我们还观察到智能体角色之间存在系统性不对称,支持性推理的置信度与质量之间的对齐强于对抗性批评。这些结果表明,早期解码动态为估计多智能体LLM系统中的推理可靠性提供了轻量级且有效的信号。

英文摘要

Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
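
As a minimal sketch of the two kinds of confidence proxy being compared, the functions below compute the mean log-probability over the first `k` generated tokens and over the full sequence. The window size `k=5` and the use of a plain mean are assumptions; the abstract only says "the first few generated tokens".

```python
def early_token_confidence(logprobs, k=5):
    """Mean log-probability of the first k generated tokens (k is an assumed window)."""
    head = logprobs[:k]
    if not head:
        raise ValueError("need at least one token log-probability")
    return sum(head) / len(head)


def full_sequence_confidence(logprobs):
    """Mean log-probability over the whole generated sequence."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(logprobs) / len(logprobs)
```

Either value could then be correlated with rubric-based judge scores across debate turns to check which proxy tracks reasoning quality more closely.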