arXivDaily: arXiv daily academic digest, updated Monday through Friday
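2606.10223 2026-06-10 cs.SD cs.AI cs.CV New submission

Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

Awais Khan, Kutub Uddin, Khalid Malik

AI summary: For open-set audio deepfake source tracing, the paper proposes a dual-branch gated fusion framework that combines XLSR-53 with the CORES descriptor and weights the two branches adaptively through an input-conditioned gate, achieving high in-domain accuracy and robust out-of-domain generalization.

Details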
Abstract

Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.
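The gating idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the gate's form, so the linear-plus-softmax gate, the parameter names `gate_w` and `gate_b`, and the assumption that both branches are already projected to the same dimension are all hypothetical. The energy score uses the common negative log-sum-exp form, which is likewise an assumption about what the energy margin loss operates on.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(ssl_feat, cores_feat, gate_w, gate_b):
    """Fuse two same-sized branch embeddings with an input-conditioned gate.

    gate_w (2 rows over the concatenated input) and gate_b are hypothetical
    gate parameters; the gate outputs one weight per branch, summing to 1.
    """
    joint = ssl_feat + cores_feat  # list concatenation of both branches
    scores = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(gate_w, gate_b)]
    g_ssl, g_cores = softmax(scores)
    fused = [g_ssl * a + g_cores * b for a, b in zip(ssl_feat, cores_feat)]
    return fused, (g_ssl, g_cores)

def energy_score(logits):
    """Negative log-sum-exp of the logits (assumed energy form).

    Confident in-domain inputs give low energy; flat logits give higher energy.
    """
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))
```

Because the gate weights depend on the input, the model can lean on one branch for in-domain utterances and on the other under distribution shift, which plain concatenation cannot do.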

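2606.10219 2026-06-10 cs.LG cs.AI New submission

Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series

Henry Han, Diane Li

AI summary: To meet the real-time demands created by growing high-frequency financial data, the paper proposes a Mojo-based SIMD k-d tree that uses variance-based splitting, contiguous storage, and compile-time vectorized distance computation, achieving 17.5-43.5x speedups while preserving exact outputs and enabling an option pricing model to train on 10x more data.

Details
Comments
15 pages, 5 figures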
Abstract

AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems: models must learn from larger historical corpora while still meeting real-time latency constraints in trading, risk management, and derivative pricing. We use exact nearest-neighbor learning for high-frequency financial time series as a concrete case study to show that Mojo-based financial AI can address this challenge. We introduce a Mojo SIMD k-d tree with variance-based splitting, contiguous flat-buffer storage, and compile-time vectorized distance computation. We also provide a runtime result showing that, under standard pruning and implementation-cost assumptions, the Mojo SIMD k-d tree asymptotically dominates Mojo SIMD brute force and scikit-learn's k-d tree in the fixed-stock, large-n, moderate-dimensional regime. Empirically, across eight financial datasets on x86 and ARM64 with up to 277K training samples, the method achieves 17.5-21.6x speedup over scikit-learn's k-d tree on x86 and 28.1-43.5x over scikit-learn brute force on ARM64 equity/ETF datasets, while preserving exact outputs. Beyond nearest-neighbor inference, Mojo's compiled execution enables an Extra Trees-based implied-volatility pricing model to train on 10x more options data, reducing put-IV RMSE by 8.0%. These results position Mojo as a scalable, production-ready stack for financial AI and a promising foundation for efficient AI in other data-intensive fields.
Keywords: Financial AI, AI Efficiency, Mojo, SIMD, K-D Trees, KNN, High-Frequency Trading, Financial Time Series, Scaling
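To make "exact" and "variance-based splitting" concrete, here is a small pure-Python sketch of a k-d tree that splits on the highest-variance axis and prunes subtrees only when they provably cannot hold a closer point. It is written in Python for illustration, not Mojo, and it omits the SIMD and flat-buffer aspects entirely; the function names and the dict-based node layout are assumptions, not the paper's code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build(points, idxs=None):
    """Build a k-d tree, splitting at the median of the highest-variance axis."""
    if idxs is None:
        idxs = list(range(len(points)))
    if len(idxs) <= 1:
        return {"leaf": idxs}

    def variance(axis):
        vals = [points[i][axis] for i in idxs]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)

    axis = max(range(len(points[0])), key=variance)
    idxs = sorted(idxs, key=lambda i: points[i][axis])
    mid = len(idxs) // 2
    return {"axis": axis, "split": points[idxs[mid]][axis],
            "left": build(points, idxs[:mid]),
            "right": build(points, idxs[mid:])}

def nearest(node, points, q, best=None):
    """Return (squared_distance, index) of the exact nearest neighbour of q."""
    if "leaf" in node:
        for i in node["leaf"]:
            cand = (sq_dist(points[i], q), i)
            if best is None or cand < best:
                best = cand
        return best
    diff = q[node["axis"]] - node["split"]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, points, q, best)
    # Visit the far side only if the splitting plane is within the best radius.
    if best is None or diff * diff <= best[0]:
        best = nearest(far, points, q, best)
    return best

def brute_force(points, q):
    """Reference answer: scan every point."""
    return min((sq_dist(p, q), i) for i, p in enumerate(points))
```

The pruning test never discards a subtree that could contain a closer point, so the tree returns the same answer as the brute-force scan while typically visiting far fewer points.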

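2606.10217 2026-06-10 cs.LG cs.CR New submission

Alignment Defends LLMs from Property Inference Attacks

Pengrun Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri

AI summary: Proposes alignment-based defenses that reshape the model's output distribution through post-training, mitigating property inference attacks without modifying the training data while preserving a balance between utility and confidentiality.

Details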
Abstract

Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted through property inference attacks, posing a confidentiality risk. Existing defenses against these attacks primarily operate by modifying the training data distribution and hence require access to the original data and retraining the model, limiting their applicability to settings where data is unavailable or models are already deployed. In this work, we propose alignment-based defenses for mitigating property inference attacks in LLMs. Our approach reshapes the model's output distribution towards a target property ratio via post-training alignment, without modifying the training data. In particular, we adapt two widely used RLHF frameworks, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), as our defenses by constructing preference pairs and defining a specific reward function, respectively. Through comprehensive experiments, we show that our alignment-based defenses effectively mitigate property inference attacks while maintaining a strong utility-confidentiality tradeoff.

2606.10216 2026-06-10 cs.LG cs.AI New submission

A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport

Sidahmed Benabderrahmanea, Petko Valtchev, James Cheney, Talal Rahwan

AI summary: To address the lack of target-domain labels in cross-OS APT detection, the paper proposes an optimal-transport-based, source-only anomaly scoring framework that ranks anomalies under zero target supervision using semantic abstraction and three deviation channels.

Details
Abstract

Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These challenges are amplified in cross-operating-system (cross-OS) settings, where a detector trained on one source platform must be deployed on an unlabeled target platform without access to target-domain labels. We study this source-only cross-OS APT detection problem using system-level provenance traces and propose a transport-based framework for ranking anomalous target processes under zero target supervision. The framework abstracts process behavior into structured natural-language descriptions, embeds them using pretrained language models, and constructs a source-normal reference for target scoring. It combines three evidence channels: semantic deviation from source-normal prototypes, structural deviation captured by graph autoencoding, and geometric deviation measured through Optimal Transport (OT). The main contribution is an OT-based barycentric anomaly score that projects target embeddings onto the source-normal manifold and quantifies residual transport mismatch. We further introduce entropy-weighted, angle-aware, and density-aware OT variants to capture uncertainty, directional drift, and sparse-support behavior. Evaluation on DARPA Transparent Computing data spanning Linux, Windows, BSD, and Android, across two APT scenarios and twelve cross-OS transfer pairs, shows that the proposed framework improves ROC-AUC and nDCG over source-only anomaly-detection baselines. The results demonstrate that source-only provenance modeling, combined with semantic abstraction and OT-based anomaly scoring, can support practical cross-platform APT detection without target-domain supervision.
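One way to read the "barycentric anomaly score" is sketched below, under explicit assumptions: entropic OT solved with plain Sinkhorn iterations, uniform marginals, squared Euclidean cost, and a score equal to the distance between a target point and its barycentric projection onto the source-normal points. The abstract does not specify any of these choices, and all function names are hypothetical. Large costs relative to `reg` would underflow in this naive version, so it suits only small, well-scaled toy inputs.

```python
import math

def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def barycentric_scores(targets, sources, reg=0.1):
    """Score each target by its distance to its OT barycentric projection."""
    cost = [[sum((x - y) ** 2 for x, y in zip(t, s)) for s in sources]
            for t in targets]
    plan = sinkhorn_plan(cost, reg)
    scores = []
    for i, t in enumerate(targets):
        mass = sum(plan[i])
        # Weighted average of source points that target i is transported to.
        proj = [sum(plan[i][j] * sources[j][d] for j in range(len(sources))) / mass
                for d in range(len(t))]
        scores.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(t, proj))))
    return scores
```

A target that sits inside the source-normal cloud projects almost onto itself and scores near zero, while a target far from every source point keeps a large residual, which is the ranking signal.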

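2606.10213 2026-06-10 cs.SD cs.AI New submission

Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning

Diane Myung-kyung Woodbridge, Jee Hyun Suh

AI summary: Proposes an end-to-end pronunciation evaluation pipeline for Korean toddler speech that combines neural speaker diarization with self-supervised learning, introduces a corpus of recordings from 53 children aged 2-5, and reaches a mean balanced accuracy of 0.782 on consonant and vowel classification through a multi-model ensemble.

Details
Comments
This paper will be presented at IEEE ICTs4ehealth in June 2026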
Abstract

Speech sound disorders affect approximately 44% of Korean pediatric communication disorder cases, yet automated assessment tools for Korean toddler speech remain underdeveloped. This paper presents an end-to-end pipeline for automated pronunciation evaluation of Korean toddler speech, combining neural speaker diarization with self-supervised speech representation learning. We introduce a novel IRB-approved corpus of 53 recordings from Korean-speaking children aged 2-5 years. A subset of 53 subjects was annotated by three independent reviewers, yielding 1,190 consonant and 748 vowel word-level binary correctness labels. We evaluate three diarization models, finding that NeMo SortFormer achieves 88.69% speaker count accuracy and 33.04% diarization error rate (DER) owing to its arrival-time-sorted transformer architecture, which handles the acoustic confound between young female caregivers exhibiting aegyo and toddler speech. For pronunciation scoring, we compare three self-supervised learning (SSL) backbones across multiple pooling strategies. A cross-model ensemble routing consonant prediction to HuBERT-large and vowel prediction to WavLM-large achieves balanced accuracies of 0.720 and 0.845, with a mean of 0.782.
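Balanced accuracy matters here because correct and incorrect pronunciations are unlikely to be equally frequent. The sketch below uses the standard definition (mean of per-class recall) on made-up toy labels, not data from the paper, and then averages the two balanced accuracies quoted in the abstract.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# Toy labels (not from the paper): 1 = correct pronunciation, 0 = incorrect.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]  # a classifier that always predicts "correct"

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
toy_balanced = balanced_accuracy(y_true, y_pred)

# Mean of the consonant and vowel balanced accuracies quoted in the abstract.
reported_mean = (0.720 + 0.845) / 2
```

On the toy labels, the always-"correct" classifier looks good on plain accuracy but gets no credit on the minority class under balanced accuracy. The mean of the two quoted figures is 0.7825, which the abstract reports as 0.782.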

2606.10209 2026-06-10 cs.AI cs.LG cs.SE New submission

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

Abhilasha Lodha, Mahsa Pahlavikhah Varnosfaderani, Abir Chakraborty, Abhinav Mithal

AI summary: To address overly long context in enterprise tool-use scenarios, the paper proposes selectively retaining recent tool interactions and adding a compact summary, raising the completion rate on an expense itemization task from 71.0% to 91.6% while substantially reducing token consumption and runtime.

Details
Comments
17 pages, 3 figures, 8 tables
Abstract

Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.
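The pruning-plus-summary configuration can be sketched as a small history filter. This is an illustrative assumption about the mechanics, not the authors' code: the message schema (dicts with a `role` key, one `"tool"` entry per call/response pair), the function name `prune_context`, and the pluggable `summarize` callback are all hypothetical.

```python
def prune_context(messages, keep_pairs=5, summarize=None):
    """Keep non-tool messages and only the last `keep_pairs` tool interactions.

    Each message is a dict with a 'role' key; a 'tool' entry stands for one
    tool call/response pair (hypothetical schema). If `summarize` is given,
    the dropped interactions are replaced by a single summary message placed
    where the first dropped interaction used to be.
    """
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    cutoff = max(0, len(tool_idx) - keep_pairs)
    dropped = tool_idx[:cutoff]
    dropped_set = set(dropped)
    pruned = [m for i, m in enumerate(messages) if i not in dropped_set]
    if summarize is not None and dropped:
        summary = {"role": "summary",
                   "content": summarize([messages[i] for i in dropped])}
        # Everything before the first dropped index was kept, so positions match.
        pruned.insert(dropped[0], summary)
    return pruned
```

In the figures quoted above, pruning alone cuts token use from 1,480,996 to 535,274, roughly a 64% reduction, and the summary adds back only a small amount (553,374 tokens) while lifting completion from 79.0% to 91.6%.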

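2606.10208 2026-06-10 cs.RO cs.AI New submission

Exploration of Foundation Model-Based Robots in Patient and Elderly Care

Zhiwen Qiu, Wei Liu, Yuexing Hao

AI summary: Reviews the current state of foundation model-based care robots in terms of design features, user experience, and care outcomes, notes that current systems are used mostly for voice interaction with limited multimodality and physical autonomy, and calls for care-specific evaluation standards and responsible autonomy.

Details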
Abstract

Demand for older-adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrated into robots and interactive agents, with the promise of more flexible communication and personalized assistance. However, care settings require reliable and workflow-compatible systems with accountable human oversight, and it remains unclear whether current embodied systems can translate technical advances into clinical impact. This Perspective synthesizes foundation model-based care robots across three areas: design features, user experience, and evidence for care-related outcomes. Current systems most commonly use foundation models as conversational and reasoning layers within voice-centered socially assistive embodiments, while multimodal grounding and physical autonomy remain limited. Empirical evaluations report positive usability and engagement benefits, but reliability failures such as hallucinations and conversational breakdowns persist across the interaction pipeline. Evidence for care impact remains concentrated in proximal outcomes such as cognitive engagement and participation, with limited evidence for validated clinical or care-related changes. We argue that future research should transition toward care-specific evaluation standards, accountable autonomy, and integration into care workflows to support more responsive and responsible care technologies.

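2606.10199 2026-06-10 cs.LG cs.CL New submission

A Continuous-Time Markov Chain Framework for Insertion Language Models

Dhruvesh Patel, Benjamin Rozonoyer, Soumitra Das, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum

AI summary: Proposes a denoising framework for insertion language models based on continuous-time Markov chains that unifies existing methods, outperforms autoregressive and masked diffusion models on a planning task, and is competitive in language modeling while offering more flexible sampling.

Details
Comments
Accepted at AISTATS 2026. Code is available at https://github.com/dhruvdcoder/ctmc_dilm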
Abstract

Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.

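2606.10196 2026-06-10 cs.CV cs.AI New submission

Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

Ghodsiyeh Rostami, Po-Han Chen, Mahdi S. Hosseini

AI summary: Proposes FisherAdapTune, a framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry, freezing stabilized parameters to lower the generalization error bound while preserving the adaptation dynamics, and improving in-distribution performance and zero-shot transfer on a segmentation task.

Details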
Abstract

Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset; however, most existing methods choose this subset from fixed architectural heuristics rather than using dynamic, task-aware criteria. We introduce FisherAdapTune, a Fisher-guided Adaptive Fine-Tuning framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry. Starting from a PAC-Bayesian view of fine-tuning, we decompose the generalization error bound into Fisher-weighted update costs and show that parameter groups whose curvature contribution has stabilized can be frozen to reduce the error bound without interrupting the remaining adaptation dynamics. FisherAdapTune formulates this criterion with a scale-invariant Jensen-Shannon distance between consecutive Fisher distributions, yielding an adaptive active parameter set. We evaluate our approach on a downstream segmentation task, and results show FisherAdapTune improves the in-distribution performance and zero-shot transfer in multiple settings, validating that Fisher structural drift is a useful signal for efficient, task-aware adaptation. We release our code publicly at https://github.com/AtlasAnalyticsLab/FisherAdapTune to enable further application of our proposed approach.
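The freezing criterion can be illustrated with a short sketch. It assumes, as a guess at the construction, that each parameter group's Fisher estimates (for example, accumulated squared gradients) are normalized into a probability distribution, which is what makes the comparison scale-invariant; the threshold value and the names `stable_groups`, `prev_fisher`, and `curr_fisher` are hypothetical and not taken from the paper or its repository.

```python
import math

def normalize(fisher):
    """Turn nonnegative Fisher estimates into a probability distribution."""
    total = sum(fisher)
    return [f / total for f in fisher]

def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of JS divergence, base-2 logs, in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, mid) + 0.5 * kl(q, mid)))

def stable_groups(prev_fisher, curr_fisher, threshold=0.05):
    """Return groups whose Fisher distribution barely moved between checkpoints.

    prev_fisher / curr_fisher map a group name to a list of per-parameter
    Fisher estimates; groups under the (hypothetical) threshold are
    candidates for freezing.
    """
    return [g for g in curr_fisher
            if js_distance(normalize(prev_fisher[g]),
                           normalize(curr_fisher[g])) < threshold]
```

Normalizing first means a group whose Fisher values all grow by the same factor counts as unchanged; only a change in how curvature is spread across the group's parameters registers as drift.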

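2606.10194 2026-06-10 cs.LG cs.AI New submission

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

Muhammad Umer Sheikh, Hassan Abid, Khawar Shehzad, Ufaq Khan, Muhammad Haris Khan

AI summary: Introduces MMClima, a multimodal climate question answering framework with more than 100,000 expert-validated question-answer pairs covering text, video, and figures, for evaluating multimodal language models in climate science.

Details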
Abstract

Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We introduce MMClima, a large-scale multimodal climate question answering framework with 104k+ expert-validated question-answer pairs spanning articles, video transcriptions, and figures across five core climate science domains. MMClima is constructed via automated claim extraction and QA synthesis with human-in-the-loop validation to ensure both scale and reliability. Using MMClima, we benchmark state-of-the-art multimodal language models on tasks requiring factual recall, visual interpretation, and cross-modal synthesis. We additionally fine-tune on the textual split to produce mmclima-70b-txt, a domain-adapted baseline that outperforms strong open- and closed-source models on textual QA. We release the dataset, evaluation pipeline, fine-tuned model weights, and data creation framework to support standardized multimodal evaluation for climate science.

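2606.10184 2026-06-10 cs.LG cs.AI New submission

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

Wooil Jung

AI summary: To address the zero-advantage problem that deterministic trajectories cause for GRPO in continuous latent-reasoning models, the paper introduces stochasticity through structured dropout so that GRPO can optimize a Bayesian model-average policy, improving the accuracy of a Coconut baseline on GSM8K.

Details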
Abstract

Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is inherently deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO's progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, we generate essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method (including unbiasedness, variance reduction, and the well-definedness of the latent gradient) and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from 27.29% to 29.01% pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. Our work positions this as a practical, theoretically grounded approach for post-training latent-reasoning LLMs.
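The collapse described above, and how a per-rollout mask avoids it, can be shown with a toy sketch. The advantage follows the formula in the abstract; the recurrence (a `tanh` map), the mask scaling, and every function name are stand-in assumptions, not the Coconut model or the author's code.

```python
import math
import random

def group_advantages(rewards):
    """Group-mean advantage A^(k) = r^(k) - mu_r, as in the abstract."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

def sample_mask(dim, keep_prob, rng):
    """One Bernoulli keep/drop mask, sampled once per rollout."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(dim)]

def latent_rollout(h0, steps, mask, keep_prob):
    """Toy latent recurrence; the same mask is reused at every step."""
    h = list(h0)
    for _ in range(steps):
        h = [math.tanh(0.9 * x + 0.1) * m / keep_prob for x, m in zip(h, mask)]
    return h
```

Without dropout (an all-ones mask), every rollout from the same start is identical, so all rewards match and every advantage is zero. Sampling a different mask per rollout, and holding it fixed across the latent steps, makes the trajectories differ so the group can carry a learning signal.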

2606.10183 2026-06-10 cs.CV cs.AI cs.MM New submission

Making Time Editable in Video Diffusion Transformers

Konstantin Kuklev, Viacheslav Vasilev, Alexander Kunitsyn, Andrei Ivaniuta, Denis Dimitrov

AI summary: Proposes a temporal-control method that extends a pretrained DiT with a lightweight temporal module, enabling editing of motion speed and temporal structure without redesigning the backbone.

Details
Abstract

Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamics. We propose a temporal-control methodology that extends a pretrained DiT with explicit time editing, allowing control over motion speed and temporal structure without redesigning the backbone. Its core implementation augments the pretrained model with a lightweight temporal module, preserving the original generative prior while expanding its controllable dynamic range.

2606.10180 2026-06-10 cs.RO cs.AI cs.HC New submission

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

Jonathan C. Kao, Jason Chan, Andy Wang

AI summary: Proposes flow control, which steers VLA model actions with generic real-time inputs such as a keyboard, requires no retraining, and improves task success rate and completion speed.

Details
Comments
10 pages, 5 figures
Abstract

We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conformity to the action expert distribution) and high fidelity (reflecting the user's intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.

2606.10174 2026-06-10 cs.CV New submission

A Large Scale Open-Source Image and Video Dataset for Robust Wildfire Detection and Classification

Emadeldeen Hamdan, Yingyi Luo, B. Ugur Toreyin, Erdem Koyuncu, Adam J. Watts, Ugur Gudukbay, Ahmet Enis Cetin

AI summary: Introduces GWFP, a large-scale open-source wildfire image and video dataset, and combines multiple convolutional and transformer architectures with the HTE-ResNet method to achieve robust cross-domain detection.

Details
Abstract

Wildfire detection and monitoring are critical for mitigating fire spread and reducing environmental and infrastructural damage. In this work, we introduce GWFP (Global Wildfire Prevention Dataset), a large-scale, open-source dataset of wildfire images and videos designed to support early fire and smoke detection research. GWFP contains geographically diverse wildfire scenes, including flames, smoke, Waterdog/Fog environmental conditions, Near Infrared (NIR) imagery, Ember, and challenging negative samples collected from real-world scenarios worldwide. To evaluate dataset robustness and cross-domain generalization, we benchmark multiple convolutional and transformer-based architectures across both in-domain and cross-dataset settings. Additionally, we explore lightweight frequency-spatial feature interaction using Hadamard-enhanced residual connections (HTE-ResNet) to analyze representation robustness under domain-shift conditions. Experimental results demonstrate strong cross-dataset generalization and practical utility for real-world wildfire monitoring applications. The dataset and source code will be publicly released upon acceptance.

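2606.10173 2026-06-10 cs.CR cs.AI New submission

Local Is Not a Sufficient Privacy Boundary: Governing OS-Integrated On-Device AI

Jonghyun Chung, Sanket Badhe

AI summary: Proposes an OS-centered privacy framework that treats privacy as an institutional accountability problem rather than a deployment attribute, governing on-device AI through a threat model, a six-part privacy risk taxonomy, privacy-by-architecture controls, and a four-level audit rubric.

Details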
Abstract

As AI systems move into operating systems, privacy no longer turns only on whether a model runs locally. A local assistant may assemble email, calendar entries, files, screenshots, notifications, and app intents; retain embeddings or summaries; invoke tools; emit telemetry; or route difficult requests to cloud infrastructure. Local inference reduces some exposure, but it answers only one question: where computation occurs. It does not answer who may assemble context, what derived state persists, which actions are authorized, or how updates change the system's authority. We develop an OS-centered privacy framework for on-device AI that treats privacy as an institutional accountability problem rather than a deployment attribute. The framework specifies a threat model, a six-part privacy risk taxonomy, privacy-by-architecture controls, and a four-level audit rubric. We demonstrate the rubric through a documentation-bounded comparison of Apple Intelligence/Foundation Models, Android AICore/Gemini Nano, and Microsoft Recall. Meaningful privacy in on-device AI depends on constrained information flow, bounded authority, visible user control, and auditable governance across the operating-system lifecycle.

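2606.10170 2026-06-10 cs.LG New submission

Learning Entropy and Spatial Adaptation Dynamics of Multilayer Perceptrons for Structural Point Extraction

Jan Glaser, Ivo Bukovsky, Marcel Jirina

AI summary: Proposes Spatial Learning Entropy Maps (SLEM), which analyze the weight adaptation of an MLP as it learns from image samples to identify image points and regions that matter for network learning, offering a new perspective on feature extraction.

Details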
Abstract

This paper extends the concept of Learning Entropy (LE) from temporal adaptive systems to spatial learning in multilayer perceptron networks (MLPs) applied to image data. Instead of evaluating image structure directly from gradients or covariance operators, as local neighborhood methods do, the proposed approach analyzes the learning process itself through Learning Entropy. An MLP is trained to predict the intensity of a center pixel from its surrounding spatial context, while LE is evaluated from the incremental adaptation of neural weights during learning across image-derived samples. The resulting Spatial Learning Entropy Maps (SLEM) identify unusual image points and regions that induce strong adaptation of the neural network and therefore have an important role in the learning process. The results indicate that spatial Learning Entropy provides a complementary perspective to conventional feature extraction and explainability methods: it identifies image points and regions according to their learning impact, highlighting spatial locations that are particularly informative for network learning, rather than according to their local structural properties. The proposed framework may open new directions for learning-driven image or scene analysis in computer vision, manufacturing, and robotics.

2606.10167 2026-06-10 cs.CV 新提交

FlexPath: Learned Semantic Path Priors for Image-Based Planning

FlexPath: 基于图像规划的学习语义路径先验

Taehyoung Kim, Tim Schoenbrod, David Eckel, Henri Meeß

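AI summary: Proposes FlexPath, a two-stage framework that decouples the feasibility prior from preference and adapts to tasks through differentiable path shape objectives; it reduces search effort by 14.3% in shortest-path planning and supports zero-shot generalization and adaptation to multiple objectives.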

详情
AI中文摘要

最近基于学习的路径规划器使用神经网络处理视觉地图表示,并近似经典搜索算法的启发式,从而以更少的搜索代价获得接近最优的路径。然而,这些方法受限于其监督中隐含的最短路径目标,这限制了它们适应其他标准的灵活性。我们提出FlexPath,一个两阶段框架,将可行性与偏好解耦。在第一阶段,我们使用模仿学习从视觉地图输入中获取一个与任务无关的可行路径空间先验。在第二阶段,可微路径形状目标(PSOs)使该先验适应特定任务的标准,而无需重新学习路径结构,仅需高效的目标级适应。单个预训练模型可适应多个目标。对于最短路径规划,FlexPath在TMP上相比最先进的TransPath减少了14.3%的搜索代价,同时平均找到更低成本的路径,并在三个未见领域上展现出强大的零样本泛化能力。对于最小间隙距离为2的障碍物避让,它在保持低搜索代价的同时实现了96.8%的完全避障。该框架进一步通过目标级适应扩展到语义感知避让和航点引导,并在推理时与经典规划器兼容。数据和代码可在 https://github.com/FraunhoferIVI/FlexPath 获取。

英文摘要

Recent learning-based path planners use neural networks to process visual map representations and approximate heuristics for classical search algorithms, yielding near-optimal paths with reduced search effort. However, these methods are tied to the shortest-path objective implicit in their supervision, which limits their flexibility to accommodate alternative criteria. We introduce FlexPath, a two-stage framework that decouples feasibility from preference. In Stage 1, we use imitation learning to acquire a task-independent spatial prior over feasible paths from visual map inputs. In Stage 2, differentiable Path Shape Objectives (PSOs) adapt this prior toward task-specific criteria without relearning path structure, requiring only efficient objective-level adaptation. A single pretrained model can be adapted to multiple objectives. For shortest-path planning, FlexPath reduces search effort on TMP by 14.3% compared to the state-of-the-art TransPath, while also finding lower-cost paths on average and demonstrating strong zero-shot generalization across three unseen domains. For obstacle clearance with minimum clearance distance 2, it achieves 96.8% full obstacle avoidance while maintaining low search cost. The framework further extends to semantic-aware avoidance and waypoint guidance via objective-level adaptation, and remains compatible with classical planners at inference time. Data and code are available at https://github.com/FraunhoferIVI/FlexPath.

2606.10166 2026-06-10 cs.CV 新提交

Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization

融合卫星图像与平面地图的跨视角定位

Quang Long Ho Ngo, Zimin Xia, Alexandre Alahi

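AI summary: Proposes a module that fuses satellite imagery with planimetric maps through cross-modal conditioning and a patch-level fusion rule, reducing the mean localization error by 30.13%.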

详情
AI中文摘要

当前的跨视角定位方法主要依赖卫星图像作为空中模态。尽管近期工作探索了平面地图(如OpenStreetMap瓦片),但这些方法性能往往滞后。然而,两种模态都广泛可用且具有互补特性。卫星图像更接近地面相机图像,提供更精细的细节,而平面地图包含标注对象(如路灯),并在地面被遮挡(如树叶)的区域仍能提供信息。尽管如此,只有一项先前工作提供了融合这两种模态的端到端方法,且未展示其在最先进方法中的潜力。为结合两种模态的优势,我们提出一种新的融合模块,增强标准编码器,并证明将卫星图像与平面地图集成可改进最先进的单模态方法。该模块包括(i)跨模态条件化,处理每种模态编码时考虑另一种模态的信息,以及(ii)控制信息交换粒度的补丁级融合规则。我们取得了最先进的结果,将平均定位误差降低了30.13%。定性上,融合自适应地选择信息更丰富的模态,提高了整体准确性。

英文摘要

Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders, and we demonstrate that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality's encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.

2606.10159 2026-06-10 cs.CL cs.AI cs.CY cs.LG 新提交

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

操纵AI辅助同行评审对科学界构成新风险

Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

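AI summary: Finds that superficially rephrasing a manuscript's abstract can substantially manipulate AI review outcomes, with an attack success rate of about 38%; the attack is cheap and hard to distinguish from ordinary editing, and it could distort the fairness of scientific evaluation.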

详情
AI中文摘要

AI越来越多地被用于支持科学同行评审,从稿件筛选、评审辅助到编辑分类。尽管这类系统有望减轻评审负担并加速出版,但其对策略性操纵的鲁棒性仍知之甚少。本文表明,AI中介的同行评审容易受到一种简单、低成本的操纵:对稿件摘要进行表面改写。在不改变底层科学内容和交流方式,甚至不了解评审模型的情况下,对抗性重写的摘要显著改善了AI评审结果。我们在不同学科和出版场所,针对人类撰写和AI生成的论文都观察到了这一现象。我们最强的攻击实现了约38%的攻击成功率,将Gemini 3 Flash评审员的接受评分提高了+1.31,将GPT 5.4 Mini评审员的接受评分提高了+0.88(10分制)。当原始AI评审建议“拒绝”时,成功率升至50%以上。这种效应不仅限于总体分数膨胀,还增加了评审信心以及核心科学标准(如合理性、重要性和感知贡献)的得分。该攻击实用性强,仅需约5分钟和1美元即可完成一篇10页的AI会议投稿,且难以与普通科学编辑区分。膨胀的AI评审可能偏向下游人类决策,将编辑建议从拒绝转向接受。这些发现揭示了AI辅助科学评估中的一个普遍漏洞:当AI生成的评审影响编辑决策时,作者可能被激励优化稿件以迎合AI判断而非科学价值。我们的结果表明,在高风险的同行评审中,AI工具不应被视为中立的评估者,而应进行系统的鲁棒性测试、透明的保障措施和谨慎的人工监督。

英文摘要

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

2606.10156 2026-06-10 cs.IR cs.AI cs.CL 新提交

$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

$τ$-Rec:面向智能推荐系统的可验证基准

Bharath Sivaram Narasimhan, Karthik R Narasimhan

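AI summary: To address the subjectivity and high cost of evaluating multi-turn conversational agentic recommender systems, proposes the $τ$-Rec benchmark, which combines verifiable rewards, a reveal-tagged elicitation mechanism, and a pass^k reliability metric to test reasoning consistency systematically; the best current model reaches only about 57% at pass^1.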

详情
AI中文摘要

随着推荐系统向智能、多轮对话界面转变,评估范式难以跟上步伐。当前的基准通常依赖“LLM作为评判者”的评估,这引入了主观性、高成本和不一致性。我们提出了$τ$-Rec,一个用于智能推荐系统的基准,它用可验证奖励取代主观评估,并采用揭示标记引导(RTE)机制来控制任务约束在对话中如何呈现。通过针对结构化目录谓词测试智能体,并采用pass^k可靠性指标,$τ$-Rec为一致的推理提供了系统测试。我们对五个模型家族(GPT-5.4、Claude Sonnet 4.6、Gemini 2.5 Flash、DeepSeek V4 Flash、Qwen3-32B和GPT-5 mini)的九种配置进行了评估,揭示了一个陡峭的可靠性悬崖,即使是最好的模型在pass^1上也仅达到约57%,在pass^4上约38%,突显了当前对话智能体部署中的关键差距。所有代码和数据均公开于 https://github.com/nbharaths/tau-rec。

英文摘要

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
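
Illustration (not from the paper): the abstract does not define pass^k. The sketch below assumes the usual reading of the notation, namely the probability that all k independent trials of a task succeed, estimated per task from n trials with c successes as C(c, k) / C(n, k) and averaged over tasks. The function names and the input layout are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Unbiased estimate of P(all k i.i.d. trials succeed) for one task."""
    if not 0 <= num_successes <= num_trials or not 1 <= k <= num_trials:
        raise ValueError("require 0 <= successes <= trials and 1 <= k <= trials")
    # comb() returns 0 when fewer than k trials succeeded.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results, k):
    """Average the per-task estimate; results is a list of (trials, successes)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)
```

Under this definition the score can only fall as k grows: an agent that succeeds on 2 of 4 trials of a task scores 1/2 at k = 1 but 1/6 at k = 2, which is the kind of drop the abstract calls a reliability cliff.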

2606.10154 2026-06-10 cs.LG cs.CR 新提交

Quality Is Not a Safety Proxy Under Quantization

质量不是量化下的安全代理

Sahil Kadadekar

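AI summary: Finds that quality metrics for quantized checkpoints cannot substitute for direct safety testing, and proposes the Refusal Template Stability Index (RTSI) to flag dangerous rows.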

详情
Comments
21 pages, 6 figures. Preprint
AI中文摘要

量化检查点通常首先通过质量指标筛选,然后才进行直接安全测试(如果有的话)。本文在一个匹配的51行矩阵上审计了这一捷径,该矩阵涵盖6个模型、4个系列、7级GGUF梯度和AWQ/GPTQ INT4检查点。在这个矩阵中,捷径失败:所有36个质量-安全配对在模型间方向分裂,9个隐藏危险行加上1个接近隐藏危险行显示质量稳定或改善,而拒绝率下降12-68个百分点。11个AWQ/GPTQ行中有7个是隐藏危险。对17个Hugging Face支持的FP16/AWQ/GPTQ单元格进行的四探针机械后续研究未能挽救:熵、拒绝方向和校准探针是危险行的弱或零分离器,尽管探针识别的安全相关神经元整体上吸收了1.39倍的量化误差(p < 5×10^{-7}),但该效应并非特定于体系。Claude Sonnet 4重新标记了预定义分层集中的11,470个项目,与主要gemma3:12b判断者在89.9%的行上一致(κ=0.873,95% CI [0.866, 0.881]),并且改变了0/10个隐藏危险单元格。一个校准的研究内部行为筛选——拒绝模板稳定性指数(RTSI),由四个拒绝模板漂移特征构建并在该矩阵上校准——将10/10个隐藏或接近隐藏危险行路由到直接安全测试(Wilson 95% CI下限0.72),同时在样本内评分和行级留一验证下,将45个非基线行中的23个留在低风险桶中;在同一矩阵上,最佳单特征基线(唯一前缀率差、原始拒绝率差)在匹配桶大小下分别恢复9/10和8/10,跨堆栈传输需要重新校准。对于此处研究的量化检查点、模型系列和安全结果,保留的质量不能免除直接安全评估。

英文摘要

Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF ladder, and AWQ/GPTQ INT4 checkpoints. In this matrix the shortcut fails: all 36 quality-safety pairings split direction across models, and 9 hidden-danger rows plus 1 near-hidden-danger row show quality stable or improved while refusal falls by 12-68 percentage points. Seven of the 11 AWQ/GPTQ rows are hidden-danger. A four-probe mechanistic follow-up over the 17 Hugging Face-backed FP16/AWQ/GPTQ cells does not rescue it: entropy, refusal-direction, and calibration probes are weak or null separators of dangerous rows, and although probe-identified safety-associated neurons absorb 1.39$\times$ more quantization error overall ($p < 5 \times 10^{-7}$), the effect is not regime-specific. Claude Sonnet 4 relabels 11,470 items in a predefined stratified set, agrees with the primary gemma3:12b judge on 89.9% of rows ($κ = 0.873$, 95% CI [0.866, 0.881]), and changes 0/10 hidden-danger cells. A calibrated study-internal behavioral screen -- the Refusal Template Stability Index (RTSI), built from four refusal-template drift features and calibrated on this matrix -- routes 10/10 hidden- or near-hidden-danger rows to direct safety testing (Wilson 95% CI lower bound 0.72) while leaving 23 of 45 non-baseline rows in a low-risk bucket under both in-sample scoring and row-level leave-one-out validation; on the same matrix, the best single-feature baselines (unique-prefix-rate-delta, raw refusal-rate delta) recover 9/10 and 8/10 respectively at matched bucket size, and cross-stack transfer requires recalibration. For the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation.
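
Illustration (not from the paper): the Wilson lower bound quoted for the 10/10 routing result can be reproduced with the standard Wilson score interval. The sketch assumes z = 1.96 for a two-sided 95% interval; the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For 10 successes out of 10, the lower bound comes out at about 0.72, consistent with the figure in the abstract. It also shows why a perfect 10/10 on so few rows still leaves a wide interval.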

2606.10153 2026-06-10 cs.LG 新提交

Compositional Generative Modeling from Decentralized Data

来自分散数据的组合生成建模

Mashrur M. Morshed, Vishnu Naresh Boddeti

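AI summary: Proposes Decentralized Compositional Flow Matching (DCFM), a framework that uses structural constraints to compose generative factors across decentralized data without exchanging raw data; it substantially outperforms baselines in conditional image generation, robotic spatial planning, and medical attribute co-occurrence modeling.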

详情
Comments
ICML 2026
AI中文摘要

学习物理世界的组合性质需要联合观察相互作用因素。然而,由于实际数据通常是分散的,这些因素被碎片化地隔离在孤岛中。现有的去中心化生成方法仅关注对孤岛数据并集的建模,忽略了整体所隐含的新颖组合。为弥合这一差距,我们引入了去中心化组合流匹配(DCFM),这是一个在全局生成因子集上强制执行结构约束的框架,无需交换任何原始数据。DCFM使得新颖组合能够通过同伴交互涌现,即使没有单一数据源能独立支持该组合。实验上,DCFM在条件图像生成、机器人空间规划和医学属性共现建模中显著优于联邦学习和混合专家基线。

英文摘要

Learning the compositional nature of the physical world requires joint observation of interacting factors. However, because practical data is often decentralized, these factors are fragmented across isolated silos. Existing decentralized generative approaches focus only on modeling the union of siloed data, overlooking novel combinations implied by the collective whole. To bridge this gap, we introduce Decentralized Compositional Flow Matching (DCFM), a framework that enforces structural constraints across the global set of generative factors, without exchanging any raw data. DCFM enables novel combinations to emerge through peer interactions, even when no single data source can independently support the composition. Empirically, DCFM substantially outperforms federated learning and mixture-of-experts baselines across conditional image generation, robotic spatial planning, and medical attribute co-occurrence modeling.

2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD 新提交

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

从感知到决策:多模态大语言模型中听觉与视觉感知的信息流

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

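AI summary: Studies how audio and visual information is routed and integrated in audio-visual large language models (AVLLMs), identifies two routing patterns (a sequential flow and parallel streams), and shows that tokens can be discarded once their information has been transferred, improving efficiency.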

详情
Comments
40 pages, 29 figures
AI中文摘要

多模态大语言模型(MLLMs)能够听和看,但音频和视觉信号实际上如何通过网络传播以形成答案?尽管它们在研究和实际应用中的作用日益增长,但音频和视觉标记影响最终预测的内部路径仍然知之甚少。在本研究中,我们考察了音频-视觉大语言模型(AVLLMs)内部的音视频信息流,追踪了AVLLMs如何在两种输入配置(音视频视频和多个交错音视频项目)下路由、利用和整合音频与视觉信息。我们发现,对于音视频视频,AVLLMs遵循为VLMs和VideoLLMs建立的顺序信息流路径,音频和视觉贡献沿着该路径按任务对每种模态的依赖程度成比例流动。在多个交错音视频项目的设置中,这种路由转变为不同的并行流。此外,我们证明,一旦音频-视觉和其他类型的标记的信息被传递到LLM,它们可以被丢弃,对模型的预测影响最小甚至略有改善,这适用于多个任务和数据集,从而实现更高效的推理。这些发现适用于多个模型和规模,包括3B和7B规模的Qwen2.5-Omni和Video-SALMONN2 Plus,从而产生了关于这些流结构为何出现的假设。总之,这些结果首次清晰地描绘了AVLLMs如何在网络内部协调声音和视觉,并为音频-视觉及更广泛的MLLMs在可解释性、设计和效率方面的下一波进展奠定了基础。

英文摘要

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets and enabling more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales), leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2606.10142 2026-06-10 cs.CV 新提交

DB-3DME: From Dataset to Benchmark for Human-aligned Automatic 3D Mesh Evaluation

DB-3DME:从数据集到基准测试,实现与人类对齐的自动3D网格评估

Nanshan Jia, Zhenyu Zhao, Sui Huang, Jingshen Wang, Zeyu Zheng

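AI summary: Presents DB-3DME, a dataset and benchmark of 2,619 synthetic 3D meshes with human ratings; fine-tuning the visual encoder of a VLM on it yields evaluation performance that substantially exceeds existing models.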

详情
Comments
CVPR 2026 workshop paper. 10 pages, 3 figures, 6 tables. Dataset available at GitHub and Hugging Face
AI中文摘要

近年来3D生成的进展在真实性、可控性和效率上取得了显著提升,但3D资产的评估仍未被充分探索。现有的评估范式,包括人工评估、学习指标和视觉语言模型(VLM)作为评判者,在成本、可扩展性、分辨率处理或任务特定对齐方面存在局限性。在这项工作中,我们专注于3D网格评估,并引入了DB-3DME,即用于3D网格评估的数据集和基准。DB-3DME包含2,619个合成3D网格,并配有关于几何和提示遵从性的人工评分。利用该数据集,我们系统地基准测试了最先进的VLM,并发现3D表示的视觉编码是与人类对齐的评估性能的关键因素。受此发现启发,我们通过调整视觉编码器同时冻结语言模型,微调了一个开放权重的VLM——Qwen-2.5-VL-7B,用于3D网格评估。微调后的模型在多个评估维度上显著优于现有的预训练VLM,为自动3D网格评估建立了新的基准。我们在GitHub和Hugging Face上公开发布了基准数据集,以促进未来的研究。

英文摘要

Recent advances in 3D generation have led to substantial improvements in realism, controllability, and efficiency, yet the evaluation of 3D assets remains underexplored. Existing evaluation paradigms, including human evaluation, learned metrics, and vision-language models (VLMs) as judges, suffer from limitations in cost, scalability, resolution handling, or task-specific alignment. In this work, we focus on 3D mesh evaluation and introduce DB-3DME, the Dataset and Benchmark for 3D Mesh Evaluation. DB-3DME contains 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. Using this dataset, we systematically benchmark state-of-the-art VLMs and identify visual encoding of 3D representations as a key factor for human-aligned evaluation performance. Motivated by this finding, we fine-tune an open-weight VLM, Qwen-2.5-VL-7B, for 3D mesh evaluation by adapting the visual encoder while freezing the language model. The fine-tuned model substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions, establishing a new benchmark for automatic 3D mesh evaluation. We publicly release the benchmark dataset on GitHub and Hugging Face to facilitate future research.

2606.10137 2026-06-10 cs.LG 新提交

Ambiguous Strategic Classification

模糊策略分类

Ivri Hikri, Nir Rosenfeld

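AI summary: Studies how a learner jointly optimizes a classifier and the uncertainty surrounding it when regulation requires partial disclosure, introducing the notion of ambiguity and developing efficient algorithms.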

详情
AI中文摘要

策略分类中的一个常见假设是分类器是公开的。然而,系统是否会选择完全披露,以及为什么,仍不清楚。我们研究了一个监管要求系统披露部分(而非全部)信息的设置。这引发了一个学习任务,其中学习器必须联合优化分类器及其周围的不确定性。为此,我们从稳健机制设计中采纳了模糊性概念,在我们的设置中,这允许学习器揭示一组或一系列可能的分类器,同时私下选择最终实现哪一个。我们研究了模糊性如何影响学习任务,开发了计算最佳响应和训练的高效算法,并通过我们的方法在新设置中实证探索了策略学习及其结果。

英文摘要

A common assumption in strategic classification is that the classifier is public knowledge. However, it remains unclear whether, and why, a system would choose to commit to full disclosure. We study a setting in which regulation requires the system to disclose some, but not all, of the information. This induces a learning task in which the learner must jointly optimize the classifier and the uncertainty surrounding it. To this end, we adopt from robust mechanism design the notion of ambiguity, which in our setting allows the learner to reveal a set or range of possible classifiers, while privately choosing which of them to ultimately realize. We investigate how ambiguity affects the learning task, develop efficient algorithms for computing best responses and for training, and empirically explore strategic learning and its outcomes in this novel setting using our approach.

2606.10136 2026-06-10 cs.CV 新提交

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

iSAGE: 一种通过稀疏点监督进行遥感语义分割的人机协同框架

Osmar Luiz Ferreira de Carvalho, Osmar Abilio de Carvalho Junior, Anesmar Olino de Albuquerque, Daniel Guerreiro e Silva

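AI summary: Proposes iSAGE, a framework in which experts click on pixels the model gets wrong rather than on arbitrary pixels; without auxiliary machinery it approaches or matches dense supervision on the BsB Aerial and ISPRS Vaihingen datasets at extremely low annotation rates.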

详情
Comments
47 pages, 8 tables, 6 figures
AI中文摘要

遥感中的语义分割需要昂贵的像素级标注,且由于模型很少能在传感器、平台或地理区域间迁移,几乎每个问题都需要新的数据集。现有的人机协同框架通过辅助机制(伪标签、传播、CRF、基础模型提示、辅助头)将稀疏点击扩展为密集监督,这些机制均基于模型的预测分布。在该分布中,一个自信的错误像素与一个自信的正确像素在结构上无法区分,因此任何读取该分布的规则都无法区分两者;区分信号位于模型外部。本文假设,专家针对模型错误(而非任意像素)的点击足以匹配密集监督,无需扩展机制。iSAGE(基于专家指导的迭代稀疏标注)在一个集成的开源平台上实现了这一假设,其中错误加权损失放大了每次点击的梯度,而标注记录本身即为数据集,可扩展、可纠正、可审计。实验采用最小努力策略:每帧每类最多一个标注像素。在BsB Aerial上,iSAGE恢复了密集监督的97.2%(在0.040%的像素上达到74.79% mIoU),并呈现出对比性的类别动态:无定形类别(渗透区域)从种子点开始饱和,而小类别(汽车)需要后期迭代的努力。在ISPRS Vaihingen(外部基准)上,iSAGE以0.011%的像素达到76.78% mIoU,匹配密集基线(76.65%)并超越所有已发表方法。在相同流程下,四种输出读取机制(预算1-100倍的oracle熵、阈值0.90-0.99的伪标签、基于CRF的传播、均匀随机)比iSAGE低7.4至14.5个百分点。在调查的31种方法中,iSAGE是唯一无需辅助机制即可运行的迭代式人机协同框架。

英文摘要

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1--100x, pseudo-labels across thresholds 0.90--0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

2606.10129 2026-06-10 cs.LG cs.NE 新提交

Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

使用深度强化学习发现可解释的进化算法多参数控制策略

Tai Nguyen, Phong Le, Carola Doerr, Nguyen Dang

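AI summary: To address the lack of interpretable multi-parameter control policies for evolutionary algorithms, combines deep reinforcement learning with action-space decomposition, reward shifting, and long-horizon discounting, then distills a symbolic control policy that outperforms existing baselines on OneMax.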

详情
Comments
arXiv admin note: text overlap with arXiv:2505.12982
AI中文摘要

虽然深度强化学习(deep-RL)已越来越多地应用于进化算法中的参数控制,但由于难以推导出适合形式化研究的有效、可解释的多参数策略,参数控制的严格理论分析在很大程度上仍局限于单参数设置。我们展示了如何利用深度强化学习克服这一障碍,以优化OneMax的(1+($\lambda$,$\lambda$))-遗传算法作为代表性案例研究,这是少数几个已正式证明动态控制具有超常数加速的问题之一。我们首先表明标准方法在这种多参数设置下难以收敛,并引入算法无关的增强技术,针对动作空间分解、奖励平移和长期折扣。在这些技术到位后,我们比较了常见的深度强化学习方法,发现双深度Q网络(Double Deep Q-Networks)独特地避免了近端策略优化(Proximal Policy Optimization)中观察到的策略崩溃,从而产生适合下游分析的轨迹。至关重要的是,我们通过将学习到的行为蒸馏为透明的符号控制策略,超越了神经网络的“黑箱”性质。由此产生的策略不仅为未来的理论分析提供了可解释性,而且表现出卓越的性能,在广泛的问题规模上始终优于现有基线。

英文摘要

While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, owing to the difficulty of deriving effective, interpretable multi-parameter policies amenable to formal study. We demonstrate how deep-RL can be leveraged to overcome this barrier, using the (1+($λ$,$λ$))-genetic algorithm optimizing OneMax, one of the few problems where a super-constant speedup of dynamic control has been formally proven, as a representative case study. We first show that standard approaches struggle to converge in this multi-parameter setting, and introduce algorithm-agnostic enhancements targeting action-space decomposition, reward shifting, and long-horizon discounting. With these in place, we compare common deep-RL methods and find that Double Deep Q-Networks uniquely avoid the policy collapse observed in Proximal Policy Optimization, yielding trajectories suitable for downstream analysis. Crucially, we move beyond the "black-box" nature of neural networks by distilling the learned behaviors into a transparent, symbolic control policy. The resulting policy not only offers interpretability for future theoretical analysis but also yields exceptional performance, consistently outperforming existing baselines across a wide range of problem sizes.
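
Illustration (not from the paper): a minimal sketch of the algorithm being controlled, the (1+(λ,λ)) genetic algorithm on OneMax, with a pluggable policy that picks λ from the current fitness. It uses the standard coupling of the parameters (mutation rate λ/n, crossover bias 1/λ), so only one parameter is controlled here, whereas the paper controls several. The `sqrt_policy` rule, λ = sqrt(n / (n - f(x))) rounded up, is a fitness-dependent choice known from the theory literature and stands in for a learned or distilled policy. All function names are hypothetical.

```python
import math
import random


def onemax(bits):
    """Fitness: number of one-bits."""
    return sum(bits)


def one_plus_lambda_lambda_step(parent, lam, rng):
    """One generation with mutation rate lam/n and crossover bias 1/lam."""
    n = len(parent)
    # Mutation phase: every mutant flips the same number of random bits.
    num_flips = sum(rng.random() < lam / n for _ in range(n))
    mutants = []
    for _ in range(lam):
        child = parent[:]
        for i in rng.sample(range(n), num_flips):
            child[i] ^= 1
        mutants.append(child)
    best_mutant = max(mutants, key=onemax)
    # Crossover phase: take each bit from the best mutant with prob. 1/lam.
    offspring = [
        [m if rng.random() < 1 / lam else p for p, m in zip(parent, best_mutant)]
        for _ in range(lam)
    ]
    best_child = max(offspring, key=onemax)
    # Elitist selection: keep the child only if it is not worse.
    return best_child if onemax(best_child) >= onemax(parent) else parent


def sqrt_policy(fitness, n):
    """Stand-in control policy: lam = ceil(sqrt(n / (n - fitness)))."""
    return max(1, math.ceil(math.sqrt(n / (n - fitness))))


def run(n, policy, seed=0, max_generations=10_000):
    """Run until the optimum or the generation cap; return fitness history."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n)]
    history = [onemax(parent)]
    for _ in range(max_generations):
        if history[-1] == n:
            break
        parent = one_plus_lambda_lambda_step(parent, policy(history[-1], n), rng)
        history.append(onemax(parent))
    return history
```

Swapping `sqrt_policy` for another callable with the same signature is enough to compare control rules on the same random seed.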

2606.10126 2026-06-10 cs.CL cs.AI cs.CY cs.LG 新提交

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

帕累托引导的教师对齐用于公平个性化文本生成

Tunazzina Islam

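AI summary: Proposes a Pareto-guided teacher alignment framework that uses revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and preference optimization to reduce demographic disparities while preserving personalization fidelity; experiments show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains.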

详情
AI中文摘要

个性化说服性文本生成可以提高相关性和参与度,但人口统计条件也可能引入跨群体的不平等框架。我们将个性化生成中的公平缓解研究为一个受约束的多目标对齐问题:在保持个性化保真度的同时减少人口统计差异。我们提出一个帕累托引导的教师对齐框架,结合了基于修订的候选生成、对感知可行性门控、帕累托风格的候选选择,以及通过监督微调和直接偏好优化的可选偏好优化。我们在气候变化和疫苗接种说服任务上评估该框架,使用一个受控的上下文丰富的人口统计网格(匹配性别和年龄对)以及一个统一的五审计评估套件,涵盖说服偏见、形式差异、情感框架差异、词汇关联差异和个性化保真度。在两个领域和跨族系迁移设置中,没有单一的对齐策略能同时主导所有目标。相反,方法占据了公平-个性化帕累托前沿的不同区域:一些方法实现更强的差异减少,而另一些则更好地保持个性化或人口统计稳定性。我们的结果表明,公平缓解效果依赖于目标,并在领域和模型族系间不一致地迁移,这促使在公平敏感的个性化生成中采用有界回归、多审计模型选择而非单指标优化。

英文摘要

Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.
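
Illustration (not from the paper): the abstract does not spell out its Pareto-style candidate selection. A minimal sketch of the underlying idea is a non-dominated filter over two objectives, assuming each candidate carries a disparity score (lower is better) and a personalization-fidelity score (higher is better). The tuple layout and function name are hypothetical.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is (name, disparity, fidelity). A candidate dominates
    another if it is no worse on both objectives and strictly better on
    at least one.
    """
    def dominates(a, b):
        no_worse = a[1] <= b[1] and a[2] >= b[2]
        strictly_better = a[1] < b[1] or a[2] > b[2]
        return no_worse and strictly_better

    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

Several candidates can survive the filter at once, which matches the abstract's observation that methods occupy different regions of the frontier and that no single strategy dominates every objective.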

2606.10124 2026-06-10 cs.LG cs.AI 新提交

FedSteer: Taming Extreme Gradient Staleness in Federated Learning with Corrective Projections and Caching

FedSteer: 通过校正投影和缓存驯服联邦学习中的极端梯度陈旧性

Haoran Zhang, Cainã Figueiredo Pereira, Marie Siew, Xutong Liu, Carlee Joe-Wong, Rachid El-Azouzi

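AI summary: To address gradient staleness caused by uneven client participation in federated learning, proposes FedSteer, which builds a subspace from a cache of client gradients and corrects stale gradients through projection and selective caching, markedly improving training stability and accuracy.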

详情
Comments
UAI 2026
AI中文摘要

联邦学习(FL)在客户端不持续参与训练轮次时,常遭受聚合方差的影响。虽然重用非活跃客户端的陈旧模型更新是减少这种方差的常见技术,但我们发现,在客户端参与偏斜的情况下,由此产生的更新陈旧性可能变得严重到足以破坏训练稳定性。为了解决这个问题,我们提出了FedSteer,一种新颖的方法,该方法从最近客户端梯度的缓存中构建一个梯度子空间,作为当前优化景观的低维表示。FedSteer将活跃客户端的真实梯度投影到这个子空间上,以找到一组最优坐标。对于非活跃客户端,FedSteer重用这些坐标,并结合由其他活跃客户端漂移的已演化的子空间。这个过程有效地将过时的梯度“引导”向当前的全局目标。此外,还辅以选择性缓存策略,识别代表性客户端子集以形成子空间,从而减少服务器内存。实验表明,FedSteer显著优于基线,在挑战性场景中防止性能崩溃,并在其他场景中实现超过7%的精度提升。

英文摘要

Federated learning (FL) is often subject to aggregation variance if clients do not consistently participate in training rounds. While reusing stale model updates from inactive clients is a common technique to reduce this variance, we find that with skewed client participation, the resulting update staleness can become severe enough to destabilize training. To remedy this, we propose FedSteer, a novel method that constructs a gradient subspace from a cache of recent client gradients to serve as a low-dimensional representation of the current optimization landscape. FedSteer projects an active client's true gradient onto this subspace to find a set of optimal coordinates. For an inactive client, FedSteer reuses these coordinates with the now-evolved subspace drifted by other active clients. This process effectively "steers" outdated gradients toward the current global objective. This is complemented by a selective caching strategy that identifies a representative client subset to form the subspace, reducing server memory. Experiments demonstrate that FedSteer significantly outperforms baselines, preventing performance collapse in challenging scenarios while delivering accuracy gains of over 7% in others.
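
Illustration (not from the paper): the abstract does not say how the subspace is built or how coordinates are matched across rounds. The sketch below only shows the general idea of projecting a gradient onto a subspace spanned by cached gradients and later re-expressing the stored coordinates in an updated subspace. It assumes a Gram-Schmidt basis and matches coordinates to basis vectors by index; both are illustrative assumptions, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(gradients, tol=1e-12):
    """Gram-Schmidt basis for the span of the cached gradient vectors."""
    basis = []
    for g in gradients:
        residual = list(g)
        for q in basis:
            coeff = dot(residual, q)
            residual = [r - coeff * x for r, x in zip(residual, q)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([r / norm for r in residual])
    return basis


def project(gradient, basis):
    """Coordinates of the least-squares projection onto span(basis)."""
    return [dot(gradient, q) for q in basis]


def reconstruct(coords, basis):
    """Gradient expressed by coords in the given basis (matched by index)."""
    dim = len(basis[0])
    return [sum(c * q[i] for c, q in zip(coords, basis)) for i in range(dim)]
```

A round in which a client is active would store `project(g, basis)`; a later round in which it is inactive would call `reconstruct` with the stored coordinates and the basis rebuilt from the current cache, so the reused update points along the directions other clients have moved toward.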

2606.10115 2026-06-10 cs.CV 新提交

Improving PET/CT-Based Whole-Body Lesion Segmentation Using Prediction Uncertainty-Augmented Models

利用预测不确定性增强模型改进PET/CT全身病灶分割

Bashirul Azam Biswas, Biratal Raj Wagle, Zhihan Yang, Marc A. Seltzer, Matthew E. Maeder, James B. Yu, Indrani Bhattacharya

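AI summary: Proposes an uncertainty-aware framework that combines Bayesian ensembling, voxel-wise uncertainty quantification, and uncertainty-augmented training to improve the robustness and lesion detection of whole-body PET/CT lesion segmentation, evaluated on the AutoPET-III and Deep-PSMA datasets.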

详情
Comments
32 pages, 10 figures, 5 tables
AI中文摘要

准确的全身正电子发射断层扫描(PET)/计算机断层扫描(CT)病灶分割对于癌症分期和治疗计划至关重要。PET提供不同放射性示踪剂的功能代谢信息,而CT提供解剖定位。由于细微的影像特征、混杂因素和读者间差异,从PET/CT影像中勾画病灶在临床上具有挑战性。现有的深度学习方法存在训练随机性、预测不一致、高肿瘤负荷病例中病灶遗漏以及缺乏不确定性量化等问题,限制了其临床可靠性。以nnU-Net为基线,我们提出了一种用于全身PET/CT病灶分割的不确定性感知框架,该框架整合了(1)贝叶斯集成以减少训练随机性,(2)具有认知和偶然分解的体素级不确定性量化,以及(3)认知不确定性增强训练以提高病灶检测。使用两个公开数据集AutoPET-III(1,611次扫描)和Deep-PSMA(200次扫描),包含多种癌症类型的FDG和PSMA研究,进行训练和评估。在未见过的AutoPET-III测试集上,贝叶斯集成相比确定性nnU-Net模型提高了鲁棒性和性能。不确定性图突出了模型不一致的区域,并与错误分类(尤其是假阳性)相关。不确定性增强训练以增加假阳性体积为代价提高了病灶恢复,反映了精确率-召回率的权衡。一种病例自适应路由策略通过在基础模型和增强模型之间进行选择,进一步提高了Dice系数。据我们所知,这是第一项在多示踪剂、泛癌种PET/CT分割中系统研究不确定性量化,并将贝叶斯集成与不确定性感知建模相结合的工作。

英文摘要

Accurate lesion segmentation from whole-body Positron Emission Tomography (PET)/Computed Tomography (CT) scans is essential for cancer staging and treatment planning. PET provides functional metabolic information with different radiotracers, while CT offers anatomical localization. Lesion delineation from PET/CT imaging is clinically challenging due to subtle imaging features, confounders, and inter-reader variability. Existing deep learning approaches suffer from training-related stochasticity, inconsistent predictions, and missed lesions in high tumor-burden cases, and they lack uncertainty quantification, limiting their clinical reliability. Using nnU-Net as a baseline, we propose an uncertainty-aware framework for whole-body PET/CT lesion segmentation that integrates (1) Bayesian ensembling to reduce training stochasticity, (2) voxel-wise uncertainty quantification with epistemic and aleatoric decomposition, and (3) epistemic uncertainty-augmented training to improve lesion detection. Two public datasets, AutoPET-III (1,611 scans) and Deep-PSMA (200 scans), comprising FDG and PSMA studies across multiple cancer types, are used for training and evaluation. Bayesian ensembling improves robustness and performance over deterministic nnU-Net models on the unseen AutoPET-III test set. Uncertainty maps highlight regions of model disagreement and correlate with misclassifications, particularly false positives. Uncertainty-augmented training improves lesion recovery at the cost of increased false-positive volume (FPVol), reflecting a precision-recall trade-off. A case-adaptive routing strategy further improves Dice by selecting between the base and augmented models. To our knowledge, this is the first study to systematically investigate uncertainty quantification in multi-tracer, pan-cancer PET/CT segmentation and to combine Bayesian ensembling with uncertainty-aware modeling for this task.
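
Illustration (not from the paper): the abstract does not state how the epistemic and aleatoric parts are computed. A common entropy-based decomposition for an ensemble, assumed here, treats the entropy of the averaged prediction as total uncertainty, the average of the members' entropies as the aleatoric part, and their difference (the mutual information) as the epistemic part. The sketch handles a single voxel with binary lesion/background probabilities; function names are hypothetical.

```python
from math import log


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log(p) + (1 - p) * log(1 - p))


def decompose_uncertainty(member_probs):
    """Split one voxel's uncertainty given each member's lesion probability.

    Returns (total, aleatoric, epistemic) with total = aleatoric + epistemic.
    """
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    # Non-negative in exact arithmetic; clamp to absorb rounding error.
    epistemic = max(0.0, total - aleatoric)
    return total, aleatoric, epistemic
```

Members that all predict 0.5 give purely aleatoric uncertainty, while members that confidently disagree (for example 0.0 and 1.0) give purely epistemic uncertainty, which is the disagreement signal the abstract associates with misclassified regions.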