arXivDaily: arXiv daily academic digest, updated Monday through Friday
2507.13595 2026-06-10 cs.CV Version update

NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision

Tengkai Wang, Weihao Li, Ruikai Cui, Shi Qiu, Nick Barnes

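Institutions: University of Cambridge

AI summary: Proposes NoiseSDF2NoiseSDF, which learns clean neural SDFs from noisy point clouds by minimizing the MSE loss between noisy SDF representations, achieving implicit denoising and surface refinement.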

Comments
16 pages, 7 figures

Abstract

Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.
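
The Noise2Noise principle that the abstract builds on can be illustrated in one dimension: when a predictor is fit by MSE against targets corrupted with independent zero-mean noise, the minimizer approaches the clean value. The following is a toy sketch, not the authors' pipeline; the clean SDF value, the noise level, and the helper name `fit_constant_by_mse` are assumptions made for illustration.

```python
import random


def fit_constant_by_mse(noisy_targets, lr=0.1, steps=500):
    """Fit one scalar to noisy targets by gradient descent on the MSE."""
    estimate = 0.0
    n = len(noisy_targets)
    for _ in range(steps):
        # Gradient of (1/n) * sum((estimate - t)^2) with respect to estimate.
        grad = 2.0 * sum(estimate - t for t in noisy_targets) / n
        estimate -= lr * grad
    return estimate


rng = random.Random(0)
clean_sdf = 0.25  # hypothetical clean signed distance at one query point
# Only noisy observations are used as supervision (zero-mean noise assumed).
noisy_targets = [clean_sdf + rng.gauss(0.0, 0.1) for _ in range(2000)]
estimate = fit_constant_by_mse(noisy_targets)
print(f"clean={clean_sdf:.3f} estimate={estimate:.3f}")
```

The fitted scalar converges to the mean of the noisy targets, which is why no clean target is needed as long as the noise averages out.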

2505.14608 2026-06-10 cs.CL cs.AI cs.LG Version update

Attacks on Machine-Text Detectors Retain Stylistic Fingerprints

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

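Institutions: University of California, Berkeley

AI summary: Studies the limits of adversarial attacks on machine-text detectors and proposes a paraphrasing method that jointly optimizes for undetectability and adherence to a specific human style; finds that single-document detection is unreliable and that multi-document analysis is needed.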

Abstract

Despite considerable progress in the development of machine-text detectors, the ease with which machine-text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against machine-detection attacks? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that, unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.

2509.04027 2026-06-10 cs.AI cs.CL Version update

Why Does Reasoning Length Converge? Unveiling the Underfitting-Overfitting Trade-off in Chain-of-Thought

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

Zeyu Gan, Hao Yi, Yong Liu

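AI summary: Proposes the CoT-Space theoretical framework, which recasts reasoning from a discrete token-prediction task into an optimization process in a continuous, reasoning-level semantic space, and shows that the convergence to an optimal CoT length in test-time scaling is a natural consequence of the fundamental trade-off between underfitting and overfitting; reinforcement learning is used to elicit and verify these results.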

Comments
Preprint Edition

Abstract

Test-time scaling, primarily manifested through multi-step Chain-of-Thought (CoT) reasoning via Reinforcement Learning (RL), has emerged as a pivotal paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists: traditional token-level analysis fails to capture the macroscopic dynamics of reasoning-level scaling. To address this, we introduce CoT-Space, a novel theoretical framework that recasts the reasoning process from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. By modeling the reasoning trajectory from both noise and risk perspectives and revitalizing foundational principles from classical learning theory, we demonstrate that the observed convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. We further utilize RL as a tool to elicit and verify these results in our experiments. Our findings provide a mechanistic explanation for the internal test-time scaling via RL, offering a principled theoretical foundation to optimize reasoning trajectories in modern LLMs.

2509.19936 2026-06-10 cs.CV Version update

CapStARE: Capsule-based Sequential Architecture for Robust and Efficient Gaze Estimation

Miren Samaniego, Igor Rodriguez, Elena Lazkano

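Institutions: University of the Basque Country

AI summary: Proposes CapStARE, which combines a frozen ConvNeXt backbone, attention-routed capsules, and dual GRU decoders to achieve real-time, high-accuracy gaze estimation on ETH-XGaze and other datasets, balancing spatial robustness and computational efficiency.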

Comments
Preprint for Pattern Recognition Journal

Abstract

Human gaze estimation is essential for applications such as human-computer interaction, social robotics, and assistive systems. However, achieving accurate, interpretable, and real-time performance in unconstrained environments remains challenging. Existing appearance-based methods often face trade-offs between spatial robustness, computational efficiency, and effective use of contextual information. To address this, we introduce CapStARE, a capsule-based architecture that combines a frozen ConvNeXt backbone for efficient feature extraction, capsule formation with attention-based routing for structured facial reasoning, and dual GRU decoders for lightweight sequential modeling over short-horizon observation windows. This design preserves interpretable part-whole facial relationships while improving prediction stability through local contextual consistency. Experimental results demonstrate strong performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65), while also generalizing competitively on Gaze360 (9.06), all with real-time inference (<10 ms). These findings suggest that the proposed method provides a practical and robust framework for appearance-based gaze estimation in real-world interactive environments. The related code and experimental results are publicly available at: https://github.com/toukapy/capsStare

2509.17251 2026-06-10 stat.ML cs.LG Version update

Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization

Jingfeng Wu, Peter L. Bartlett, Sham M. Kakade, Jason D. Lee, Bin Yu

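Institutions: University of California, Berkeley; Harvard University; Google DeepMind

AI summary: Provides instance-wise comparisons of the finite-sample risks of gradient descent, ridge regression, and stochastic gradient descent in linear regression; finds that GD dominates ridge regression but is incomparable with SGD, and that GD can be worse on some problems.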

Comments
Accepted for presentation at the Conference on Learning Theory (COLT) 2026

Abstract

Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of that of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems -- those with fast and continuously decaying covariance spectra -- which includes all problems satisfying the standard capacity condition.
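
One standard way to see how early-stopped GD and ridge regularize differently is the spectral-filter view: along an eigen-direction of the sample covariance with eigenvalue s, GD on the loss (1/2n)||Xw - y||^2 started from zero recovers a fraction 1 - (1 - eta*s)^t of the least-squares solution after t steps, while ridge with penalty lam recovers s / (s + lam). The sketch below compares the two filters; the spectrum, the step size, and the pairing lam = 1/(eta*t) as "comparable regularization" are assumptions for illustration and are not taken from the paper's analysis.

```python
def gd_filter(eigenvalue, step_size, num_steps):
    """Fraction of the least-squares solution GD from zero recovers along one eigen-direction."""
    return 1.0 - (1.0 - step_size * eigenvalue) ** num_steps


def ridge_filter(eigenvalue, ridge_penalty):
    """Fraction of the least-squares solution ridge keeps along one eigen-direction."""
    return eigenvalue / (eigenvalue + ridge_penalty)


# Hypothetical polynomially decaying covariance spectrum.
spectrum = [1.0 / (i ** 2) for i in range(1, 9)]
step_size, num_steps = 0.5, 40
# Assumed pairing between stopping time and ridge penalty.
ridge_penalty = 1.0 / (step_size * num_steps)

gd_factors = [gd_filter(s, step_size, num_steps) for s in spectrum]
ridge_factors = [ridge_filter(s, ridge_penalty) for s in spectrum]
for s, g, r in zip(spectrum, gd_factors, ridge_factors):
    print(f"eigenvalue={s:.4f} gd={g:.3f} ridge={r:.3f}")
```

Under this pairing GD shrinks each direction no more than ridge does, which gives some intuition for why the two forms of regularization need not yield the same risk.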

2509.16518 2026-06-10 cs.CV cs.AR Version update

FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar

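Institutions: University of California, Berkeley

AI summary: To address the high computational cost of attention layers in video diffusion models, proposes FG-Attn, a low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of MxN tiles, achieving speedups of up to 2.45x.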

Abstract

Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps offers a promising opportunity to reduce this cost. In this work, we show that attention maps in diffusion transformers exhibit significant fine-grained sparsity in video generation models. Existing sparse attention methods, however, are too coarse-grained, leaving a large fraction of redundant computation unaddressed, or incur high overheads at finer granularity. We propose FG-Attn, a novel, low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of an MxN tile, where N>=1 and M>=16, and where each tile is the result of query-key dot products between M queries and N keys. FG-Attn addresses the key challenge of hardware underutilization in sparse attention kernels on GPUs, without incurring the overheads of irregular memory access and redundant operations. FG-Attn can fully supersede existing sparse attention methods and extend block sparse attention methods to finer granularities on modern GPUs. At 70% sparsity, FG-Attn is up to 2.45X faster than the state-of-the-art FlashInfer, and reduces attention kernel time by 14.7% on average. FG-Attn speeds up end-to-end video generation times by up to 1.40X (1.18X on average) over Flash Attention 3.
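
Tile-granular skipping can be sketched on the CPU as follows: a boolean mask marks which MxN tiles of the query-key score matrix are kept, and no dot product is computed for skipped tiles. This is a reference-style illustration, not the FG-Attn GPU kernel; the function name `tile_sparse_attention`, the toy tile sizes (smaller than the M>=16 stated in the abstract), and the hand-written mask are assumptions, and how the mask is chosen is outside this sketch.

```python
import math


def tile_sparse_attention(queries, keys, values, keep_tile, tile_m, tile_n):
    """Softmax attention that skips every score in tiles marked False.

    keep_tile[i][j] covers queries [i*tile_m, (i+1)*tile_m) and
    keys [j*tile_n, (j+1)*tile_n).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for qi, q in enumerate(queries):
        scores = {}
        for kj, k in enumerate(keys):
            if not keep_tile[qi // tile_m][kj // tile_n]:
                continue  # skipped tile: the dot product is never computed
            scores[kj] = scale * sum(a * b for a, b in zip(q, k))
        out = [0.0] * len(values[0])
        if scores:
            top = max(scores.values())  # subtract the max for stability
            weights = {kj: math.exp(s - top) for kj, s in scores.items()}
            total = sum(weights.values())
            for kj, w in weights.items():
                for d, v in enumerate(values[kj]):
                    out[d] += (w / total) * v
        outputs.append(out)
    return outputs


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
# One-hot values make each output row equal to that query's attention weights.
values = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Hypothetical mask with M=2 queries and N=1 key per tile; the tile covering
# queries 0-1 and key 3 is skipped.
keep_tile = [[True, True, True, False], [True, True, True, True]]
weights = tile_sparse_attention(queries, keys, values, keep_tile, tile_m=2, tile_n=1)
```

Because skipped scores are treated as absent rather than zero, the remaining weights in each row still sum to one.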

2508.13446 2026-06-10 cs.RO Version update

CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine

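Institutions: University of California Berkeley; Princeton University

AI summary: To address VLA models' difficulty in following fine-grained instructions, proposes using vision-language models to generate counterfactual labels that augment existing datasets and increase the diversity of language grounding; experiments show significantly higher instruction-following success rates in navigation and manipulation tasks.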

Abstract

Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.

2508.13362 2026-06-10 cs.LG Version update

Optimization-based Online Conformal Prediction for Multi-step Forecasting

Ruipu Li, Daniel Menacho, Alexander Rodríguez

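Institutions: University of Michigan

AI summary: Proposes the O2CP framework, which models multi-step error dependencies through a two-layer optimization structure and produces narrower prediction intervals while preserving marginal coverage validity; experiments in domains such as autonomous driving and climate forecasting show it outperforms existing methods.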

Abstract

Conformal prediction (CP) is well-suited for uncertainty quantification in time series forecasting due to its distribution-free coverage guarantees. However, existing multi-step methods often struggle to balance coverage validity with efficiency: they either calibrate horizons independently, ignoring temporal correlations, or enforce strict simultaneous coverage, resulting in overly conservative intervals. In this work, we propose O2CP: Optimization-based Online Conformal Prediction, a unified framework for online conformal prediction that explicitly models multi-step error dependencies without sacrificing long-term marginal coverage guarantees. We first prove that standard online conformal updates maintain validity as long as calibration parameters remain within a defined "safe" region. Leveraging this theoretical insight, we introduce a two-layer architecture: an outer layer that defines admissible parameter sets to ensure validity, and an inner layer that performs constrained optimization to model joint error distributions and minimize horizon-wide objectives. To make this computationally feasible, we develop a lightweight sampling strategy that estimates joint distributions without requiring large calibration sets. Extensive experiments on real-world datasets, including autonomous driving, climate forecasting, and public health, demonstrate that O2CP consistently outperforms state-of-the-art baselines, achieving target coverage with significantly sharper prediction intervals and reduced regret over long horizons.
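
The "standard online conformal update" mentioned in the abstract is commonly written as a one-line feedback rule on the interval radius: widen after a miss, shrink slightly after a cover. The sketch below shows that single-horizon rule only, not O2CP's two-layer design; the function name `online_conformal_radii`, the learning rate, and the synthetic Gaussian errors are assumptions for illustration.

```python
import random


def online_conformal_radii(abs_errors, alpha=0.1, lr=0.05, initial_radius=1.0):
    """Track an interval radius so the long-run miss rate approaches alpha."""
    radius = initial_radius
    radii, misses = [], 0
    for err in abs_errors:
        radii.append(radius)  # radius used for this step's interval
        miss = 1 if err > radius else 0
        misses += miss
        # Widen after a miss, shrink slightly after a cover.
        radius += lr * (miss - alpha)
    return radii, misses / len(abs_errors)


rng = random.Random(0)
# Hypothetical absolute forecast errors |y_t - yhat_t|.
abs_errors = [abs(rng.gauss(0.0, 1.0)) for _ in range(5000)]
radii, miss_rate = online_conformal_radii(abs_errors)
print(f"target miss rate=0.10 observed={miss_rate:.3f}")
```

The long-run guarantee follows from the update itself: the cumulative coverage error equals the net change in the radius divided by the learning rate, and the radius stays bounded when the errors are bounded.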

2504.02323 2026-06-10 cs.CL Version update

CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring and Feedback

Clayton Cohn, Ashwin T S, Naveeduddin Mohammed, Gautam Biswas

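Institutions: Vanderbilt University

AI summary: Proposes CoTAL, which combines Evidence-Centered Design, human-in-the-loop prompt engineering, and chain-of-thought prompting to iteratively refine LLM scoring; it improves GPT-4's scoring performance by up to 38.9% across multiple domains and is judged effective by teachers and students.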

Comments
Submitted to Computers and Education: Artificial Intelligence. Currently under review

Abstract

Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains--such as science, computing, and engineering--remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.

2508.07048 2026-06-10 cs.SD cs.AI cs.LG eess.AS Version update

Whisfusion: Parallel ASR Decoding with Masked Diffusion

Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Jongchan Kim, Hyungon Ryu, Hyuk-Jae Lee, Nam-Joon Kim

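Institutions: Seoul National University; Soongsil University; NVIDIA Corporation

AI summary: Proposes Whisfusion, which trains a dedicated masked diffusion decoder on frozen Whisper audio embeddings and performs non-autoregressive ASR via parallel diffusion decoding, surpassing Whisper-large-v3 on multilingual benchmarks while running 4-5x faster.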

Comments
16 pages, 3 figures

Abstract

Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. CTC-style non-autoregressive (NAR) systems are a natural alternative that avoids this bottleneck, but their conditional independence assumption sacrifices transcript-level generative modeling. Masked diffusion language models (e.g., LLaDA, MDLM) offer a competitive NAR text-generation approach. We ask whether such models can bring NAR ASR into the accuracy regime of strong AR ASR systems while removing the left-to-right bottleneck. We propose Whisfusion, which trains a dedicated masked diffusion decoder from scratch on top of frozen Whisper-large-v3 audio embeddings, denoising masked transcripts in just a few steps. We train on ~68k hours of 11-language speech with high-mask specialization to align training with the fully masked starting point of inference, and decode via Parallel Diffusion Decoding. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK benchmarks while running 4-5x faster, and it also surpasses Whisper-turbo in both accuracy and throughput. It reaches accuracy competitive with Canary and Qwen3-ASR while running 3-7x faster. These results establish masked diffusion as a Pareto-competitive non-autoregressive paradigm for high-throughput multilingual transcription. Code and model weights are available at https://github.com/taeyoun811/Whisfusion.

2503.24007 2026-06-10 cs.LG cs.AI Version update

CITRAS: Covariate-Informed Transformer for Time Series Forecasting

Yosuke Yamaguchi, Issei Suemitsu, Wenpeng Wei

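Institutions: Research & Development Group, Hitachi, Ltd.

AI summary: Proposes CITRAS, a decoder-only Transformer that flexibly incorporates the future portion of known covariates through KV Shift and Attention Score Smoothing and captures both local and global cross-variate dependencies to improve forecasting accuracy.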

Journal ref
IEEE Access, vol. 14, pp. 77983-77998, 2026

Abstract

In time series forecasting, covariates represent external factors that influence target variables. Some covariates are observable only in the past (observed covariates, such as recorded weather data), while others are known in advance (known covariates, such as calendar events or discount schedules). Although covariates have the potential to enhance forecasting performance, most deep learning-based forecasting models struggle to address the length discrepancy between variables caused by the future portion of known covariates and fail to leverage them flexibly. Moreover, capturing dependencies between target variables and covariates is non-trivial, as models must accurately reflect the local impact of covariates while simultaneously modeling global cross-variate dependencies. To address these challenges, we propose CITRAS, a decoder-only Transformer that flexibly integrates multiple target variables, observed covariates, and known covariates. While preserving strong autoregressive modeling capabilities, CITRAS introduces two novel mechanisms within patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates the future portion of known covariates into the forecasting process by aligning them with target variables based on their concurrent dependencies. Attention Score Smoothing refines locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the historical attention scores. Experimentally, CITRAS demonstrates strong performance across a wide range of real-world datasets in both covariate-informed and multivariate settings, showcasing its versatile ability to leverage cross-variate and cross-time dependencies for improved forecasting accuracy.

2507.19137 2026-06-10 eess.AS cs.AI cs.SD Version update

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss

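Institutions: Idiap Research Institute; The University of Texas at Austin

AI summary: Through analysis of conversational speech, finds that perceived personality varies significantly across work situations and identifies the acoustic features associated with each personality trait.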

Abstract

Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual's perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.

2507.17188 2026-06-10 cs.NI cs.AI cs.CR Version update

LLM-Aided Joint Secrecy Precoding and Trajectory for RSMA-Based Heterogeneous UAV Networks

Lijie Zheng, Ji He, Shih Yu Chang, Yulong Shen

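Institutions: School of Computer Science and Technology, Xidian University; Department of Applied Data Science, San Jose State University

AI summary: For secure communication in RSMA-based heterogeneous UAV networks, proposes a hierarchical optimization framework: the inner layer solves secrecy precoding at fixed UAV positions with the SDR-based S2DC algorithm, and the outer layer optimizes trajectories with LLM-guided multi-agent reinforcement learning, trading off secrecy rate against energy efficiency.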

Abstract

This paper investigates secure communications in rate-splitting multiple access (RSMA) enabled heterogeneous UAV networks, where multiple UAVs collaboratively serve ground terminals in the presence of eavesdroppers. By jointly considering secrecy rate maximization and propulsion energy consumption minimization, we formulate a multi-objective optimization problem involving UAV trajectory design, service association, power allocation, and secrecy precoding under mobility, collision-avoidance, service-capacity, and communication constraints. The formulated problem is highly non-convex due to the coupling among UAV trajectories, RSMA transmission variables, and secrecy constraints. To address the resulting non-convex and highly coupled optimization problem, we propose a hierarchical optimization framework. The inner layer uses a semidefinite relaxation (SDR)-based S2DC algorithm combining penalty functions and difference-of-convex (D.C.) programming to solve the secrecy precoding problem with fixed UAV positions. The outer layer introduces a Large Language Model (LLM)-guided heuristic multi-agent reinforcement learning approach (LLM-HeMARL) for trajectory optimization. LLM-HeMARL efficiently incorporates an LLM-generated expert heuristic policy, enabling UAVs to learn energy-aware, security-driven trajectories without the inference overhead of real-time LLM calls. The simulation results show that our method outperforms existing baselines in secrecy rate and energy efficiency, with consistent robustness across varying UAV swarm sizes and random seeds.

2507.15294 2026-06-10 cs.SD cs.MM Version update

MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions

Junjie Li, Wenxuan Wu, Shuai Wang, Zexu Pan, Kong Aik Lee, Helen Meng, Haizhou Li

Institutions: Department of Electrical and Electronic Engineering, Faculty of Engineering, The Hong Kong Polytechnic University; Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; School of Artificial Intelligence (SAI), The Chinese University of Hong Kong, Shenzhen; School of Intelligence Science and Technology, Nanjing University; Tongyi Lab, Alibaba Group, Singapore

AI summary: Proposes the MeMo framework, which stores attention information in two adaptive memory banks to maintain attentional momentum when visual cues are missing, enabling real-time target speaker extraction with SI-SNR improvements of at least 2 dB.

Abstract

Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate a target speaker's voice from multi-speaker environments by leveraging visual cues as guidance. However, the performance of AV-TSE systems heavily relies on the quality of these visual cues. In extreme scenarios where visual cues are missing or severely degraded, the system may fail to accurately extract the target speaker. In contrast, humans can maintain attention on a target speaker even in the absence of explicit auxiliary information. Motivated by such human cognitive ability, we propose a novel framework called MeMo, which incorporates two adaptive memory banks to store attention-related information. MeMo is specifically designed for real-time scenarios: once initial attention is established, the system maintains attentional momentum over time, even when visual cues become unavailable. We conduct comprehensive experiments to verify the effectiveness of MeMo. Experimental results demonstrate that our proposed framework achieves SI-SNR improvements of at least 2 dB over the corresponding baseline.
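
The SI-SNR figure quoted above is a scale-invariant signal-to-noise ratio: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the remainder. The sketch below follows the standard definition with mean removal; the synthetic signals and the helper name `si_snr_db` are assumptions, and this is not the paper's evaluation code.

```python
import math


def si_snr_db(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-noise ratio in dB (standard definition)."""
    mean_est = sum(estimate) / len(estimate)
    mean_ref = sum(reference) / len(reference)
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Hypothetical signals: a target tone plus an interfering tone.
reference = [math.sin(0.1 * i) for i in range(400)]
interference = [math.cos(0.37 * i) for i in range(400)]
baseline = [r + 0.2 * n for r, n in zip(reference, interference)]
improved = [r + 0.1 * n for r, n in zip(reference, interference)]
gain_db = si_snr_db(improved, reference) - si_snr_db(baseline, reference)
print(f"SI-SNR improvement: {gain_db:.2f} dB")
```

In this toy case, halving the interference amplitude yields an improvement of roughly 6 dB, and rescaling an estimate leaves its SI-SNR unchanged.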

2410.15595 2026-06-10 cs.AI cs.CL cs.LG Version update

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

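Institutions: Zhejiang University; Nanyang Technological University; Alibaba Group

AI summary: Surveys progress on Direct Preference Optimization (DPO) in theory, variants, datasets, and applications, discusses its potential and limitations as an RL-free alternative, and proposes future research directions.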

Comments
Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

Abstract

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
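
For readers unfamiliar with why DPO is "RL-free", the widely used per-pair DPO objective needs only log-probabilities from the policy and a frozen reference model: the loss is -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)). The sketch below computes that loss for one preference pair; the abstract itself does not spell out the formula, and the log-probability values and the name `dpo_loss` are assumptions for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities."""
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) equals softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical sequence log-probabilities for one preference pair.
# Policy identical to the reference: the margin is zero.
loss_at_reference = dpo_loss(-12.0, -11.0, -12.0, -11.0)
# Policy that raised the chosen response and lowered the rejected one.
loss_after_update = dpo_loss(-10.0, -13.0, -12.0, -11.0)
print(loss_at_reference, loss_after_update)
```

When the policy equals the reference the loss is log 2, and it decreases as the policy favors the chosen response more strongly than the reference does.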

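2409.02426 2026-06-10 cs.LG cs.CV Updated version

Breaking the Curse of Dimensionality: Diffusion Models Efficiently Learn Low-Dimensional Distributions

Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu

Affiliations: University of Science and Technology of China

AI summary: Proposes a new mathematical framework showing that, through an equivalence with subspace clustering, diffusion models can learn low-dimensional distributions with sample complexity linear in the intrinsic dimension, avoiding the curse of dimensionality.

Details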
Comments
37 pages, 8 figures, 2 tables
English abstract

Despite their empirical success across a wide range of generative tasks, the fundamental principles underlying the ability of diffusion models to learn data distributions are poorly understood. In this work, we develop a new mathematical framework that explains how diffusion models can effectively learn low-dimensional distributions from a finite number of training samples without suffering from the curse of dimensionality. Specifically, motivated by the intrinsic low-dimensional structure of image data, we theoretically analyze a setting in which the data distribution is modeled as a mixture of low-rank Gaussians. Under suitable network parameterization, we show that optimizing the training objective of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples, where each subspace basis corresponds to the low-rank covariance of a Gaussian component. This equivalence allows us to show that the sample complexity for learning the underlying distribution scales linearly with the intrinsic dimension of the data, rather than exponentially with the ambient dimension. Our theoretical findings are further supported by empirical evidence that demonstrates phase transition phenomena in generalization on both synthetic and real-world image datasets. Moreover, we establish a correspondence between the learned subspace bases and semantic attributes of image data, providing a principled foundation for controllable image generation.
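
To make the data model concrete, the sketch below draws samples from a mixture of low-rank Gaussians. It assumes zero-mean components whose covariance is U_k U_k^T for an orthonormal basis U_k; the abstract only says each subspace basis corresponds to a low-rank covariance, so that exact form, the function name, and the toy bases are assumptions made for illustration.

```python
import random

def sample_molrg(bases, weights, rng):
    """Draw one sample x = U_k a with a ~ N(0, I_d).

    bases: list of K bases, each a list of d vectors of length n
           (assumed orthonormal, so the covariance is U_k U_k^T).
    weights: K mixing proportions.
    """
    k = rng.choices(range(len(bases)), weights=weights, k=1)[0]
    basis = bases[k]
    coeffs = [rng.gauss(0.0, 1.0) for _ in basis]
    n = len(basis[0])
    x = [sum(a * u[i] for a, u in zip(coeffs, basis)) for i in range(n)]
    return k, x

# Two 1-dimensional subspaces (d = 1) in a 3-dimensional ambient space (n = 3).
bases = [
    [[1.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0]],
]
rng = random.Random(0)
samples = [sample_molrg(bases, [0.5, 0.5], rng) for _ in range(5)]
```

Every sample lies in one of the two lines, however large the ambient dimension is made. This is the sense in which the intrinsic dimension, not the ambient one, describes the data.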

2506.08134 2026-06-10 cs.AI cs.CY Updated version

Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem

Qiyao Wei, Samuel Holt, Jing Yang, Markus Wulfmeier, Mihaela van der Schaar

Affiliations: University of Amsterdam; University of Cambridge; ETH Zurich; University of California, Berkeley

AI summary: In response to the peer-review crisis caused by surging ML submissions, argues that AI-assisted review should be a research priority, proposes using large language models as collaborative tools for fact-checking, reviewer guidance, author improvement, and decision support, and stresses the need for more granular review data.

Details
Comments
18 pages, 3 figures. Accepted (Oral) at the ICML 2026 Position Paper Track
English abstract

Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.

2505.23341 2026-06-10 cs.CV Updated version

Dual-stream attention-guided learning for weakly supervised whole slide image classification

Daoxi Cao, Hangbei Cheng, Yijin Li, Ruolin Zhou, Xuehan Zhang, Xinyi Li, Binwei Li, Xuancheng Gu, Jianan Zhang, Xueyu Liu, Yongfei Wu

Affiliations: College of Computer Science and Technology, College of Data Science, Taiyuan University of Technology; College of Humanities, Law and Foreign Languages, Taiyuan University of Technology; College of Artificial Intelligence, Taiyuan University of Technology; School of Cyberspace Security, Beijing University of Posts and Telecommunications; School of Mathematics, Taiyuan University of Technology

AI summary: Proposes a dual-stream attention-guided learning framework that uses a teacher-student dual-stream architecture and attention-guided pseudo labels to address key-region identification and instance-relationship modeling in whole slide images under weak supervision, outperforming existing methods on synthetic and real pathology datasets.

Details
English abstract

Whole slide images (WSIs) play a crucial role in cancer diagnosis due to their ultra-high resolution and rich morphological information, and multiple instance learning (MIL) has become a prevalent paradigm for handling the massive size of WSIs and the scarcity of fine-grained instance-level annotations. However, most existing MIL methods struggle to accurately identify diagnostically critical local regions (instances) using only slide-level labels, and have difficulty modelling the relationships among instances efficiently. To address these shortcomings, we propose a Dual-Stream Attention-Guided Learning (DSAGL) framework. DSAGL bridges slide-level supervision and instance-level learning through a teacher-student dual-stream architecture, and mitigates instance ambiguity by generating attention-guided pseudo labels. The framework employs a shared lightweight encoder to efficiently model long-range dependencies and an attention-based fusion mechanism to enhance sensitivity to sparse, informative regions. Extensive experiments on synthetic benchmarks and real-world pathological WSI datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL methods, achieving superior discriminative performance and robustness under weak supervision.

2506.14753 2026-06-10 cs.CV cs.LG Updated version

Cost-Aware Routing for Efficient Text-To-Image Generation

Qinchan Li, Kenneth Chen, Changyue Su, Wittawat Jitkrittum, Qi Sun, Patsorn Sangkloy

Affiliations: Tandon School of Engineering, New York University; Google; Eigen 4D Inc.

AI summary: Proposes a cost-aware routing framework that automatically selects a number of denoising steps or a model according to prompt complexity, lowering computational cost while maintaining high quality and outperforming any single model.

Details
Comments
Accepted by TMLR
English abstract

Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone. Code is available at https://github.com/winglicopy/CATImage.
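
The abstract does not give the routing rule, so the sketch below shows one plausible form only: pick the candidate that maximizes predicted quality minus a cost penalty. The candidate names, quality scores, costs, and the `trade_off` parameter are hypothetical values invented for this example and do not come from the paper or its code.

```python
def route(predicted_quality, cost, trade_off):
    """Return the candidate maximizing quality - trade_off * cost.

    predicted_quality: dict of candidate name -> predicted quality for one prompt
    cost: dict of candidate name -> relative compute cost
    trade_off: how much quality one unit of cost is worth (hypothetical knob)
    """
    return max(predicted_quality,
               key=lambda name: predicted_quality[name] - trade_off * cost[name])

# Hypothetical candidates and numbers, for illustration only.
cost = {"distilled_4_steps": 1.0, "diffusion_25_steps": 6.0, "diffusion_100_steps": 25.0}
simple_prompt = {"distilled_4_steps": 0.80, "diffusion_25_steps": 0.82, "diffusion_100_steps": 0.83}
complex_prompt = {"distilled_4_steps": 0.40, "diffusion_25_steps": 0.65, "diffusion_100_steps": 0.90}

choice_simple = route(simple_prompt, cost, trade_off=0.01)
choice_complex = route(complex_prompt, cost, trade_off=0.01)
```

With these numbers the simple prompt goes to the cheap distilled candidate and the complex prompt to the 100-step candidate, which is the behavior the abstract describes. In practice the quality predictions would come from a learned model.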

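2406.08726 2026-06-10 cs.CL Updated version

Standard Language Ideology in AI-Generated Language

Genevieve Smith, Eve Fleisig, Ishita Rustagi, Xavier Yin

Affiliations: Stanford University; UC Berkeley

AI summary: Proposes a taxonomy showing how large language models reinforce standard language ideology and marginalize language varieties, and discusses the societal implications and recommendations for addressing them.

Details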
Comments
To appear in the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
English abstract

Large language models (LLMs) generate text that reinforces standard language ideology: a bias towards certain language varieties that are granted more prestige, authority, and legitimacy than others. This paper contributes a sociotechnically grounded faceted taxonomy that illustrates how generative AI systems reproduce standard language ideology and its societal implications. We introduce the concept of standard AI-generated language ideology to explain how AI systems confer legitimacy on certain language varieties while marginalizing others, structuring patterns of performance disparity, stereotyping, appropriation, and erasure. We then discuss ongoing tensions around what constitutes desirable system behavior, as well as advantages and drawbacks of generative AI tools attempting or refusing to imitate different language varieties. To address the power relations shaping generative AI and the mechanisms identified in our taxonomy--legitimation, stereotyping, appropriation, and erasure--we offer recommendations that emphasize accountability, community agency, control, and ownership. These recommendations recognize linguistic diversity as a resource to be protected, valued, and sustained as part of a just AI future.

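2506.09171 2026-06-10 cs.LG cs.AI cs.CL Updated version

Fact-Augmented Lookahead Planning for LLM Agents

Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

Affiliations: University of Cambridge

AI summary: Proposes the LWM-Planner framework, which extracts key facts from trajectories and uses them to condition action proposal, world-model simulation, and state-value estimation, improving planning online without parameter updates and outperforming ReAct/Reflexion and search-only baselines across several environments.

Details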
Comments
Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026). Camera-ready version. 9-page main text plus appendices (63 pages total), 1 figure
English abstract

Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observable, long-horizon environments when search is unguided or recent history is insufficient. We introduce LWM-Planner, a fact-augmented lookahead planning framework that improves agent behavior purely through in-context learning. After each episode, the agent extracts task-critical atomic facts from its trajectories, validates candidates with a lightweight predictive-consistency filter (and optionally compresses them), and uses the resulting fact set to condition action proposal, single-step latent world-model simulation, and state-value estimation. Planning then proceeds via recursive, depth-limited lookahead over candidate trajectories conditioned on the accumulated facts and recent history, enabling online improvement without parameter updates. We provide abstraction-style motivation: treating facts as reducing state aliasing (proxy $\epsilon_{\mathrm{sim}}$) and fact-conditioned simulation as lowering one-step error (proxy $\delta_{\mathrm{model}}$), without claiming formal guarantees. Empirically, on text FrozenLake variants, CrafterMini, and ALFWorld, the approach improves cumulative return over ReAct/Reflexion and search-only baselines, suggesting that additional test-time search is most useful when grounded by compact, experience-derived facts.

2506.03411 2026-06-10 cs.LG cs.GT Updated version

A Machine Learning Theory Perspective on Strategic Litigation

Melissa Dutz, Han Shao, Avrim Blum, Aloni Cohen

Affiliations: Toyota Technological Institute at Chicago; University of Maryland; The University of Chicago

AI summary: Uses machine learning theory to model how a strategic litigator in a common law system influences a lower court's decision rule through case selection, analyzes the litigator's power and optimal strategies, and finds counterintuitive phenomena.

Details
English abstract

Strategic litigation involves bringing a case to court with the goal of having an impact beyond resolving the particular dispute at hand. In a common law system, one way a case may have far-reaching impact is by establishing new legal precedent that later courts must follow. In this paper, we explore strategic litigation from the perspective of machine learning theory. We consider an abstract model of a common law legal system where a lower court decides new cases by applying a decision rule learned from a higher court's past rulings. In this model, we explore the power of a strategic litigator, who strategically brings cases to the higher court to influence the decision rule applied by the lower court in future cases. We explore questions including: What impact can a strategic litigator have? Which cases should a strategic litigator bring to court? Does it ever make sense for a strategic litigator to bring a case when they are sure the court will rule against them? We show that this strategic case selection problem has interesting structure, with even simple settings exhibiting counterintuitive phenomena. When cases are represented by points in one dimension and the lower court's learning algorithm is nearest neighbor, or as points in d dimensions and the lower court's learning algorithm is a support vector machine, we characterize the set of inducible decision rules and develop algorithms for selecting an optimal set of cases to bring to the higher court given the strategic litigator's objectives.
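
The one-dimensional nearest-neighbor setting can be illustrated with a toy example. In the sketch below, the higher court's threshold rule, the precedent positions, and the function names are hypothetical choices for illustration; they are not the paper's construction or its algorithms.

```python
def nearest_neighbor_rule(precedents, x):
    """Lower court: copy the outcome of the closest precedent (ties go to the earlier one)."""
    return min(precedents, key=lambda case: abs(case[0] - x))[1]

def higher_court(x, threshold=3.0):
    """Hypothetical ground-truth rule of the higher court: +1 above the threshold, else -1."""
    return 1 if x > threshold else -1

# Existing precedents: (position of the case, higher court's ruling).
precedents = [(0.0, higher_court(0.0)), (10.0, higher_court(10.0))]
before = nearest_neighbor_rule(precedents, 4.0)

# The litigator brings a new case at x = 5 and the higher court rules on it.
precedents.append((5.0, higher_court(5.0)))
after = nearest_neighbor_rule(precedents, 4.0)
```

Before the new case, a dispute at x = 4 is decided -1 because the precedent at 0 is closest. After it, the same dispute is decided +1. The litigator changed the lower court's effective rule without changing how the higher court decides.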

2411.05698 2026-06-10 cs.CV cs.AI cs.LG Updated version

Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification

Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla

Affiliations: Politecnico di Milano

AI summary: Proposes the Visual-TCAV framework, which combines Concept Activation Vectors with Integrated Gradients to generate class-agnostic saliency maps and estimate concept attributions, aligning with ground-truth explanations better than TCAV in a controlled experiment.

Details
Comments
Accepted in TMLR
English abstract

Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is challenging due to the size and complexity of these models. State-of-the-art saliency methods generate local explanations highlighting the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction. On the other hand, concept-based methods, such as TCAV, provide insights into how sensitive the network is to a human-defined concept but cannot compute its attribution in a specific prediction nor show its location within the input image. We introduce Visual-TCAV, a novel explainability framework aiming to bridge the gap between these methods by providing both local and global explanations. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate class-agnostic saliency maps that show where the network recognizes a certain concept. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. We evaluate the method's faithfulness via a controlled experiment where the ground truth for explanations is known, showing better ground truth alignment than TCAV. Our code is available at https://github.com/DataSciencePolimi/Visual-TCAV.

2505.23851 2026-06-10 cs.CL cs.AI cs.SC Updated version

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Michael Shalyt, Rotem Elimelech, Ido Kaminer

Affiliations: MIT; Technion - Israel Institute of Technology

AI summary: Proposes the ASyMOB benchmark of 35,368 symbolic math problems, uses perturbation tests to expose the limited robustness of large models in symbolic mathematical reasoning, and finds complementary potential between LLMs and CAS.

Details
Comments
Published in ICML 2026: https://icml.cc/virtual/2026/poster/63549 ; Code repository: https://github.com/RamanujanMachine/ASyMOB ; Complete benchmark dataset: https://huggingface.co/datasets/Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark
English abstract

Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, ASyMOB systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. ASyMOB serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.

2505.16319 2026-06-10 cs.LG Updated version

FreshRetailNet-LT: A Stockout-Annotated Censored Demand Dataset for Latent Demand Recovery and Forecasting in Fresh Retail

Yangyang Wang, Jiawei Gu, Li Long, Xin Li, Li Shen, Zhouyu Fu, Xiangjun Zhou, Xu Jiang

Affiliations: Fresh Retail, Inc.

AI summary: To address sales data censored by stockouts in fresh retail, proposes FreshRetailNet-50K, the first large-scale benchmark dataset, containing 50,000 hourly sales series with stockout annotations, and demonstrates a two-stage demand modeling approach that improves prediction accuracy by 2.73% and reduces demand underestimation bias from 7.37% to near zero.

Details
Comments
FreshRetailNet-LT is a new version of FreshRetailNet-50K, with data spanning over two years
English abstract

Accurate demand estimation is critical for the retail business in guiding the inventory and pricing policies of perishable products. However, it faces fundamental challenges from censored sales data during stockouts, where unobserved demand creates systemic policy biases. Existing datasets lack the temporal resolution and annotations needed to address this censoring effect. To fill this gap, we present FreshRetailNet-50K, the first large-scale benchmark for censored demand estimation. It comprises 50,000 store-product time series of detailed hourly sales data from 898 stores in 18 major cities, encompassing 863 perishable SKUs meticulously annotated for stockout events. The hourly stock status records unique to this dataset, combined with rich contextual covariates, including promotional discounts, precipitation, and temporal features, enable innovative research beyond existing solutions. We demonstrate one such use case of two-stage demand modeling: first, we reconstruct the latent demand during stockouts using precise hourly annotations. We then leverage the recovered demand to train robust demand forecasting models in the second stage. Experimental results show that this approach achieves a 2.73% improvement in prediction accuracy while reducing the systematic demand underestimation from 7.37% to near-zero bias. With unprecedented temporal granularity and comprehensive real-world information, FreshRetailNet-50K opens new research directions in demand imputation, perishable inventory optimization, and causal retail analytics. The unique annotation quality and scale of the dataset address long-standing limitations in retail AI, providing immediate solutions and a platform for future methodological innovation. The data (https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K) and code (https://github.com/Dingdong-Inc/frn-50k-baseline) are openly released.
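
The first stage of the two-stage idea, recovering latent demand during stockout hours, can be sketched as follows. The imputation rule used here (the mean of in-stock sales at the same hour of day) is a simple stand-in chosen for illustration, not the method from the paper. The function name and toy data are hypothetical.

```python
def impute_latent_demand(sales, in_stock, hours_per_day=24):
    """Replace censored hours with the mean in-stock sales for the same hour of day."""
    totals = [0.0] * hours_per_day
    counts = [0] * hours_per_day
    for t, (s, ok) in enumerate(zip(sales, in_stock)):
        if ok:
            totals[t % hours_per_day] += s
            counts[t % hours_per_day] += 1
    recovered = []
    for t, (s, ok) in enumerate(zip(sales, in_stock)):
        h = t % hours_per_day
        if ok or counts[h] == 0:
            # Keep the observed value (or the censored one if no reference exists).
            recovered.append(float(s))
        else:
            recovered.append(totals[h] / counts[h])
    return recovered

# Toy series with a 2-hour "day" so the example stays readable.
sales = [4, 6, 0, 8, 6, 0]
in_stock = [True, True, False, True, True, False]
recovered = impute_latent_demand(sales, in_stock, hours_per_day=2)
```

The recovered series would then serve as the training target for a forecasting model in the second stage. Training on the raw `sales` instead would treat stockout hours as genuine zero demand, which leads to the underestimation the abstract describes.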

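2505.11702 2026-06-10 cs.LG stat.ML Updated version

Post-Training Augmentation Invariance

Keenan Eikenberry, Lizuo Liu, Yoonsang Lee

Affiliations: Department of Mathematics, Dartmouth College

AI summary: Proposes a framework for post-training augmentation invariance in which lightweight MLP adapter networks on the latent space of a pretrained model achieve approximate invariance without fine-tuning while preserving the original features.

Details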
English abstract

This work develops a framework for post-training augmentation invariance, in which our goal is to add invariance properties to a pretrained network without altering its behavior on the original, non-augmented input distribution. We define this notion precisely and additionally introduce augmented encoders, which are probabilistic encoders that formalize augmentation-based encoding processes and that serve as our fundamental object of study. We introduce two losses for augmented encoders, namely, Markov-Wasserstein minimization and Wasserstein correlation maximization, and we demonstrate empirically that both losses can be used to train lightweight, one-hidden-layer MLP adapter networks $E_\theta$ that, when appended to the latent space of a pretrained network $F$, do indeed lead to (approximate) post-training augmentation invariance. For example, on STL10 with $F=\text{DINO}$ features, the composite network $C\circ E_\theta\circ F$, where $C$ is a linear classifier and where $E_\theta$ is one of our proposed adapter networks, achieves 94% classification accuracy on arbitrarily rotated images, whereas a network of the form $C\circ F$ without the adapter $E_\theta$ drops to 71% accuracy. Similarly, we can boost noise-invariant classification results from 58% up to 86%. Significantly, we obtain these results with no fine-tuning (the weights of $F$ remain frozen throughout), and our methods introduce little corruption to the original features, since $E_\theta$ acts nearly isometrically on the non-augmented latent distribution. In contrast, we show that adapter networks trained with alternative candidate losses, specifically SimCLR and HSIC maximization, produce uncompetitive classification results and fundamentally corrupt the original latent space. Code available at https://github.com/keenan-eikenberry/augmentation_invariance

2505.11034 2026-06-10 cs.CV cs.AI cs.LG Updated version

CleanPatrick: A Benchmark for Image Data Cleaning

Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly

Affiliations: University of Basel; Lucerne University of Applied Sciences and Arts; University Hospital of Basel; Northwestern University; Northeast Dermatology Associates; Medical University of Vienna; Banner Health

AI summary: Proposes CleanPatrick, the first large-scale benchmark for image data cleaning, built on the Fitzpatrick17k dermatology dataset; collects a large number of crowdsourced annotations aggregated with an item-response-theory-inspired model, formalizes issue detection as a ranking task, and evaluates a range of methods.

Details
Comments
Accepted at Journal of Data-centric Machine Learning Research (DMLR)
English abstract

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (32%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and employs standard ranking metrics that mirror real audit workflows. We benchmark classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, FINE, BHN, and SelfClean. On CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and detecting implausible labels under conservative human judgment remains challenging for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies.
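
Treating issue detection as a ranking task means a cleaning method is scored by how early true issues appear in its ranked list. The sketch below computes average precision, one common ranking metric. The abstract only says "standard ranking metrics", so the choice of metric, the function name, and the toy scores are assumptions for illustration.

```python
def average_precision(scores, is_issue):
    """Average precision of a ranking that sorts samples by descending score.

    scores: per-sample "likely to be an issue" scores from a cleaning method
    is_issue: per-sample ground-truth flags (True if the sample has the issue)
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if is_issue[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# Hypothetical scores for four samples, two of which are true issues.
scores = [0.9, 0.1, 0.8, 0.3]
is_issue = [True, False, False, True]
ap = average_precision(scores, is_issue)
```

A metric of this kind matches an audit workflow in which a reviewer inspects samples from the top of the list and stops when the review budget runs out.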

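2501.00745 2026-06-10 cs.CL cs.AI cs.GT cs.IR econ.TH Updated version

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

Xiyang Hu

Affiliations: Arizona State University

AI summary: Models ranking manipulation attacks as an Infinitely Repeated Prisoners' Dilemma, analyzes the conditions under which cooperation is sustained, and finds that lowering the attack success rate can paradoxically incentivize attacks and that defensive measures are ineffective in some cases.

Details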
Comments
New Frontiers in Game-Theoretic Learning Workshop, International Conference on Machine Learning (ICML) 2026
English abstract

The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.
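
The claim that forward-looking players sustain cooperation can be seen in the textbook repeated Prisoner's Dilemma. Under a grim trigger strategy, cooperating forever is worth R / (1 - d) and a one-shot deviation is worth T + d * P / (1 - d), so cooperation holds when the discount factor d is at least (T - R) / (T - P). The sketch below uses this generic payoff form (T > R > P). The paper's own payoffs, which involve attack costs and success rates, are not given in the abstract and are not reproduced here.

```python
def critical_discount_factor(temptation, reward, punishment):
    """Smallest discount factor at which grim trigger sustains cooperation.

    Generic textbook payoffs with temptation > reward > punishment;
    these are not the paper's attack-specific payoffs.
    """
    return (temptation - reward) / (temptation - punishment)

def cooperation_sustained(discount, temptation, reward, punishment):
    """True if a player prefers cooperating forever to a one-shot deviation."""
    return discount >= critical_discount_factor(temptation, reward, punishment)

# Illustrative payoffs: T = 5, R = 3, P = 1.
threshold = critical_discount_factor(5.0, 3.0, 1.0)
patient = cooperation_sustained(0.9, 5.0, 3.0, 1.0)
impatient = cooperation_sustained(0.3, 5.0, 3.0, 1.0)
```

With these payoffs the threshold is 0.5: a player with a discount factor of 0.9 keeps cooperating, while one with 0.3 defects.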

2505.08213 2026-06-10 cs.RO Updated version

HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands

Huang Junda, Honghao Guo, Hao Wu, Zhengyang Liu, Marcelo H Ang, Jianshu Zhou

Affiliations: The Chinese University of Hong Kong; National University of Singapore

AI summary: Proposes HandCept, the first visual-inertial proprioception framework, which fuses a wrist-mounted RGB-D camera with 9-axis IMUs through zero-shot learning and a latency-free Extended Kalman Filter, achieving joint angle estimation errors of 2° to 4° without drift and outperforming visual-only and inertial-only methods.

Details
Comments
8 pages, 7 figures, conference
English abstract

As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, the first visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods for dexterous hands. HandCept addresses the difficulty of achieving accurate and robust joint angle estimation in dynamic environments where both visual and inertial measurements are prone to noise and drift. It leverages a zero-shot learning approach using a wrist-mounted RGB-D camera and 9-axis IMUs, fused in real time via a latency-free Extended Kalman Filter (EKF). Our results show that HandCept achieves joint angle estimation errors generally between $2^{\circ}$ and $4^{\circ}$ without observable drift, outperforming visual-only and inertial-only methods. Furthermore, we validate the stability and uniformity of the IMU system, demonstrating that a common base frame across IMUs simplifies system calibration. To support sim-to-real transfer, we also open-source our high-fidelity rendering pipeline, which is essential for training without real-world ground truth. This work offers a robust, generalizable solution for proprioception in dexterous hands, with significant implications for robotic manipulation and human-robot interaction. Code: https://github.com/huangjund/blenderYCB

2505.01458 2026-06-10 cs.RO cs.AI Updated version

A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

Affiliations: Department of Computer Science, City University of Hong Kong; School of Electrical and Electronic Engineering, Nanyang Technological University; Department of Informatics, Universität Hamburg

AI summary: Surveys the key properties, task support, and hardware requirements of physics simulators for narrowing the sim-to-real gap in navigation and manipulation for Embodied AI, and provides a resource of benchmark datasets, metrics, platforms, and methods.

Details
Comments
Under Review
English abstract

Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing properties that have received limited attention in prior surveys. We also analyze their features for navigation and manipulation tasks, as well as their hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and methods to help researchers select suitable tools while accounting for hardware constraints.