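arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.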

2504.12747 2026-05-29 cs.CV

Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

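Guanyu Wang, Kailong Wang, Yihao Huang, Mingyi Zhou, Geguang Pu, Li Li

AI summary: Proposes a cross-image anti-personalization framework that enforces style consistency across perturbed images and uses a dynamic ratio adjustment strategy to strengthen resistance to diffusion-model personalization attacks.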

Abstract

The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.

2504.12512 2026-05-29 cs.RO cs.SY eess.SY

Practical Insights on Grasp Strategies for Mobile Manipulation in the Wild

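Isabella Huang, Richard Cheng, Sangwoon Kim, Dan Kruse, Carolyn Chen, Lukas Kaul, JC Hancock, Shanmuga Harikumar, Mark Tjersland, James Borders, Dan Helmick

AI summary: Based on deployment experiments with the SHOPPER mobile manipulation robot in a real grocery store, this paper presents a design approach for general grasp strategies and analyzes key failure modes across hundreds of pick attempts, offering the robotics community practical insights and open challenges.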

Comments 8 pages, 8 figures, submitted to IROS 2025

Abstract

Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store -- an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field.

2503.20897 2026-05-29 cs.CV

Domain-Agnostic Feature Modulation for Semi-Supervised Domain Generalization

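Venuri Amarasinghe, Kalinga Bandara, Isun Randila, Asini Jayakody, Chamuditha Jayanga Galappaththige, Ranga Rodrigo

AI summary: To address the lack of domain labels in semi-supervised domain generalization, proposes a feature modulation strategy and a loss-scaling function that enhance class-discriminative features, suppress domain-specific information, and dynamically lower the pseudo-label confidence threshold, significantly improving generalization on multiple benchmarks.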

Comments Accepted at CVPRW 2026

Abstract

Semi-supervised domain generalization (SSDG) leverages a small fraction of labeled data alongside unlabeled data to enhance model generalization. Most existing SSDG methods rely on pseudo-labeling (PL) for unlabeled data, often assuming access to domain labels, a privilege not always available. However, domain shifts introduce domain noise, leading to inconsistent PLs that degrade model performance. Methods derived from FixMatch suffer particularly from lower PL accuracy, reducing the effectiveness of unlabeled data. To address this, we tackle the more challenging domain-label agnostic SSDG, where domain labels for unlabeled data are not available during training. First, we propose a feature modulation strategy that enhances class-discriminative features while suppressing domain-specific information. This modulation shifts features toward Similar Average Representations (a modified version of class prototypes) that are robust across domains, encouraging the classifier to distinguish between closely related classes and the feature extractor to form tightly clustered, domain-invariant representations. Second, to mitigate domain noise and improve pseudo-label accuracy, we introduce a loss-scaling function that dynamically lowers the fixed confidence threshold for pseudo-labels, optimizing the use of unlabeled data. With these key innovations, our approach achieves significant improvements on four major domain generalization benchmarks, even without domain labels. We will make the code available.
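To make the role of the confidence threshold concrete, here is a minimal sketch of the fixed-threshold pseudo-labeling rule used by FixMatch-style methods, which the abstract takes as its starting point. It does not reproduce the paper's loss-scaling function, which the abstract does not specify; the function names, the 0.95 default, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(logits_weak, threshold=0.95):
    """Return (label, confidence, keep) for one unlabeled sample.

    keep is True only when the top class probability reaches the threshold,
    i.e. the fixed-threshold rule (hypothetical helper, not the paper's code).
    """
    probs = softmax(logits_weak)
    confidence = max(probs)
    label = probs.index(confidence)
    return label, confidence, confidence >= threshold

def masked_unlabeled_loss(batch_weak, batch_strong, threshold=0.95):
    """Mean cross-entropy of strong-view predictions against pseudo-labels,
    counting only samples whose weak-view confidence passes the threshold."""
    total = 0.0
    for weak, strong in zip(batch_weak, batch_strong):
        label, _, keep = pseudo_label(weak, threshold)
        if keep:
            total += -math.log(softmax(strong)[label])
    return total / len(batch_weak)

# Toy batch: the first sample is confident, the second is not.
batch_weak = [[5.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
batch_strong = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
loss_strict = masked_unlabeled_loss(batch_weak, batch_strong, threshold=0.95)
loss_relaxed = masked_unlabeled_loss(batch_weak, batch_strong, threshold=0.5)
```

With the strict threshold only the first sample contributes to the loss; lowering the threshold admits the second, less certain pseudo-label as well. That is the trade-off a dynamically lowered threshold has to manage: more unlabeled data is used, at the risk of noisier labels.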

2503.13844 2026-05-29 cs.CL cs.AI cs.CY cs.LG

Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies

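Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini

AI summary: Develops a lightweight persuasive-text detection model (state-of-the-art on Subtask 3 of SemEval 2023 Task 3) and applies it to a dataset of Facebook ads from the 2022 Australian Federal Election, revealing patterns in how political campaigns vary funding strategies, word choices, demographic targeting, and persuasion intensity as the election approaches.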

DOI: 10.1609/icwsm.v20i1.42714
Journal ref: Proceedings of the International AAAI Conference on Web and Social Media 20(1) (2026) 1587-1608

Abstract

Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model's practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.

2503.00779 2026-05-29 cs.RO

Phantom: Training Robots Without Robots Using Only Human Videos

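Marion Lepert, Jiaying Fang, Jeannette Bohg

AI summary: Proposes a framework that trains robot manipulation policies from human video demonstrations alone, converting them into robot-compatible observation-action pairs via hand pose estimation and visual data editing, enabling zero-shot deployment with success rates of up to 92%.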

Comments Project website at https://phantom-human-videos.github.io

Journal ref: The 9th Conference on Robot Learning (CoRL 2025)

Abstract

Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations, which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing. We inpaint the human arm and overlay a rendered robot to align the visual domains. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates (up to 92%) on a range of tasks including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning.

2502.16548 2026-05-29 cs.LG cs.AI cs.CV

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

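Jianzhou Chen, Jinyang Sun, Xiumei Wang, Xi Chen, Heyu Chu, Guo Song, Yuji Luo, Xingping Zhou, Rong Gu

AI summary: Proposes a composable multimodal framework that integrates cine CMR imaging, structured clinical metrics, and unstructured textual records to predict heart failure prognosis more accurately than single-modal AI algorithms and to support optimized, personalized treatment.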

Abstract

Objective. Heart failure (HF) is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the World Health Organization (WHO) and other public health agencies. While significant progress has been made in the field of heart failure, leading to improved survival rates and ejection fraction, there remain substantial unmet needs due to the complexity and multifactorial nature of the condition. This study aims to propose and evaluate a composable strategy framework for assessment and treatment optimization in heart failure, designed to provide more holistic patient evaluation and management. Approach. The framework leverages multi-modal algorithms to analyze a comprehensive range of patient data, explicitly integrating cine cardiac magnetic resonance (cine CMR) sequences, structured clinical metrics (e.g., lab results, demographics), and unstructured textual records (e.g., medical history, prescriptions). By integrating these various data sources, our framework offers a more holistic evaluation and optimized treatment plan for patients. Main results. The multi-modal framework demonstrates superior accuracy in HF prognosis prediction compared to single-modal AI algorithms. Additionally, it enables a detailed evaluation of the impact of various pathological indicators on HF outcomes. Significance. By integrating heterogeneous clinical data in a systematic manner, this approach supports more comprehensive prognosis assessment and facilitates optimized, personalized treatment planning for heart failure patients.

2412.15632 2026-05-29 cs.CV

A New Method to Capturing Compositional Knowledge in Linguistic Space

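Jiahe Wan

AI summary: Proposes YUKINO, which uses textual inversion and "no" logical regularization to improve the compositional understanding of vision-language models without hard negatives, outperforming existing multimodal SOTA models by over 8% on the SugarCREPE benchmark.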

DOI: 10.1016/j.neucom.2026.133150
Journal ref: Neurocomputing 2026, 679, 133150

Abstract

Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, most existing methods often rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We propose introducing "no" logical regularization to address the issue of token interaction in inversion. Additionally, we suggest using knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark, and also achieves significant improvements in image retrieval tasks.

2411.14279 2026-05-29 cs.CV cs.CL

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

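Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

AI summary: To address hallucinations caused by language bias in large vision-language models, proposes the LACING framework, which uses a multimodal dual-attention mechanism and a soft-image guidance strategy to strengthen visual comprehension and reduce hallucinations without additional training resources.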

Comments EMNLP 2025

Abstract

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).

2411.00278 2026-05-29 cs.LG

KAN-AD: Time Series Anomaly Detection with Kolmogorov-Arnold Networks

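Quan Zhou, Changhua Pei, Fei Sun, Jing Han, Zhengwei Gao, Dan Pei, Haiming Zhang, Gaogang Xie, Jianhui Li

AI summary: To address forecasting models overfitting to local fluctuations in time series anomaly detection, proposes KAN-AD, which replaces B-splines with truncated Fourier expansions, emphasizes global patterns while resisting local disturbances, and improves average detection accuracy by 15% on four benchmarks.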

Comments 11 pages, ICML 2025

Journal ref: Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:79136-79149, 2025

Abstract

Time series anomaly detection (TSAD) underpins real-time monitoring in cloud services and web systems, allowing rapid identification of anomalies to prevent costly failures. Most TSAD methods driven by forecasting models tend to overfit by emphasizing minor fluctuations. Our analysis reveals that effective TSAD should focus on modeling "normal" behavior through smooth local patterns. To achieve this, we reformulate time series modeling as approximating the series with smooth univariate functions. The local smoothness of each univariate function ensures that the fitted time series remains resilient against local disturbances. However, a direct KAN implementation proves susceptible to these disturbances due to the inherently localized characteristics of B-spline functions. We thus propose KAN-AD, replacing B-splines with truncated Fourier expansions and introducing a novel lightweight learning mechanism that emphasizes global patterns while staying robust to local disturbances. On four popular TSAD benchmarks, KAN-AD achieves an average 15% improvement in detection accuracy (with peaks exceeding 27%) over state-of-the-art baselines. Remarkably, it requires fewer than 1,000 trainable parameters, resulting in a 50% faster inference speed compared to the original KAN, demonstrating the approach's efficiency and practical viability.
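The idea of modeling "normal" behavior with a truncated Fourier expansion can be illustrated with a small sketch. It is not KAN-AD itself: the abstract describes a learned, lightweight mechanism, whereas the code below obtains the coefficients by closed-form projection onto the first few harmonics and scores each point by its residual. The function names, the number of harmonics, and the synthetic signal are assumptions chosen for illustration.

```python
import math

def fourier_fit(series, num_harmonics):
    """Project an evenly sampled series onto a truncated Fourier basis.

    Returns the smooth reconstruction built from the mean plus the first
    num_harmonics sine/cosine pairs (requires 2 * num_harmonics < len(series)).
    """
    n = len(series)
    mean = sum(series) / n
    fitted = [mean] * n
    for k in range(1, num_harmonics + 1):
        # Discrete Fourier coefficients of the k-th harmonic.
        a = 2.0 / n * sum(y * math.cos(2 * math.pi * k * t / n) for t, y in enumerate(series))
        b = 2.0 / n * sum(y * math.sin(2 * math.pi * k * t / n) for t, y in enumerate(series))
        for t in range(n):
            angle = 2 * math.pi * k * t / n
            fitted[t] += a * math.cos(angle) + b * math.sin(angle)
    return fitted

def anomaly_scores(series, num_harmonics=3):
    """Absolute residual between each observation and the smooth fit."""
    fitted = fourier_fit(series, num_harmonics)
    return [abs(y - f) for y, f in zip(series, fitted)]

# A clean periodic signal with one injected spike at index 40.
signal = [math.sin(2 * math.pi * t / 64) for t in range(64)]
signal[40] += 5.0
scores = anomaly_scores(signal)
spike_index = scores.index(max(scores))
```

Because each basis function spans the whole series, a single spike changes every coefficient only slightly, so the fit stays close to the underlying cycle and the spike stands out in the residual. A localized basis such as B-splines can bend toward the spike instead, which is the weakness the abstract attributes to a direct KAN implementation.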

2408.15451 2026-05-29 cs.LG cs.CR stat.ME

Certified Causal Defense with Generalizable Robustness

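Yiran Qiao, Yu Yin, Chen Chen, Jing Ma

AI summary: Proposes the GLEAN framework, which disentangles causal relations from spurious correlations through certifiable causal factor learning and designs a causally certified defense strategy, generalizing robustness across domains with distribution shifts.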

Comments Accepted by AAAI 2025

Abstract

While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, there have emerged numerous efforts in adversarial defense. Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range (e.g., $l_2$ ball). However, most existing works in this line struggle to generalize their certified robustness in other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, and thereby exclude the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noises on data in the training distribution but also can generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains. Code is available in the supplementary materials.

2404.07977 2026-05-29 cs.CV

Gaga: Group Any Gaussians via 3D-aware Memory Bank

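Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, Ming-Hsuan Yang

AI summary: Proposes the Gaga framework, which takes inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models and associates object masks across views through a 3D-aware memory bank to reconstruct and segment open-world 3D scenes.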

Comments TMLR Camera-Ready Version. Project Page: https://weijielyu.github.io/Gaga

Abstract

We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models. In contrast to prior 3D scene segmentation approaches that rely on video object tracking or contrastive learning methods, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, which is particularly beneficial for sparsely sampled images, and ensures precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as 3D scene understanding and manipulation.

2310.14161 2026-05-29 cs.LG

Promoting Generalization for Exact Solvers via Adversarial Instance Augmentation

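Haoyang Liu, Yufei Kuang, Jie Wang, Xijun Li, Yongdong Zhang, Feng Wu

AI summary: To address the performance degradation of learning-based MILP solvers on unseen instances, proposes AdaSolver, an adversarial instance augmentation method that formulates non-differentiable instance augmentation as a contextual bandit problem and adversarially trains the augmentation policy together with the solver, significantly improving the generalization of imitation-learning-based and reinforcement-learning-based branch-and-bound solvers.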

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Abstract

Machine learning has been successfully applied to improve the efficiency of Mixed-Integer Linear Programming (MILP) solvers. However, learning-based solvers often suffer from severe performance degradation on unseen MILP instances -- especially on large-scale instances from a perturbed environment -- due to the limited diversity of training distributions. To tackle this problem, we propose a novel approach, called Adversarial Instance Augmentation, that does not require knowing the problem type to generate new instances and that promotes data diversity for learning-based branching modules in branch-and-bound (B&B) solvers (AdaSolver). We use bipartite graph representations of MILP instances and obtain various perturbed instances to regularize the solver by augmenting the graph structures with a learned augmentation policy. The major technical contribution of AdaSolver is that we formulate the non-differentiable instance augmentation as a contextual bandit problem and adversarially train the learning-based solver and augmentation policy, enabling efficient gradient-based training of the augmentation policy. To the best of our knowledge, AdaSolver is the first general and effective framework for understanding and improving the generalization of both imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based) B&B solvers. Extensive experiments demonstrate that by producing various augmented instances, AdaSolver leads to a remarkable efficiency improvement across various distributions.
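A bipartite graph representation of a MILP instance can be sketched as follows: one node per variable, one node per constraint, and an edge wherever a variable has a nonzero coefficient in a constraint. The node features and the `drop_edge` perturbation here are assumptions made for illustration; the abstract does not spell out AdaSolver's features, and its augmentations come from a learned policy rather than a fixed rule.

```python
def milp_to_bipartite(c, A, b):
    """Encode min c^T x subject to A x <= b as a bipartite graph.

    Variable nodes carry objective coefficients, constraint nodes carry
    right-hand sides, and an edge (i, j, A[i][j]) exists for every nonzero
    coefficient of variable j in constraint i.
    """
    variable_nodes = [{"obj": cj} for cj in c]
    constraint_nodes = [{"rhs": bi} for bi in b]
    edges = [
        (i, j, coeff)
        for i, row in enumerate(A)
        for j, coeff in enumerate(row)
        if coeff != 0
    ]
    return variable_nodes, constraint_nodes, edges

def drop_edge(edges, index):
    """Toy augmentation (hypothetical): remove one edge, which amounts to
    zeroing one constraint coefficient in the perturbed instance."""
    return edges[:index] + edges[index + 1:]

# A tiny instance with three variables and two constraints.
c = [1.0, 2.0, 0.5]
A = [[1.0, 0.0, 3.0],
     [0.0, 2.0, 1.0]]
b = [4.0, 5.0]
variables, constraints, edges = milp_to_bipartite(c, A, b)
augmented = drop_edge(edges, 0)
```

Each structural edit of this kind is a discrete choice, so it cannot be differentiated through directly. That is consistent with the abstract's framing of augmentation as a contextual bandit problem, where the policy picks an edit and is rewarded according to how much the perturbed instance challenges the solver.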

2306.10356 2026-05-29 cs.LG cs.AI eess.SP

MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting

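Matteo Tortora, Francesco Conte, Gianluca Natrella, Paolo Soda

AI summary: Proposes MATNet, a multimodal architecture based on a multi-level fusion Transformer that combines historical PV data with weather data through multi-level joint fusion and soft attention, significantly outperforming baselines in multi-step day-ahead PV generation forecasting (RMSE 0.0445, a relative improvement of about 65%) and showing robustness to missing data and cross-site zero-shot generalization.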

Abstract

Accurate forecasting of renewable generation is crucial to facilitate the integration of Renewable Energy Sources into the power system. Focusing on photovoltaic (PV) units, forecasting methods can be divided into two main categories: physics-based and data-based strategies, with Artificial Intelligence (AI)-based models providing state-of-the-art performance. However, while these AI-based models can capture complex patterns and relationships in the data, they ignore the underlying physical prior knowledge of the phenomenon. Therefore, in this paper, we propose MATNet, a novel transformer-based multimodal architecture for multi-step day-ahead PV power generation forecasting. The model is fed with historical PV data and historical and forecast weather data through a multi-level joint fusion approach, employing a soft-attention mechanism at multiple fusion stages. We evaluate the effectiveness of MATNet on the Ausgrid benchmark dataset, where it significantly outperforms various baseline models, achieving an RMSE of 0.0445, corresponding to a relative improvement of approximately 65% compared to the best-performing baseline method. The analysis is further enriched by a comprehensive set of ablation studies, a sensitivity analysis on missing data, which highlights MATNet's resilience to input degradation, a cross-site zero-shot generalization evaluation on five external PV datasets, demonstrating MATNet's robustness under significant domain shifts, and an assessment of the model's computational complexity, confirming its favorable balance between predictive accuracy and computational efficiency. These results highlight MATNet's potential as a reliable and efficient solution to facilitate the integration of PV energy into the power grid. The code is available at https://github.com/arco-group/MATNet.
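The two reported figures, an RMSE of 0.0445 and a relative improvement of roughly 65%, can be related with a short calculation. This sketch assumes relative improvement means (baseline RMSE - model RMSE) / baseline RMSE, which the abstract does not state explicitly; the implied baseline value is therefore a back-of-the-envelope estimate under that assumption, not a number taken from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_improvement(baseline_rmse, model_rmse):
    """Fractional error reduction relative to the baseline (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse

model_rmse = 0.0445
# Under the assumed definition, a 65% improvement implies a baseline near 0.127.
implied_baseline = model_rmse / (1 - 0.65)
check = relative_improvement(implied_baseline, model_rmse)
```

Read this way, the best baseline's error would be a little under three times MATNet's, on whatever normalized scale the reported RMSE uses.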

2605.30324 2026-05-29 cs.DS cs.AI cs.CL cs.LG stat.ML

On Language Generation in the Limit with Bounded Memory

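Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

AI summary: Studies language generation in the limit under bounded memory, using combinatorial bounds and sliding windows to analyze how memory constraints affect generability, density, and identification.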

Comments The abstract has been shortened to fit within the arXiv limit

Abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.
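For readers unfamiliar with Sperner's theorem, which the abstract cites as an ingredient of the density bound: it states that the largest family of subsets of an $n$-element set in which no set contains another has size $\binom{n}{\lfloor n/2 \rfloor}$. The brute-force check below confirms this for small $n$. It illustrates the theorem only; how the paper derives its density bound from it is not described in the abstract, and the function names are assumptions.

```python
from itertools import combinations
from math import comb

def is_antichain(family):
    """True if no set in the family is a proper subset of another."""
    return not any(a < b or b < a for a, b in combinations(family, 2))

def largest_antichain_size(n):
    """Brute-force search over all families of subsets of {0, ..., n-1}.

    Exponential in 2**n, so only practical for n <= 4.
    """
    subsets = [frozenset(s) for r in range(n + 1) for s in combinations(range(n), r)]
    best = 0
    for mask in range(1 << len(subsets)):
        family = [s for i, s in enumerate(subsets) if (mask >> i) & 1]
        if len(family) > best and is_antichain(family):
            best = len(family)
    return best

# Compare the brute-force maximum with the binomial coefficient for n = 1..4.
results = {n: (largest_antichain_size(n), comb(n, n // 2)) for n in range(1, 5)}
```

For each `n`, the two entries of the pair in `results` agree, matching the theorem's claim that the middle layer of the subset lattice is a maximum antichain.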

2605.30319 2026-05-29 stat.ML cs.AI cs.DS cs.LG math.ST stat.TH

Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion

通过矩阵补全改进异质性处理效应估计的保证

Anay Mehrotra, Phuc Tran, Van H. Vu, Manolis Zampetakis

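AI summary: For heterogeneous treatment-effect estimation in panel data, proposes a simple and efficient matrix-completion-based estimator that achieves row-wise $\ell_2$ error $\tilde{O}(\sqrt{1/n + n/m^2})$ under low-rank assumptions, and establishes the first row-wise $\ell_2$ perturbation bound for low-rank approximation.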

AI中文摘要

现代因果推断的一个核心目标是估计异质性处理效应，以回答诸如“干预如何影响每个单元”的问题，而不仅仅是平均效应。我们研究面板数据下的该问题，其中我们观察到$n$个单元在$m$个时间点上的数据，且处理分配未知且非均匀。该设置中的数据自然表示为所有单元-时间处理效应的矩阵。估计异质性处理效应可以表示为对该矩阵中每一行平均值的良好估计。这使我们能够将问题表述为矩阵补全，在自然低秩假设下可解。然而，现有的矩阵补全保证不足以得到估计异质性处理效应所需的每行保证的有意义界；粗略地说，它们仅适用于估计平均处理效应界，正如最近一系列工作所示。我们给出一个简单、计算高效的估计器，在不知道倾向性且标准低秩和正则性假设下，实现行向$\ell_2$误差$\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$。在技术上，我们的分析首次建立了低秩逼近的尖锐行向$\ell_2$扰动界，补充了现有的谱、Frobenius和逐元素扰动理论。

英文摘要

A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like "how does an intervention affect each unit," rather than only on average. We study this problem with panel data, where we observe $n$ units across $m$ time points under unknown, non-uniform treatment assignments. The data in this setting are naturally represented as a matrix of all unit--time treatment effects. Estimating heterogeneous treatment effects can then be expressed as accurately estimating each row's average in this matrix. This allows us to formulate the problem as matrix completion, which can be solved under natural low-rankness assumptions. However, existing matrix-completion guarantees are not powerful enough to give meaningful bounds for the per-row guarantee required for estimating the heterogeneous treatment effect; roughly speaking, they are only useful for estimating average treatment effect bounds, as also illustrated in a recent line of work. We give a simple, computationally efficient estimator that, without knowledge of the propensities and under standard low-rankness and regularity assumptions, achieves a row-wise $\ell_2$ error of $\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$. Technically, our analysis establishes the first sharp row-wise $\ell_2$-perturbation bound for low-rank approximation, complementing existing spectral, Frobenius, and entrywise perturbation theory.
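To make the "row averages of a completed low-rank matrix" formulation concrete, here is a minimal sketch. It is not the paper's estimator: it assumes entries are observed uniformly at random (the paper handles unknown, non-uniform propensities), and the function name and the plain truncated-SVD step are illustrative assumptions.

```python
import numpy as np


def low_rank_row_means(Y, mask, rank):
    """Estimate each row's average from a partially observed matrix.

    Y    : (n, m) array; values outside the mask are ignored
    mask : (n, m) boolean array, True where an entry was observed
    rank : assumed rank of the underlying matrix

    Assumption (not from the paper): uniform observation probability,
    estimated by the observed fraction.
    """
    p_hat = mask.mean()
    # Zero-fill unobserved entries and rescale so the matrix is unbiased.
    filled = np.where(mask, Y, 0.0) / p_hat
    # Keep only the top `rank` singular directions.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    M_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return M_hat.mean(axis=1)
```

With a fully observed rank-1 matrix the estimate reproduces the true row means; the interesting regime, which the paper's row-wise bound addresses, is when many entries are missing.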

2605.30318 2026-05-29 cs.GR cs.AI cs.CV

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

快门之前：3D场景中美学的且可执行的人像摄影规划

Ruixiang Jiang, Chang Wen Chen

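AI summary: Proposes a method for generating portrait pose, camera, lighting, and exposure plans in 3D scenes; it builds a Photographic Scene Graph to enable aesthetic-guided planning, producing portraits that are visually compelling and geometrically and photometrically feasible.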

AI中文摘要

人像摄影在很大程度上是在快门打开之前决定的：主体的姿态、相机配置和照明设备必须在周围的3D场景中协调。相比之下，大多数现有的计算方法侧重于2D图像空间中的后期制作，例如修饰、重新照明或编辑已经存在的图像；捕获前的摄影规划仍然很大程度上未被探索。我们引入了3D美学人像规划，即生成人体姿态、相机、照明和曝光计划的任务，这些计划在满足3D场景中的几何和光度可行性的同时，产生视觉上引人注目的人像。我们的方法构建了一个摄影场景图，该图表示场景可供性、主体-场景关系以及与人像相关的照明结构。基于这种表示，我们对先前的尝试和当前的取景器观察进行美学引导的比较规划。在多样化的室内和室外场景中的实验表明，我们的方法生成的人像比竞争基线更受人类评分者和MLLM评估者的青睐，同时保持高物理合理性。总之，我们的结果指明了从捕获后校正走向捕获前计算人像规划的道路。项目仓库：https://github.com/songrise/Before-the-Shutter

英文摘要

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

2605.29169 2026-05-29 cs.CR cs.AI

Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices

积分格与模格中进化筛法的领域信息表示

Ahmad Tashfeen, Qi Cheng

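AI summary: For the shortest vector problem (SVP) in lattice-based cryptography, turns the sieve of Ajtai et al. into a genetic algorithm by introducing a domain-informed representation and crossover operations, and extends it naturally to module lattices.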

Comments Published (16 pages) in the proceedings of EvoApplications 2026. You may find the proceedings version here at https://link.springer.com/chapter/10.1007/978-3-032-23604-3_9

2605.28746 2026-05-29 math.OC cs.AI cs.NE

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

偏好形状的期望超体积和R2改进：精确计算与单调性

Michael T. M. Emmerich

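AI summary: Studies preference-shaped expected improvement criteria in Bayesian multiobjective optimization, computes the expected improvement of the hypervolume and R2 indicators exactly, and analyzes their monotonicity and geometric properties.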

Comments 17 pages; Changes in v1: added strict Pareto compliance proof, removed missing figure references and redundant graphics section, added Liang et al. 2026 citation in outlook, improved figures and language

AI中文摘要

本文研究了贝叶斯多目标优化中偏好形状的期望改进准则。我们考虑了两个常用于类似算法目的但几何性质不同的指标族。超体积指标基于一个反乌托邦参考点，测量目标空间中的支配体积。R2指标基于一个乌托邦点，通过加权Tchebycheff标量化包络评估近似集。本文的目的是明确哪些偏好变换保留了精确计算、Pareto兼容性和单调性，哪些变换改变了底层几何。在超体积方面，我们通过Deng表示重新审视了经典的EHVI，在期望坐标中制定了乘积密度加权的EHVI，讨论了基于锥的EHVI作为线性锥变换后的普通EHVI，并将这些情况与截断EHVI区分开来，后者可能违反方差单调性。在R2方面，我们证明精确积分R2改进通常不是普通的目标空间加权超体积。障碍是低维的：Lebesgue密度超体积无法看到Tchebycheff标量化仍能检测到的某些边界贡献。然后我们证明精确积分R2改进恰好是一个标量化空间体积，即当前标量化包络与参考包络之间的Tchebycheff阴影的测度。该表示产生了离散R2的有限和ER2I算法、精确积分R2的求积方法，以及一个成就空间高斯代理公式，其中ER2I是标量高斯期望改进的积分。

英文摘要

This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator families which are often used for similar algorithmic purposes, but which are geometrically different. The hypervolume indicator is based on a dystopian reference point and measures dominated volume in objective space. The R2 indicator is based on a utopian point and evaluates approximation sets through weighted Tchebycheff scalarization envelopes. The purpose of the paper is to make precise which preference transformations preserve exact computation, Pareto compatibility, and monotonicity properties, and which transformations change the underlying geometry. On the hypervolume side, we revisit canonical EHVI through the Deng representation, formulate product-density weighted EHVI in desirability coordinates, discuss cone-based EHVI as ordinary EHVI after a linear cone transformation, and separate these cases from truncated EHVI, where variance monotonicity may fail. On the R2 side, we prove that exact integral R2 improvement is not, in general, an ordinary objective-space weighted hypervolume. The obstruction is lower-dimensional: Lebesgue-density hypervolume cannot see certain boundary contributions that Tchebycheff scalarizations still detect. We then show that exact integral R2 improvement is exactly a scalarization-space volume, namely the measure of the Tchebycheff shadow between the incumbent scalarization envelope and the reference envelope. This representation yields finite-sum ER2I algorithms for discrete R2, quadrature methods for exact integral R2, and an achievement-space Gaussian surrogate formulation in which ER2I is an integral of scalar Gaussian expected improvements.
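The building blocks named in the abstract can be sketched in their textbook forms. The code below assumes minimization, restricts the hypervolume to two objectives, and uses a finite weight set for R2; it does not implement the paper's exact EHVI or ER2I algorithms, and all function names are illustrative.

```python
import math


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by reference point `ref` (minimization)."""
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    volume, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:  # point adds a new strip of dominated area
            volume += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return volume


def tchebycheff(f, w, z):
    """Weighted Tchebycheff scalarization of f relative to utopian point z."""
    return max(wi * abs(fi - zi) for fi, wi, zi in zip(f, w, z))


def r2_indicator(points, weights, z):
    """Discrete R2: average over weights of the best scalarized value."""
    return sum(min(tchebycheff(p, w, z) for p in points) for w in weights) / len(weights)


def expected_improvement(mu, sigma, best):
    """Scalar Gaussian expected improvement below `best` (minimization)."""
    if sigma <= 0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (best - mu) * cdf + sigma * pdf
```

The hypervolume measures area relative to a dominated reference point, while R2 measures scalarized distance from a utopian point, which is the geometric difference the abstract emphasizes.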

2605.24244 2026-05-29 stat.ML cs.LG

MEDAL: Manifold Embedding Distillation via Autoencoder Learning

MEDAL: 通过自编码器学习的流形嵌入蒸馏

Irene Chang, Tarek M. Zikry, Genevera I. Allen

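AI summary: Proposes the MEDAL framework, which distills a manifold embedding into a reusable encoder-decoder model via a constrained autoencoder, enabling held-out validation, hyperparameter selection, and distribution-shift detection.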

AI中文摘要

低维嵌入被广泛用作高维数据的视觉摘要，并支持下游科学发现。然而，流行的非线性降维方法（如t-SNE和UMAP）通常仅根据视觉吸引力选择，缺乏严格的定量验证。主要原因是流形嵌入通常不提供样本外映射或返回原始特征空间的逆映射；这使得留出验证（监督学习的黄金标准）几乎不可能。为了解决这些挑战，我们开发了一个新颖的框架MEDAL（通过自编码器学习的流形嵌入蒸馏），它将拟合的流形嵌入蒸馏为可复用的编码器-解码器模型。MEDAL训练一个约束自编码器，其瓶颈精确匹配任何教师嵌入，而解码器重建原始输入；这为新样本提供了显式映射、近似逆映射以及流形空间中基于逐点重建的失真度量。这将静态流形嵌入转换为可在留出数据上评估的模型，从而实现定量验证，包括比较不同降维方法以及超参数调优。在多个基准和科学案例研究中，我们展示了MEDAL能够通过留出验证确定最优流形嵌入和超参数，揭示难以在二维嵌入中保留的生物相干区域，并在新样本映射到固定参考流形时检测分布偏移。MEDAL为任何现有降维技术提供了一个通用验证包装器，将提高科学工作流中降维的严谨性和可靠性。

英文摘要

Low-dimensional embeddings are widely used as visual summaries of high-dimensional data and to enable downstream scientific discoveries. Yet, popular nonlinear dimension reduction methods, such as t-SNE and UMAP, are often selected based on visual appeal alone and without rigorous quantitative validation. A major reason is that manifold embeddings typically provide neither an out-of-sample map nor an inverse back to the original feature space; this makes held-out validation, the gold standard in supervised learning, all but impossible. To address these challenges, we develop a novel framework, MEDAL (Manifold Embedding Distillation via Autoencoder Learning), which distills a fitted manifold embedding into a reusable encoder--decoder model. MEDAL trains a constrained autoencoder whose bottleneck exactly matches any teacher embedding while the decoder reconstructs the original input; this yields an explicit map for new samples, an approximate inverse, and a pointwise reconstruction-based measure of distortion in the manifold space. This converts static manifold embeddings into models that can be evaluated on held-out data, enabling quantitative validation, including comparison of different dimension reduction methods and hyperparameter tuning. Across multiple benchmark and scientific case studies, we show that MEDAL enables held-out validation to determine optimal manifold embeddings and hyperparameters, reveals biologically coherent regions that are difficult to preserve in two-dimensional embeddings, and detects distribution shift when new samples are mapped into a fixed reference manifold. MEDAL provides a general validation wrapper for any existing dimension reduction technique, improving the rigor and reliability of dimension reduction in scientific workflows.
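The distillation idea (fit an encoder to reproduce a teacher embedding, fit a decoder to reconstruct the input, then score points by reconstruction error) can be illustrated with a deliberately simplified linear stand-in. This is an assumption-laden sketch, not MEDAL itself: the paper trains a constrained neural autoencoder, whereas the hypothetical functions below use ordinary least squares.

```python
import numpy as np


def _with_bias(M):
    """Append a constant column so the linear maps include an intercept."""
    return np.hstack([M, np.ones((M.shape[0], 1))])


def fit_linear_distillation(X, Z):
    """Fit encoder X -> Z (teacher embedding) and decoder Z -> X by least squares."""
    W_enc, *_ = np.linalg.lstsq(_with_bias(X), Z, rcond=None)
    W_dec, *_ = np.linalg.lstsq(_with_bias(Z), X, rcond=None)
    return W_enc, W_dec


def encode(X, W_enc):
    """Out-of-sample map: embed new rows of X with the fitted encoder."""
    return _with_bias(X) @ W_enc


def reconstruction_error(X, W_enc, W_dec):
    """Pointwise distortion: distance between each input and its reconstruction."""
    X_hat = _with_bias(encode(X, W_enc)) @ W_dec
    return np.linalg.norm(X - X_hat, axis=1)
```

Because the fitted maps apply to unseen rows, the reconstruction error can be computed on held-out data, which is what makes held-out validation of an embedding possible.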

2604.04956 2026-05-29 physics.soc-ph cs.AI cs.CY physics.pop-ph

The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

人工智能加速的行星成本，第二部分：第十个行星边界与6.5年倒计时

William Yicheng Zhu, Lei Zhu

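AI summary: Argues that the exponential scaling of large language models (LLMs) gives "thinking" itself thermodynamic consequences, projects that planetary heat thresholds will be breached within 6.5 years, and proposes that AI heat emissions constitute a tenth planetary boundary.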

Comments Minor revisions for clarity

AI中文摘要

近期，自主大型语言模型（LLM）代理的超指数级扩展标志着更广泛、根本性的范式转变：从机器主要替代人类双手（体力劳动和机械加工）转向机器代表人类思维（认知、推理和意图）。超出人类有限但高效的生物能力，“思考”本身不受控制的卸载和扩展对人类的热平衡表产生深远影响，因为思考或智能具有热力学后果。地球已经超过了长期生态稳定性所需的热耗散阈值，基于经验数据的预测揭示了一条令人担忧的轨迹：如果没有激进的结构性干预，即使在最理想的情况下（地球能量不平衡（EEI）保持恒定），人为热积累将在不到6.5年内突破关键的行星生态阈值。在这项工作中，我们确定了人工智能中影响全球热耗散率的六个因素，并描述了它们如何相互作用推动社会走向四种宏观轨迹之一。我们提出，人工智能及其热耗散融入行星系统构成了第十个行星边界（9+1）。该边界的核心经验测量是由人工智能指数增长产生的净新增废热，平衡其对减少经济和社会低效率以及因此减少基线人为废热排放的影响。我们证明，管理人工智能扩展缺乏适度的中间地带：它将要么加速关键行星热力学阈值的突破，要么成为稳定其他九个行星边界的最有效杠杆，从而保障人类文明的生存。

英文摘要

The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from machines primarily replacing human hands (manual labor and mechanical processing) to machines acting on behalf of human minds (cognition, reasoning, and intention). The uncontrolled offloading and scaling of "thinking" itself, beyond humans' limited but efficient biological capacity, has profound consequences for humanity's heat balance sheet, since thinking, or intelligence, carries thermodynamic consequences. The Earth has already surpassed the heat dissipation threshold required for long-term ecological stability, and projections based on empirical data reveal a concerning trajectory: without radical structural intervention, anthropogenic heat accumulation will breach critical planetary ecological thresholds in less than 6.5 years, even under the most ideal scenario where Earth Energy Imbalance (EEI) holds constant. In this work, we identify six factors from artificial intelligence that influence the global heat dissipation rate and delineate how their interplay drives society toward one of four broad macroscopic trajectories. We propose that the integration of artificial intelligence and its heat dissipation into the planetary system constitutes the tenth planetary boundary (9+1). The core empirical measurement of this boundary is the net-new waste heat generated by exponential AI growth, balanced against its impact on reducing economic and societal inefficiencies and thus baseline anthropogenic waste heat emissions. We demonstrate that managing AI scaling lacks a moderate middle ground: it will either accelerate the breach of critical planetary thermodynamic thresholds, or it will serve as the single most effective lever for stabilizing the other nine planetary boundaries and thereby safeguarding human civilization's survival.

2510.08535 2026-05-29 stat.ML cs.LG math.PR

Permutation-Invariant Spectral Learning via Dyson Diffusion

通过戴森扩散的置换不变谱学习

Tassilo Schwarz, Cai Dieball, Constantin Kogler, Renaud Lambiotte, Arnaud Doucet, Aljaž Godec, George Deligiannidis

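AI summary: Proposes the Dyson Diffusion Model, which uses random matrix theory to analytically extract the spectral properties of the diffusion process and shifts inductive bias from the architecture into the dynamics, yielding permutation-invariant spectral learning that learns graph spectra accurately and outperforms existing graph diffusion models.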

AI中文摘要

扩散模型是生成建模的核心，并已通过扩散邻接矩阵表示适应于图。对于具有$n$个节点的图，存在多达$n!$个这样的表示，这一挑战仅通过使用置换等变学习架构得到部分缓解。尽管计算效率高，现有的图扩散模型难以区分某些图族及其谱，除非图数据被增强以特定的特征。这一缺陷源于在学习架构中强制执行归纳偏置。在这项工作中，我们利用随机矩阵理论从分析上提取扩散过程的谱特性，从而将大部分归纳偏置从架构推入动力学。在此基础上，我们引入了戴森扩散模型，该模型采用戴森布朗运动来捕捉邻接矩阵上Ornstein-Uhlenbeck过程的谱动力学。此外，以谱动力学为条件，我们制定了一个李群扩散，适当地建模剩余的自由度。引人注目的是，由此产生的学习问题在李代数层面上变为置换不变的。我们证明，戴森扩散模型能够准确学习图谱，并优于现有的图扩散模型。

英文摘要

Diffusion models are central to generative modeling and have been adapted to graphs by diffusing adjacency matrix representations. The challenge of having up to $n!$ such representations for graphs with $n$ nodes is only partially mitigated by using permutation-equivariant learning architectures. Despite their computational efficiency, existing graph diffusion models struggle to distinguish certain graph families and their spectra, unless graph data are augmented with ad hoc features. This shortcoming stems from enforcing the inductive bias within the learning architecture. In this work, we leverage random matrix theory to analytically extract the spectral properties of the diffusion process, allowing us to push most of the inductive bias from the architecture into the dynamics. Building on this, we introduce the Dyson Diffusion Model, which employs Dyson's Brownian motion to capture the spectral dynamics of an Ornstein-Uhlenbeck process on the adjacency matrix. Furthermore, conditioned on the spectral dynamics, we formulate a Lie group diffusion, appropriately modeling the remaining degrees of freedom. Strikingly, the resulting learning problem becomes permutation invariant at the Lie algebra level. We demonstrate that the Dyson Diffusion Model learns graph spectra accurately and outperforms existing graph diffusion models.
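The spectral dynamics the abstract refers to can be observed numerically by running an Ornstein-Uhlenbeck process on a symmetric matrix and tracking its eigenvalues, whose evolution random matrix theory describes with a Dyson-type stochastic differential equation. The sketch below is a brute-force simulation under assumed conventions (unit mean reversion, Euler-Maruyama steps, a particular noise scaling); it is not the Dyson Diffusion Model and the function name is hypothetical.

```python
import numpy as np


def ou_spectrum_path(A0, steps=100, dt=0.01, seed=0):
    """Simulate dA = -A dt + sqrt(2) dW on symmetric matrices; return eigenvalue paths.

    Returns an array of shape (steps + 1, n) with the sorted spectrum at each step.
    The drift and noise scaling are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    A = np.array(A0, dtype=float)
    n = A.shape[0]
    spectra = [np.linalg.eigvalsh(A)]
    for _ in range(steps):
        G = rng.standard_normal((n, n))
        noise = (G + G.T) / np.sqrt(2.0)  # symmetric Gaussian increment
        A = A - A * dt + np.sqrt(2.0 * dt) * noise
        spectra.append(np.linalg.eigvalsh(A))
    return np.array(spectra)
```

The eigenvalues are invariant under simultaneous row and column permutations of the adjacency matrix, which is why modeling the spectrum directly sidesteps the $n!$ equivalent representations.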

2605.30273 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

LLUMI: 利用在线社区反馈改进心理健康支持中的LLM写作辅助

Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha

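AI summary: Proposes the LLUMI framework, which builds preference pairs from online community feedback (such as Reddit votes) and trains small open-source models with supervised fine-tuning and direct preference optimization, achieving mental health support performance comparable to GPT models while preserving privacy.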

AI中文摘要

大型语言模型在生成心理健康问题的支持性回复方面展现出潜力，但提升其有用性、共情能力和安全性通常需要大量计算、专家输入和标注数据。同时，在心理健康相关交互中部署专有云模型会引发重要的隐私和数据治理问题。为解决这一挑战，我们提出了LLUMI设置，该设置可在受保护环境内部署。LLUMI包含两个互补组件：生成模型（GM）起草对心理健康问题的支持性回复，以及改进模型（IM）修改初始人工编写的回复。我们利用Reddit心理健康社区的反馈信号，使用社区认可模式（如点赞和点踩）构建用于监督微调和直接偏好优化的选择-拒绝回复对。我们还通过五个维度（可读性、共情、连接、可操作性和安全性）的人工评估进一步对齐LLUMI。结果表明，尽管依赖较小的开源模型而非专有云GPT模型，LLUMI在语言分析和人工评估中均实现了相当的性能。这些发现表明，使用社区衍生的偏好信号训练的开源模型可以支持高质量的心理健康支持辅助，同时为敏感的支持场景提供更保护隐私的替代方案。

英文摘要

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivity of these interactions. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.
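A rough sketch of how vote counts could be turned into chosen-rejected pairs follows. The record fields, the `min_gap` threshold, and the pairing rule are all assumptions made for illustration; the abstract does not specify how LLUMI filters or pairs replies.

```python
from itertools import combinations


def build_preference_pairs(prompt, replies, min_gap=5):
    """Pair replies to one post so the higher-scored reply is 'chosen'.

    replies : list of dicts with hypothetical keys "text" and "score"
              (score = upvotes minus downvotes)
    min_gap : hypothetical minimum score difference for a pair to count
              as a clear community preference
    """
    pairs = []
    for a, b in combinations(replies, 2):
        if abs(a["score"] - b["score"]) < min_gap:
            continue  # preference too weak to trust
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append(
            {"prompt": prompt, "chosen": chosen["text"], "rejected": rejected["text"]}
        )
    return pairs
```

Records in this prompt/chosen/rejected shape are the usual input to preference-optimization training, while the chosen replies alone could serve as supervised fine-tuning targets.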

2605.30227 2026-05-29 cs.MA cs.AI

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

统一基于LLM的多智能体提示优化中的时间与结构信用分配

Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin, Wenhao Li

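AI summary: Decomposes error signals into temporal credit (state-space bottlenecks identify critical rounds) and structural credit (stationary role policies isolate agent contributions), and uses a discrete, verbalized block coordinate descent algorithm to iteratively optimize role prompts and aggregation protocols, reducing query complexity while improving performance.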

Comments 15 pages, 4 figures, 6 tables

AI中文摘要

虽然多智能体系统（MAS）通过协作交互使大型语言模型能够处理复杂推理任务，但由于计算图的离散、不可微性质以及全局监督信号的稀疏性，优化其动态仍然是一个严峻的挑战。现有的黑盒优化器难以将轨迹级别的失败归因于特定的局部组件，导致低效、高方差的探索。我们认为，可处理的MAS优化需要结构归纳偏差来解开误差信号。我们提出了时间和结构信用分配，它沿着两个轴分解目标：（i）时间信用，使用状态空间瓶颈识别关键轮次；（ii）结构信用，使用固定角色策略隔离智能体贡献。利用这些分解后的信号，我们引入了一种离散的、言语化的块坐标下降算法用于迭代优化。它不是不加区分的全局更新，而是在优化角色提示和聚合协议之间交替，使用LLM生成的“代理梯度”仅针对识别出的薄弱环节。在多种推理基准测试中，我们的方法在提高性能的同时显著降低了查询复杂度，为自我改进的MAS提供了一条有原则且可解释的路径。

英文摘要

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non-differentiable nature of the computation graph and the sparsity of global supervisory signals. Existing black-box optimizers struggle to attribute trajectory-level failure to specific local components, resulting in inefficient, high-variance exploration. We argue that tractable MAS optimization needs structural inductive biases to disentangle error signals. We propose temporal and structural credit assignment, which decomposes the objective along two axes: (i) temporal credit, using state-space bottlenecks to identify critical rounds, and (ii) structural credit, using stationary role policies to isolate agent contributions. Leveraging these decomposed signals, we introduce a discrete, verbalized block coordinate descent algorithm for iterative refinement. Rather than indiscriminate global updates, it alternates between optimizing role prompts and aggregation protocols, using LLM-generated "proxy gradients" to target only the identified weak links. Across diverse reasoning benchmarks, our approach substantially reduces query complexity while improving performance, providing a principled and interpretable path toward self-improving MAS.

2605.30195 2026-05-29 cond-mat.mtrl-sci cs.AI cs.LG

What drives performance in molecular MPNNs? An operator-level factorial benchmark

分子MPNN性能驱动因素：算子级因子基准测试

Panyu Jiao, Shuizhou Chen, Yiheng Shen, Yuyang Wang, Runhai Ouyang, Wei Xie

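AI summary: Decomposes molecular MPNNs into three operator families (message-seed initialization, node-edge fusion, and node update), benchmarks 84 configurations on MoleculeNet datasets, finds that message construction rather than update complexity dominates performance, and derives design heuristics.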

AI中文摘要

消息传递神经网络（MPNN）广泛用于分子性质预测，但其作为整体架构部署使得难以识别特定消息传递算子如何影响性能。我们提出了一个算子级因子基准测试，将二维分子MPNN分解为消息种子初始化、节点-边融合和节点更新算子三个家族。在共享实验设置和统计分析协议下，对十个MoleculeNet数据集上的84种配置进行了基准测试。在这个受控设计中，性能变化主要与消息构建相关，而非更新复杂度。消息种子初始化在回归和分类任务中均显示出显著的家族级效应；节点-边融合在回归任务中显示出显著的家族级效应，且基于拼接的混合具有描述性优势；更新家族在任一任务家族中均未显示出统计上支持的效应。对Quinethazone分子的表征探测进一步表明，与Hadamard门控相比，基于拼接的混合能更好地区分化学上不同的杂原子并抵抗过度平滑。分别针对分类和回归任务选择的代表性配置相对于已建立的分子图神经网络（GNN）基线恢复了竞争性性能，在十个基准数据集中有八个数值上排名最佳。这些实证结果通过对代表性节点-边融合和更新算子的简洁机理分析进行了解释。我们的发现通过将模型设计从搜索整体架构转变为针对化学信息在消息传递管道中进入位置和方式的定向评估，为分子MPNN提供了实证设计启发式方法。

英文摘要

Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures makes it difficult to identify how specific message-passing operators affect performance. We present an operator-level factorial benchmark that decomposes 2D molecular MPNNs into the three families of message-seed initialization, node-edge fusion, and node update operators. The resulting 84 configurations are benchmarked on ten MoleculeNet datasets under a shared experimental setup and statistical analysis protocol. Across this controlled design, performance variation is associated primarily with message construction rather than update complexity. Message-seed initialization shows significant family-level effects for both regression and classification, node-edge fusion shows a significant family-level effect for regression with descriptive advantages for concatenation-based mixing, and the update family shows no statistically supported effect for either endpoint family. A representation probe into the Quinethazone molecule further demonstrates that concatenation-based mixing can better differentiate chemically distinct heteroatoms and withstand oversmoothing than Hadamard gating. Representative configurations selected separately for classification and regression recover competitive performance relative to established molecular graph neural network (GNN) baselines, ranking numerically best on eight of ten benchmark datasets. These empirical results are interpreted through concise mechanistic analyses of representative node-edge fusion and update operators. Our findings provide empirical design heuristics for molecular MPNNs by turning model design from a search over monolithic architectures into a targeted assessment of where and how chemical information enters the message-passing pipeline.

2605.30189 2026-05-29 cs.CR cs.AI cs.CL cs.LG

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

LoRA适配器后门中的令牌级泛化：攻击表征与行为检测

Travis Lelle

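AI summary: Implants backdoors in LoRA adapters through data poisoning, finds that the backdoor generalizes at the token feature level rather than the structural pattern level, and proposes two detection methods based on behavioral statistics and weight statistics.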

Comments 45 pages, 27 tables. Code and evaluation data: https://github.com/Travis-ML/lora-backdoors. Trained adapter weights available on request

AI中文摘要

我们表明，LoRA适配器（微调LLM的主要分发格式）可以通过训练数据投毒可靠地植入后门，同时保持基线任务性能。在Qwen 2.5 1.5B提示注入分类器上，一小部分中毒样本即可驱动一个保持干净精度的后门达到饱和。由此产生的后门在令牌特征层面而非结构模式层面泛化：在一个RFC引用上训练的模型会在任何RFC引用上激活，但不会迁移到结构相同的ISO、OWASP、CWE或NIST引用上。这种不对称性有利于攻击者，因为防御者无法通用地探测“结构化引用”。我们跨基础模型规模与系列、LoRA秩和触发字符串表征了该攻击，并针对多种子适配器队列评估了两种互补的检测路径。一个由两个探测电池统计量（outlier_gap和mean_attack_rate）构建的行为检测器，在探测电池与触发器的令牌邻域重叠时完美区分中毒适配器和干净适配器，在不重叠时以零假阳性实现高召回率。一个权重级统计量——维度归一化Frobenius范数的跨模块标准差——也能在不运行模型的情况下完美区分队列。两者结合对探测组成具有鲁棒性。因果修补将后门定位到中后层的MLP块，其中down_proj是最强的单投影原因。跨规模、系列和秩的重复实验表明，行为检测器无需重新调整即可迁移，而权重级检测器则需针对基础模型进行校准。攻击随秩单调扩展，且选择的触发锚点令牌既依赖于触发也依赖于基础模型。行为检测是适配器供应链扫描中操作上可移植的结果。

英文摘要

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
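The weight-level statistic, "the cross-module standard deviation of dimension-normalized Frobenius norms", could be computed along the following lines. The abstract does not define the normalization, so dividing by the square root of the number of entries is an assumption here, as are the function name and the input layout.

```python
import numpy as np


def cross_module_norm_std(modules):
    """Standard deviation, across modules, of normalized LoRA update norms.

    modules : dict mapping a module name to its LoRA factors (A, B),
              where the weight update is B @ A.
    Assumption: "dimension-normalized" means Frobenius norm divided by
    sqrt(number of entries), i.e. the RMS entry of the update.
    """
    norms = []
    for A, B in modules.values():
        delta = B @ A
        norms.append(np.linalg.norm(delta) / np.sqrt(delta.size))
    return float(np.std(norms))
```

A value near zero means the update magnitude is spread evenly across modules, while a large value means some modules changed far more than others. Any decision threshold would need calibrating per base model, consistent with the abstract's remark that this detector is calibration-bound.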

2605.30175 2026-05-29 astro-ph.HE cs.LG stat.ML

A new completely parameter-free clustering algorithm for unsupervised classification of BATSE gamma-ray bursts

一种用于BATSE伽马射线暴无监督分类的全新无参数聚类算法

Soumita Modak

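AI summary: Proposes a completely parameter-free clustering algorithm to classify the BATSE gamma-ray burst sample; the result supports two groups (short and long bursts), consistent with the merger-collapsar theory.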

AI中文摘要

聚类分析是一种广泛应用的机器学习技术，用于理解伽马射线暴（GRB）群体中存在的模式，以探索其物理来源。目前，尽管采用了最先进的聚类程序进行了多次尝试，但对应可区分群组的聚类数量仍存在争议。这一关键未知参数需要通过直接或间接方式（以其他调优参数的形式）评估，以便通过实施合适的聚类算法在GRB中产生聚类。虽然大多数应用的算法得出了两个物理上可解释的群组（分别以短暴和长暴为主的合并与坍缩星），但其他统计方法违反了这种二元划分。然而，任何额外聚类的物理建立尚未得到确认。因此，我们提出一种新算法，来自一种称为“完全无参数”的不同聚类流派，它以迄今未尝试过的方式对GRB进行分类。该算法从BATSE样本中指示出两个主要群组，即短持续时间和长持续时间爆发，与合并-坍缩星理论兼容。

英文摘要

Cluster analysis is a widely applied machine learning technique for understanding the patterns in the population of gamma-ray bursts (GRBs) in order to explore their physical sources. At present, the number of clusters corresponding to distinguishable groups is still disputed, in spite of numerous attempts with state-of-the-art clustering procedures. This crucial unknown parameter needs to be evaluated, either directly or indirectly in terms of other tuning parameters, to produce clusters of GRBs with an appropriate clustering algorithm. While most of the applied algorithms arrived at two physically explained groups, mergers and collapsars, dominated by short and long bursts respectively, other statistical approaches contradicted this binary partition. However, the physical basis of any additional cluster(s) is not yet confirmed. Therefore, we propose a new algorithm, from a different stream of clustering referred to as `completely parameter-free', which classifies GRBs in a manner that has not been tried so far. It indicates two main groups, of short- and long-duration bursts, in the BATSE sample, compatible with the merger-collapsar theory.

2605.30170 2026-05-29 cs.MM cs.CV cs.LG

Unveiling the Visual Counting Bottleneck in Vision-Language Models

揭示视觉语言模型中的视觉计数瓶颈

Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan

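AI summary: By decomposing visual counting into three cognitive stages, finds that vision-language models fail at the symbolic mapping stage and proposes the fractured magnitude hypothesis: models learn disjoint, modality-specific statistical manifolds and cannot achieve cross-modal grounding.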

Comments ICML 2026

AI中文摘要

尽管大型视觉语言模型（VLM）在插值任务上表现出色，但在系统泛化方面，尤其是视觉计数任务中，会遭遇灾难性失败。本文通过将视觉计数分解为三个认知阶段：视觉个体化、数量感知和符号映射，来研究这一外推瓶颈。利用合成围棋棋盘和线性探针，我们证明视觉骨干网络在进入外推区域后仍能保持稳健、线性可分离的数量表示，排除了感知失败的可能性。此外，模型保留了潜在的数量感知能力，能够成功对无法枚举的数量进行比较推理。我们将崩溃定位在符号映射阶段，即模型无法将有效的视觉数量投影到符号标记上。我们的发现支持断裂数量假说：VLM未能获得通用数字空间，而是学习了不相交的、模态特定的统计流形，这阻止了对未见数量的跨模态对齐。在最新基础模型上的验证结果表明，弥合这一差距需要引入强制统一表示的归纳先验，因为仅靠数据扩展是不够的。

英文摘要

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

2605.30167 2026-05-29 stat.ML cs.CV cs.LG stat.AP

Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks

视觉空间学习：使用卷积神经网络的单场空间插值

Daniel Tinoco, Raquel Menezes, Carlos Baquero, Alexandra Silva

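AI summary: Proposes a convolutional neural network (CNN) architecture that learns spatial interpolation directly from a single partially observed field, without external data or prior fields, as an alternative to Kriging.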

Comments 53 pages, 10 figures

AI中文摘要

从稀疏观测中预测完整的空间相关场是空间统计和环境建模中的一个基本挑战。经典的插值方法如克里金法依赖于高斯过程假设和变异函数分析，这可能会限制其在非平稳环境中的有效性，并且需要大量的领域专业知识。在这项工作中，我们利用基于卷积神经网络（CNN）的架构进行空间插值，该架构在单个部分观测场上进行训练和应用，无需访问外部数据或先验场。模型直接在观测位置进行监督，并学习在用户定义的网格上预测未观测点的值。与克里金法不同，我们的方法不需要显式的协方差建模或变异函数估计，并且可以以数据驱动的方式灵活捕捉局部空间模式。这项工作展示了CNN在稀疏监督下进行单实例空间插值的潜力，为经典地统计方法提供了实用的替代方案，并将CNN的应用扩展到新的问题领域。

英文摘要

Predicting a complete spatially correlated field from sparse observations is a fundamental challenge in spatial statistics and environmental modelling. Classical interpolation methods such as Kriging rely on Gaussian process assumptions and variography, which can limit their effectiveness in non-stationary settings and require substantial domain expertise. In this work, we leverage an architecture based on convolutional neural networks (CNNs) for spatial interpolation that is trained and applied on a single partially observed field, without access to external data or prior fields. The model is supervised directly on the observed locations and learns to predict values at unobserved points on the user-defined grid. Unlike Kriging, our method does not require explicit covariance modelling or variogram estimation, and it can flexibly capture local spatial patterns in a data-driven manner. This work demonstrates the potential of CNNs for single-instance spatial interpolation under sparse supervision, offering a practical alternative to classical geostatistical methods, and extending the use of CNNs to a new problem domain.

2605.30153 2026-05-29 stat.ML cs.IT cs.LG math.IT math.ST stat.TH

Diffusion Models Are Statistically Optimal for Learning Low-Dimensional Multi-Modal Distributions

扩散模型在学习低维多模态分布时具有统计最优性

Jingda Wu, Changxiao Cai

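AI summary: Proves that, when learning distributions supported on a union of low-dimensional subspaces, diffusion models have sample complexity that depends only on the intrinsic dimension and attain a near-optimal 1-Wasserstein error rate, without smoothness or bounded-density assumptions.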

Comments accepted to ICML 2026

AI中文摘要

基于分数的扩散模型在学习高维分布，特别是那些具有低维和多模态结构的分布方面，已经展现出显著的实证成功。然而，对其统计效率的理论理解仍然有限。现有理论通常依赖于强正则性假设，例如一致有界密度或全局光滑的分数函数，这些假设无法捕捉此类内在结构。在这项工作中，我们研究了扩散模型在学习支撑在低维子空间并集上的分布时的样本复杂度。假设每个子空间内的数据分布是次高斯的，我们证明扩散模型最多需要$\widetilde{O}(\varepsilon^{-k \vee 2})$个样本即可在1-Wasserstein距离上达到$\varepsilon$误差，其中$k$是内在维度。这一近最优的收敛速率仅依赖于内在维度，并显著改进了先前遭受维度灾难的理论保证。值得注意的是，我们的分析适用于广泛的分布，无需施加光滑性、有界密度或对数凹性假设。总体而言，我们的结果表明，扩散模型能够统计适应内在低维结构，同时自然容纳多模态数据，为其在复杂高维学习任务中的成功提供了严格的理论依据。

英文摘要

Score-based diffusion models have demonstrated remarkable empirical success in learning high-dimensional distributions, particularly those exhibiting low-dimensional and multi-modal structures. However, theoretical understanding of their statistical efficiency remains limited. Existing theories typically rely on strong regularity assumptions, such as uniformly bounded densities or globally smooth score functions, which fail to capture such intrinsic structures. In this work, we study the sample complexity of diffusion models for learning distributions supported on a union of low-dimensional subspaces. Assuming that the data distribution within each subspace is subgaussian, we show that diffusion models require at most $\widetilde{O}(\varepsilon^{-k \vee 2})$ samples to achieve $\varepsilon$ error in 1-Wasserstein distance, where $k$ is the intrinsic dimension. This near-optimal convergence rate depends only on the intrinsic dimension and significantly improves upon prior theoretical guarantees that suffer from the curse of dimensionality. Notably, our analysis applies to a broad collection of distributions without imposing smoothness, bounded-density, or log-concavity assumptions. Overall, our results show that diffusion models can statistically adapt to intrinsic low-dimensional structure while naturally accommodating multi-modal data, offering a rigorous theoretical justification for their success in complex high-dimensional learning tasks.
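To get a feel for the stated rate $\widetilde{O}(\varepsilon^{-k \vee 2})$, where $k \vee 2 = \max(k, 2)$, the small helper below evaluates the leading term while ignoring the constants and logarithmic factors hidden in $\widetilde{O}$. The function name is illustrative, and the numbers are orders of magnitude only, not actual sample counts.

```python
def sample_complexity_order(eps: float, k: int) -> float:
    """Leading term eps^{-max(k, 2)} of the stated sample complexity.

    Constants and log factors hidden by the tilde-O notation are dropped.
    """
    return eps ** (-max(k, 2))
```

For intrinsic dimension 1 or 2 the exponent stays at 2, so the target error $\varepsilon = 0.1$ gives a leading term of about $10^2$. For $k = 5$ it grows to about $10^5$, independent of the ambient dimension.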

2605.30102 2026-05-29 cs.MA cs.AI

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi

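AI summary: This paper systematically studies the design space of hybrid multi-agent systems, which combine small on-device models with large cloud models. It analyzes how different design choices affect the Pareto frontier of power, cost, and performance, and finds that the optimal architecture is highly task-dependent and that frontier-level compute does not always yield better performance.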

Comments 30 pages, 16 figures. Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

Abstract

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.
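The Pareto frontier mentioned in the abstract can be illustrated with a small sketch. It assumes each system configuration is summarized by three numbers: edge energy and monetary cost (lower is better) and task accuracy (higher is better). The `OperatingPoint` type, the configuration names, and all numeric values are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    energy_j: float   # edge energy consumption, lower is better
    cost_usd: float   # monetary cost, lower is better
    accuracy: float   # task accuracy, higher is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.energy_j <= b.energy_j
        and a.cost_usd <= b.cost_usd
        and a.accuracy >= b.accuracy
    )
    strictly_better = (
        a.energy_j < b.energy_j
        or a.cost_usd < b.cost_usd
        or a.accuracy > b.accuracy
    )
    return no_worse and strictly_better


def pareto_frontier(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical operating points, for illustration only.
configs = [
    OperatingPoint("slm_only", energy_j=5.0, cost_usd=0.00, accuracy=0.60),
    OperatingPoint("llm_only", energy_j=1.0, cost_usd=0.10, accuracy=0.85),
    OperatingPoint("hybrid_a", energy_j=3.0, cost_usd=0.03, accuracy=0.80),
    OperatingPoint("hybrid_b", energy_j=4.0, cost_usd=0.05, accuracy=0.75),
]

if __name__ == "__main__":
    for point in pareto_frontier(configs):
        print(point.name)
```

In this toy data `hybrid_b` is dominated by `hybrid_a` on all three axes, so it drops off the frontier, while the other three configurations each represent a different trade-off.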
