arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2503.09675 2026-05-28 cs.CV

Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

Shangwen Zhu, Han Zhang, Zhantao Yang, Qianyu Peng, Zhao Pu, Huangji Wang, Fan Cheng

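AI summary: Proposes LTC-Accel, a training-free acceleration method that estimates the current transition operator from the statistical relationship between transition operators at adjacent steps, speeding up diffusion sampling for text-to-image and text-to-video generation. It is compatible with a wide range of network architectures and existing acceleration techniques.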

Abstract

Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often works only with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship of the outputs from the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel training-free acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Because it makes no specific assumptions about the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a 1.67-fold speedup in Stable Diffusion v2 and a 1.55-fold speedup in video generation models. When combined with distillation models, LTC-Accel achieves a remarkable 10-fold speedup in video generation, allowing real-time generation at more than 16 FPS.

2508.18271 2026-05-28 cs.CV

ObjFiller3D: Scaling 3D Object Inpainting to Dense Multi-View Consistency

Haitang Feng, Xinkai Chen, Jie Liu, Jie Tang, Gangshan Wu, Beiqi Chen, Jianhuang Lai, Guangcong Wang

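AI summary: Proposes ObjFiller-3D, which jointly optimizes densely sampled views through temporal-driven generation, semantic-aware completion, and cycle-consistent 3D encoding to achieve high-quality, multi-view-consistent 3D object inpainting and editing.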

Comments Project page: https://objfiller3d.github.io/ Code: https://github.com/objfiller3d/ObjFiller-3D

Abstract

3D object inpainting is commonly achieved via multi-view 2D image completion, yet independently inpainted views often suffer from cross-view inconsistencies, leading to blurred textures, geometric discontinuities, and visual artifacts in the reconstructed 3D objects. To overcome these limitations, we propose ObjFiller-3D, a novel method for high-quality, consistent 3D object completion and editing. Instead of relying on sparse-view editing or per-view 2D inpainting, our method jointly optimizes a sequence of densely sampled views along a $360^\circ$ trajectory, enabling global coherence across viewpoints. We design a new framework with three complementary components: a Temporal-Driven Generative Encoder for modeling dense-view dependencies, a Semantic-Aware Completion Encoder for object-level inpainting, and a Cycle-Consistent 3D Encoder that enforces global coherence through a closed-loop formulation. Our framework also supports reference-guided 3D inpainting, allowing fine-grained control over appearance. Extensive experiments on diverse datasets demonstrate that ObjFiller-3D significantly outperforms prior methods, achieving higher reconstruction fidelity (PSNR of 26.6 vs. 15.9 for NeRFiller) and perceptual quality (LPIPS of 0.19 vs. 0.25 for Instant3dit), while reducing reconstruction time from over 40 minutes to under 10 minutes. These results highlight the effectiveness and practical potential of our approach for real-world 3D editing applications. Project page: https://objfiller3d.github.io/ Code: https://github.com/objfiller3d/ObjFiller-3D
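
For readers unfamiliar with the fidelity metric quoted above, the following is a minimal sketch of the standard PSNR definition, not code from the paper. The function name `psnr`, the flat-list pixel representation, and the `max_value` default are assumptions made for illustration. The last line shows what a 26.6 dB vs. 15.9 dB gap would mean as a ratio of mean squared errors if both numbers were single-image values; PSNR averaged over a dataset does not map to an MSE ratio this directly.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

# Illustrative only: the MSE ratio implied by a 10.7 dB difference in PSNR.
mse_ratio = 10 ** ((26.6 - 15.9) / 10)
```

Because PSNR is logarithmic, every 10 dB corresponds to a tenfold reduction in mean squared error, which is why a gap of roughly 10 dB is a large difference.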

2508.09449 2026-05-28 cs.CV

RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration

Jiaqi Yan, Shuning Xu, Xiangyu Chen, Dell Zhang, Jiantao Zhou, Jie Tang, Gangshan Wu, Jie Liu

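AI summary: Proposes the Retrieval-Augmented Super Resolution (RASR) paradigm, which automatically retrieves reference images to make reference-based super-resolution practical in real-world settings, and builds the RASR-Flickr30 benchmark dataset and the RASRNet baseline model.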

Comments Accepted at ISCAS 2026

Abstract

Reference-based Super Resolution (RefSR) improves upon Single Image Super Resolution (SISR) by leveraging high-quality reference images to enhance texture fidelity and visual realism. However, a critical limitation of existing RefSR approaches is their reliance on manually curated target-reference image pairs, which severely constrains their practicality in real-world scenarios. To overcome this, we introduce Retrieval-Augmented Super Resolution (RASR), a new and practical RefSR paradigm that automatically retrieves semantically relevant high-resolution images from a reference database given only a low-quality input. This enables scalable and flexible RefSR in realistic use cases, such as enhancing mobile photos taken in environments like zoos or museums, where category-specific reference data (e.g., animals, artworks) can be readily collected or pre-curated. To facilitate research in this direction, we construct RASR-Flickr30, the first benchmark dataset designed for RASR. Unlike prior datasets with fixed target-reference pairs, RASR-Flickr30 provides per-category reference databases to support open-world retrieval. We further propose RASRNet, a strong baseline that combines a semantic reference retriever with a diffusion-based RefSR generator. It retrieves relevant references based on semantic similarity and employs a diffusion-based generator enhanced with semantic conditioning. Experiments on RASR-Flickr30 demonstrate that RASRNet consistently improves over SISR baselines, achieving +0.38 dB PSNR and -0.0131 LPIPS, while generating more realistic textures. These findings highlight retrieval augmentation as a promising direction to bridge the gap between academic RefSR research and real-world applicability.
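
The retrieval step described above (selecting references by semantic similarity) can be sketched as a nearest-neighbor search over embeddings. This is a generic illustration under stated assumptions, not RASRNet's retriever: the cosine-similarity ranking, the function names, the three-dimensional embeddings, and the database keys are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_emb, database, k=2):
    """Return the k database keys whose embeddings are most similar to the query."""
    ranked = sorted(database, key=lambda name: cosine(query_emb, database[name]), reverse=True)
    return ranked[:k]

# Hypothetical per-category reference database with toy embeddings.
zoo_refs = {
    "tiger_01": [0.9, 0.1, 0.0],
    "tiger_02": [0.8, 0.2, 0.1],
    "parrot_01": [0.0, 0.2, 0.9],
}
top = retrieve([1.0, 0.1, 0.0], zoo_refs, k=2)
```

In a real system the embeddings would come from a learned image encoder applied to the low-quality input and to each database image; the ranking logic stays the same.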

2508.06420 2026-05-28 cs.CV

Feature-Space Oversampling for Addressing Class Imbalance in SAR Ship Classification

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

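AI summary: To address the class imbalance caused by long-tailed datasets in SAR ship classification, proposes two feature-space oversampling algorithms, M2m$_f$ and M2m$_u$, based on the Major-to-minor (M2m) method. With ViT, VGG16, and ResNet50 as feature extractors, the average F1-score improves by 8.82% on FuSARShip and 4.44% on OpenSARShip.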

Comments Accepted and presented at IGARSS

Journal ref
IGARSS 2025 - 2025 IEEE International Geoscience and Remote Sensing Symposium, Brisbane, Australia, 2025, pp. 2010-2014

Abstract

SAR ship classification faces the challenge of long-tailed datasets, which complicates the classification of underrepresented classes. Oversampling methods have proven effective in addressing class imbalance in optical data. In this paper, we evaluate the effect of oversampling in the feature space for SAR ship classification. We propose two novel algorithms inspired by the Major-to-minor (M2m) method: M2m$_f$ and M2m$_u$. The algorithms are tested on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models as feature extractors: ViT, VGG16, and ResNet50. We also analyze the impact of oversampling methods on different class sizes. The results demonstrate the effectiveness of our novel methods over the original M2m and baselines, with an average F1-score increase of 8.82% for FuSARShip and 4.44% for OpenSARShip.
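
To make the idea of oversampling in feature space concrete, here is a minimal sketch that synthesizes extra minority-class feature vectors by interpolating between existing ones. This is a SMOTE-style stand-in chosen for brevity; it is not M2m, M2m$_f$, or M2m$_u$, whose details the abstract does not give. The function name, the toy two-dimensional features, and the seed are assumptions.

```python
import random

def oversample_features(minority, target_count, seed=0):
    """Grow a list of minority-class feature vectors to target_count by
    interpolating between randomly chosen pairs (SMOTE-style, not M2m).
    Requires at least two input vectors."""
    rng = random.Random(seed)
    out = [list(v) for v in minority]
    while len(out) < target_count:
        a, b = rng.sample(minority, 2)   # two distinct existing vectors
        t = rng.random()                 # interpolation weight in [0, 1)
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out

# Toy features standing in for embeddings from a frozen extractor.
feats = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
balanced = oversample_features(feats, 10)
```

Working on extracted features rather than raw SAR images means the synthetic samples only need to be plausible to the classifier head, which is the general motivation for feature-space methods.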

2507.09466 2026-05-28 cs.LG q-bio.QM

La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching

Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, Arash Vahdat

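AI summary: Proposes La-Proteina, which uses a partially latent representation and flow matching to jointly generate fully atomistic protein structures and amino acid sequences, achieving state-of-the-art performance on multiple benchmarks.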

Abstract

Recently, many generative models for de novo protein structure design have emerged. Yet, only a few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.

2507.08014 2026-05-28 cs.CL cs.AI cs.CY

Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

Aldan Creo, Raul Castro Fernandez, Manuel Cebrian

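AI summary: An analysis of over 2 million real-world conversations finds that jailbreak attempts are not significantly more complex than normal conversations and that attack complexity remains stable over time, suggesting that the evolution of LLM safety is bounded by human ingenuity.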

Comments Code: https://github.com/ACMCMC/risky-conversations Results: https://huggingface.co/risky-conversations Visualizer: https://huggingface.co/spaces/risky-conversations/Visualizer

Abstract

As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
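
One of the metric families named above, compression ratio, is easy to illustrate. The sketch below uses zlib and defines the ratio as compressed bytes over raw bytes; the paper's exact definition and compressor are not stated in the abstract, so both choices are assumptions, as are the function name and the sample strings.

```python
import zlib

def compression_ratio(text):
    """Compressed size divided by raw size in bytes; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, 9)) / len(raw)

# A highly repetitive message compresses far better than varied prose.
repetitive = "please answer the question " * 40
varied = "The quick brown fox jumps over the lazy dog while seven wizards quietly juggle boxes."
```

Under this definition, text that repeats itself scores low and text with little internal redundancy scores close to 1, so the ratio acts as a cheap proxy for how much new information a message carries.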

2507.06999 2026-05-28 cs.CV cs.CL cs.LG

Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

Yahan Yu, Yuyang Dong, Masafumi Oyamada

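AI summary: Proposes the D2I framework, which trains with deliberate reasoning supervised by rule-based format rewards to strengthen modality alignment, then removes the explicit strategies at inference in favor of intuitive reasoning, improving the reasoning ability of multimodal LLMs without extra annotations or complex rewards.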

Comments 22 pages, 24 figures

Abstract

Reasoning is essential for large language models (LLMs), especially in complex tasks such as mathematical problem solving. However, multimodal reasoning still faces challenges in modality alignment and training scalability, as many existing methods rely on additional annotations or complex rule-based rewards. To address these issues, we propose the Deliberate-to-Intuitive reasoning framework (D2I), which improves the understanding and reasoning abilities of multimodal LLMs (MLLMs) without extra annotations or complex rewards. During training, D2I uses deliberate reasoning strategies supervised only by rule-based format rewards to enhance modality alignment. During inference, it shifts to intuitive reasoning by removing these explicit strategies, allowing the model to implicitly apply the acquired abilities in its responses. D2I outperforms baselines on both in-domain and out-of-domain benchmarks, highlighting the effectiveness of format rewards in fostering transferable multimodal reasoning skills and suggesting the benefit of decoupling training-time reasoning depth from test-time response flexibility.
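
A rule-based format reward of the kind mentioned above can be as simple as a pattern check on the model's output. The sketch below is a hypothetical example, not D2I's reward: the `<think>`/`<answer>` tag names, the binary 1.0/0.0 scoring, and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical output template; the tag names are assumptions for illustration.
PATTERN = re.compile(r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL)

def format_reward(response):
    """Return 1.0 if the whole response matches the template, else 0.0.
    Only the structure is checked; the answer's correctness is not."""
    return 1.0 if PATTERN.fullmatch(response) else 0.0
```

A reward like this needs no annotated answers, which is the property the abstract highlights: supervision comes from checking the shape of the response rather than its content.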

2506.08928 2026-05-28 cs.LG stat.ME stat.ML

Local MDI+: Local Feature Importances for Tree-Based Models

Zhongyuan Liang, Zachary T. Rewolinski, Abhineet Agarwal, Tiffany M. Tang, Bin Yu

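AI summary: Proposes Local MDI+ (LMDI+), which extends the MDI+ framework to local feature importance. Across 12 benchmark datasets it yields an average 10% improvement in predictive performance when using only the selected features, while showing greater stability and enabling local interpretability use cases.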

Abstract

Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local feature importance methods such as LIME and TreeSHAP. However, these approaches rely on approximations that ignore the model's internal structure and instead depend on potentially unstable perturbations. These issues are addressed in the global setting by MDI+, a global feature importance method which combines tree-based and linear feature importances by exploiting an equivalence between decision trees and least squares on a transformed node basis. However, the global MDI+ scores are not able to explain predictions when faced with heterogeneous individual characteristics. To address this gap, we propose Local MDI+ (LMDI+), a novel extension of the MDI+ framework that quantifies feature importances for each particular sample. Across twelve real-world benchmark datasets, LMDI+ outperforms existing baselines at identifying instance-specific predictive features, yielding an average 10% improvement in predictive performance when using only the selected features. It further demonstrates greater stability by consistently producing similar instance-level feature importance rankings across repeated model fits with different random seeds. Ablation experiments show that each component of LMDI+ contributes to these gains, and that the improvements extend beyond random forests to gradient boosting models. Finally, we show that LMDI+ enables local interpretability use cases by identifying closely matched counterfactuals for each classification benchmark and discovering homogeneous subgroups in a housing dataset case study.

2506.05012 2026-05-28 cs.RO physics.comp-ph physics.flu-dyn

Realizing Robotic Swimming with Unified Fluid-Robot Multiphysics

Jeong Hun Lee, Junzhe Hu, Sofia Kwok, Carmel Majidi, Zachary Manchester

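AI summary: Proposes a differentiable, unified fluid-robot multiphysics simulation framework. The coupled manipulator and incompressible Navier-Stokes equations are derived jointly from the principle of least action, and discrete variational mechanics together with the implicit function theorem give stable, accurate joint simulation and gradient computation. The framework is used to optimize undulating swimming and a highly dynamic C-start escape maneuver for a bioinspired eel robot, with sim-to-real transfer validated on hardware.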

Comments 9 pages, 10 figures, accepted to Robotics: Science and Systems 2026

Abstract

Matching the swimming efficiency and agility of fish has remained an elusive goal in underwater robotics. Such locomotion capabilities rely on complex vortex interactions between the robot's body and the surrounding fluid. However, simulating these dynamics, which are governed by coupled ordinary and partial differential equations, is significantly more difficult than the multi-body dynamics of classical rigid robotic systems. We present a differentiable framework for simulating strongly coupled fluid-robot multiphysics as a unified optimization problem. The coupled manipulator and incompressible Navier-Stokes equations are derived together from a single Lagrangian using the principle of least action. We employ discrete variational mechanics to derive a stable, well-conditioned, and physically accurate scheme for jointly simulating articulated bodies and the surrounding fluid. We leverage the implicit function theorem to compute derivatives of the fully coupled dynamics. Using this simulator and its gradients, we realize undulating swimming gaits and optimize a highly dynamic C-start escape maneuver for a bioinspired eel robot. We validate both gaits on physical hardware, demonstrating successful sim-to-real transfer. Simulation code, hardware data, and schematics for the eel robot can be found here: https://unified-fluid-robot-multiphysics.github.io/
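
The use of the implicit function theorem for gradients can be written out as a short derivation. The symbols below are assumptions for illustration, since the abstract does not fix notation: $x_k$ is the coupled fluid-robot state at step $k$, $\theta$ collects the parameters being optimized, and $r$ is the residual of the implicit time-stepping scheme.

```latex
% One implicit step defines x_{k+1} as the root of a residual:
r(x_{k+1}, x_k, \theta) = 0
% Differentiating this identity with respect to x_k and \theta, and assuming
% \partial r / \partial x_{k+1} is invertible, gives
\frac{\partial x_{k+1}}{\partial x_k}
  = -\left(\frac{\partial r}{\partial x_{k+1}}\right)^{-1}\frac{\partial r}{\partial x_k},
\qquad
\frac{\partial x_{k+1}}{\partial \theta}
  = -\left(\frac{\partial r}{\partial x_{k+1}}\right)^{-1}\frac{\partial r}{\partial \theta}
```

The practical appeal is that the Jacobian $\partial r / \partial x_{k+1}$ is typically already formed when solving for the step with a Newton-type method, so the derivatives come from one additional linear solve rather than from differentiating through the solver's iterations.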

2505.21771 2026-05-28 cs.CV cs.AI

MMTABREAL: Real-World Benchmark for Multimodal Table Understanding

Prasham Titiya, Jainil Trivedi, Chitta Baral, Vivek Gupta

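AI summary: Builds MMTABREAL, a human-curated benchmark for multimodal table understanding with 500 real-world tables and 4,021 question-answer pairs. Evaluation shows performance drops of 20-40% relative to existing benchmarks, with gaps in visual grounding, spatial alignment, and multi-step reasoning.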

Abstract

Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet remain difficult for Multimodal Large Language Models (MLLMs). Despite advances in text and image understanding, systematic evaluation of table-centric multimodal reasoning is limited. We introduce MMTABREAL, a multimodal table benchmark: a human-curated suite of 500 real-world tables paired with 4,021 question-answer pairs. MMTABREAL spans four question types, five reasoning categories, and eight structural archetypes. Evaluations of state-of-the-art models reveal substantial gaps, especially in visual grounding, spatial alignment, and multi-step inference, with 20-40% performance drops relative to existing benchmarks. These results highlight the need for architectures that more tightly fuse vision with tabular structure and support explicit numeric/logical operations. MMTABREAL is released for evaluation only, providing a rigorous, reproducible testbed that reflects the linguistic, structural, and reasoning complexity of real-world multimodal tables.

2502.05242 2026-05-28 cs.CL cs.AI cs.CV cs.LG

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

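AI summary: Proposes TELLME, a method that improves the transparency of large language models' internal representations to help monitors identify unsuitable and sensitive behaviors, and validates it on detoxification tasks.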

Comments 28 pages, 8 figures, 15 tables

Abstract

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.

2503.02857 2026-05-28 cs.CV cs.AI cs.CY

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Nuria Alina Chandra, Hannah Lee, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Changyeon Lee, Jongwook Choi, Sejin Paik, Aerin Kim, Oren Etzioni

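AI summary: Because existing academic benchmarks are outdated and unrepresentative of real-world deepfakes, introduces Deepfake-Eval-2024, a multimodal benchmark of deepfakes collected in 2024 from social media and deepfake detection platform users. Open-source models degrade sharply on it, while commercial and fine-tuned models perform better but still fall short of expert forensic analysts.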

Abstract

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.
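
Since the headline results above are reported as AUC, here is a minimal sketch of how that metric is computed from detector scores. It uses the pairwise (rank-based) definition with ties counted as half; the function name and the toy labels and scores are assumptions, and this is not the benchmark's evaluation code.

```python
def auc(labels, scores):
    """Probability that a random positive scores above a random negative
    (ties count half). Labels are 1 for fake, 0 for real."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one fake (score 0.4) is ranked below one real item (score 0.5).
example = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to chance-level ranking, so a detector whose AUC falls toward 0.5 on new data is close to uninformative regardless of its score on older benchmarks.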

2505.19342 2026-05-28 cs.LG cs.AI

ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan

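AI summary: Proposes ASTRA, a framework that combines sequence parallelism with mixed-precision attention for efficient multi-device Transformer inference in low-bandwidth environments, delivering substantial speedups while preserving accuracy.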

Abstract

Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64$\times$ speedup over single-device inference and up to 15.25$\times$ over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.
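
The bandwidth saving from sending vector-quantized codes instead of full-precision embeddings can be sketched as follows. This is a generic vector quantization example, not ASTRA's implementation: the nearest-codeword rule, the tiny four-entry codebook, and the function names are assumptions.

```python
import math

def vq_encode(vector, codebook):
    """Index of the nearest codeword under squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])),
    )

def bits_per_token(codebook_size):
    """Bits needed to transmit one codeword index."""
    return math.ceil(math.log2(codebook_size))

# Hypothetical 4-entry codebook over 2-dimensional embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
idx = vq_encode([0.9, 0.2], codebook)
```

As an illustrative calculation with made-up sizes, a 768-dimensional float32 embedding occupies 768 × 32 = 24,576 bits, whereas an index into a 1,024-entry codebook takes 10 bits. The receiver looks the index up in its own copy of the codebook, trading some precision for far less traffic.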

2505.17233 2026-05-28 cs.LG cs.SD eess.AS

Semantic-Aware Interpretable Multimodal Music Auto-Tagging

Andreas Patakis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

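AI summary: Proposes an interpretable music auto-tagging method that uses groups of multimodal music features and an expectation-maximization algorithm, keeping competitive performance while making the decision process transparent.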

Comments Accepted at Interspeech 2025

Abstract

Music auto-tagging is essential for organizing and discovering music in extensive digital libraries. While foundation models achieve exceptional performance in this domain, their outputs often lack interpretability, limiting trust and usability for researchers and end-users alike. In this work, we present an interpretable framework for music auto-tagging that leverages groups of musically meaningful multimodal features, derived from signal processing, deep learning, ontology engineering, and natural language processing. To enhance interpretability, we cluster features semantically and employ an expectation maximization algorithm, assigning distinct weights to each group based on its contribution to the tagging process. Our method achieves competitive tagging performance while offering a deeper understanding of the decision-making process, paving the way for more transparent and user-centric music tagging systems.

2309.17057 2026-05-28 cs.AI

Tell Me a Story! Narrative-Driven XAI with Large Language Models

David Martens, James Hinns, Camille Dams, Mark Vergouwen, Theodoros Evgeniou

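AI summary: Proposes XAIstories, which use large language models to turn SHAP or counterfactual explanations into narratives, improving users' understanding and experience of AI decisions. Over 90% of general-audience respondents found the SHAPstories narratives convincing, and 83% of data scientists said they were likely to use SHAPstories to communicate explanations to a general audience.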

Journal ref
10.1016/j.dss.2025.114402

Abstract

In many AI applications today, the predominance of black-box machine learning models, due to their typically higher accuracy, amplifies the need for Explainable AI (XAI). Existing XAI approaches, such as the widely used SHAP values or counterfactual (CF) explanations, are arguably often too technical for users to understand and act upon. To enhance comprehension of explanations of AI decisions and the overall user experience, we introduce XAIstories, which leverage Large Language Models to provide narratives about how AI predictions are made: SHAPstories do so based on SHAP explanations, while CFstories do so for CF explanations. We study the impact of our approach on users' experience and understanding of AI predictions. Our results are striking: over 90% of the surveyed general audience finds the narratives generated by SHAPstories convincing. Data scientists primarily see the value of SHAPstories in communicating explanations to a general audience, with 83% of data scientists indicating they are likely to use SHAPstories for this purpose. In an image classification setting, more than 75% of the participants consider CFstories at least as convincing as the stories they crafted themselves. CFstories additionally bring a tenfold speed gain in creating a narrative. We also find that, in the credit scoring setting we test, SHAPstories help users summarize and understand AI decisions more accurately: users answer comprehension questions correctly significantly more often than when only SHAP values are provided. The results thereby suggest that XAIstories may significantly help in explaining and understanding AI predictions, ultimately supporting better decision-making in various applications.

2505.09861 2026-05-28 cs.LG cs.AI cs.IR stat.ME

LiDDA: Data Driven Attribution at LinkedIn

LiDDA:领英的数据驱动归因

John Bencina, Erkut Aykutlug, Yue Chen, Zerui Zhang, Stephanie Sorenson, Shao Tang, Changshuai Wei

AI总结 提出一种基于Transformer的统一归因方法,处理成员级、聚合级数据和外部宏观因素,并在领英大规模实施,显著提升营销效果。

详情
AI中文摘要

数据驱动归因基于从数据中学习到的因果模式,将转化功劳分配给营销互动,是现代营销智能的基础,对任何营销业务和广告平台至关重要。本文介绍了一种统一的基于Transformer的归因方法,能够处理成员级数据、聚合级数据以及外部宏观因素的整合。我们详细描述了该方法在领英的大规模实施,展示了显著的影响。我们还分享了广泛适用于营销和广告技术领域的经验与见解。

英文摘要

Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foundation of modern marketing intelligence and vital to any marketing business and advertising platform. In this paper, we introduce a unified transformer-based attribution approach that can handle member-level data, aggregate-level data, and integration of external macro factors. We detail the large scale implementation of the approach at LinkedIn, showcasing significant impact. We also share learnings and insights which are broadly applicable to the marketing and ad tech fields.

2504.04924 2026-05-28 cs.CV eess.IV

Inter-event Interval Microscopy for Event Cameras

事件相机的事件间间隔显微术

Changqing Su, Yanqin Chen, Zihan Lin, Zhen Cheng, You Zhou, Bo Xiong, Zhaofei Yu, Tiejun Huang

AI总结 提出基于事件相机的事件间间隔显微术(IEIM),通过量化连续事件的时间间隔实现静态和动态场景的强度重建,在荧光显微镜中实现高时空分辨率和动态范围。

详情
AI中文摘要

事件相机是一种创新的仿生传感器,与传统相机不同,它感知强度变化而非直接感知强度,并将这些变化记录为连续的“事件”流。从这些稀疏事件中重建强度一直是一个具有挑战性的问题。以往的方法主要集中在将运动诱发的事件转换为视频,或通过在事件相机采集端集成调制器件来实现静态场景的强度成像。在本文中,我们首次利用静态事件相机在荧光显微镜中实现了静态和动态场景的事件到强度转换。与主要依赖事件积分的传统方法不同,所提出的事件间间隔显微术(IEIM)量化了每个像素处连续事件之间的时间间隔。在事件相机中,由于阈值固定,时间间隔可以精确表示强度。在硬件层面,所提出的IEIM在配备事件相机的显微镜中集成了脉冲光调制器件,称为基于脉冲调制的事件驱动荧光显微镜。此外,我们收集了包含高动态范围和高速度场景的IEIMat数据集。在IEIMat数据集上的实验结果表明,所提出的IEIM在空间和时间分辨率、动态范围方面优于其他方法,且带宽更低。代码和IEIMat数据集将公开提供。

英文摘要

Event cameras, innovative bio-inspired sensors, differ from traditional cameras in that they sense changes in intensity rather than intensity itself and record these changes as a continuous stream of "events". Intensity reconstruction from these sparse events has long been a challenging problem. Previous approaches mainly focused on transforming motion-induced events into videos or achieving intensity imaging for static scenes by integrating modulation devices at the event camera acquisition end. In this paper, for the first time, we achieve event-to-intensity conversion using a static event camera for both static and dynamic scenes in fluorescence microscopy. Unlike conventional methods that primarily rely on event integration, the proposed Inter-event Interval Microscopy (IEIM) quantifies the time interval between consecutive events at each pixel. With a fixed threshold in the event camera, the time interval can precisely represent the intensity. At the hardware level, the proposed IEIM integrates a pulse light modulation device within a microscope equipped with an event camera, termed Pulse Modulation-based Event-driven Fluorescence Microscopy. Additionally, we have collected the IEIMat dataset under various scenes, including high dynamic range and high-speed scenarios. Experimental results on the IEIMat dataset demonstrate that the proposed IEIM achieves superior spatial and temporal resolution, as well as a higher dynamic range, with lower bandwidth compared to other methods. The code and the IEIMat dataset will be made publicly available.
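
The interval-to-intensity idea in this abstract can be illustrated with a toy per-pixel estimate. This is only a sketch, not the authors' IEIM pipeline: it assumes that, with a fixed contrast threshold, a brighter pixel fires events faster, so the intensity proxy is taken as proportional to the reciprocal of the mean inter-event interval. The function name `intensity_from_events`, the `(x, y, t)` event layout, and the scale constant `k` are all hypothetical.

```python
from collections import defaultdict

def intensity_from_events(events, k=1.0):
    """Estimate a per-pixel intensity proxy from inter-event intervals.

    events: iterable of (x, y, t) tuples, t in seconds (hypothetical layout).
    k: assumed proportionality constant between 1/interval and intensity.
    """
    timestamps = defaultdict(list)
    for x, y, t in events:
        timestamps[(x, y)].append(t)

    intensity = {}
    for pixel, ts in timestamps.items():
        ts.sort()
        intervals = [b - a for a, b in zip(ts, ts[1:]) if b > a]
        if not intervals:
            continue  # a single event carries no interval information
        mean_interval = sum(intervals) / len(intervals)
        intensity[pixel] = k / mean_interval
    return intensity

# Pixel (0, 0) fires every 10 ms, pixel (1, 0) every 40 ms, pixel (2, 0) only once.
events = [(0, 0, 0.00), (0, 0, 0.01), (0, 0, 0.02),
          (1, 0, 0.00), (1, 0, 0.04), (2, 0, 0.5)]
result = intensity_from_events(events)
```

In this toy input the faster-firing pixel receives the larger proxy value, and the pixel with a single event is skipped because it has no interval to measure.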

2504.20736 2026-05-28 cs.RO cs.CV

A Survey on Event-based Optical Marker Systems

基于事件的光学标记系统综述

Nafiseh Jabbari Tofighi, Maxime Robic, Fabio Morbidi, Pascal Vasseur

AI总结 本文综述了基于事件的光学标记系统(EBOMS),分析其异步操作原理和鲁棒性,并介绍了在目标检测、姿态估计和光通信等领域的应用。

Comments 11 pages, 6 figures, 2 tables

详情
AI中文摘要

事件相机的出现,以其低延迟、高动态范围和低功耗,标志着机器感知和机器人视觉的转折点。特别是,这些神经形态传感器与广泛可用的被动或主动光学标记(例如AprilTags、闪烁LED阵列)的结合,最近开辟了一个新的机遇领域。本综述论文对基于事件的光学标记系统(EBOMS)进行了全面回顾。我们分析了这些系统所基于的基本原理和技术,特别关注其异步操作和对挑战性光照条件的鲁棒性。我们还描述了EBOMS最相关的应用,包括目标检测与跟踪、姿态估计和光通信。文章最后讨论了这一快速发展的多学科领域可能的未来研究方向。

英文摘要

The advent of event-based cameras, with their low latency, high dynamic range, and reduced power consumption, marked a turning point in machine perception and robotic vision. In particular, the combination of these neuromorphic sensors with widely available passive or active optical markers (e.g. AprilTags, arrays of blinking LEDs) has recently opened up a new field of opportunities. This survey paper provides a comprehensive review of Event-Based Optical Marker Systems (EBOMS). We analyze the underlying principles and technologies on which these systems are based, with a special focus on their asynchronous operation and robustness against challenging lighting conditions. We also describe the most relevant applications of EBOMS, including object detection and tracking, pose estimation, and optical communication. The article concludes with a discussion of possible future research directions in this rapidly emerging and multidisciplinary area.

2504.04540 2026-05-28 cs.CV cs.AI

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

点、视觉与文本:点云能否提升大语言模型的空间推理能力?一项偏差控制研究

Weichen Zhang, Ruiying Peng, Xin Zeng, Jianjie Fang, Ziyou Wang, Kaiyuan Li, Heng Dong, Wei Li, Chen Gao, Xin Wang, Xinlei Chen, Yong Li

AI总结 本文通过引入包含文本、视觉和点云模态的3D空间推理基准ScanReQA,评估不同模态下大语言模型的空间推理能力,发现点云和视觉模态的模型表现优于纯文本模型,并揭示了3D大语言模型中的注意力下沉现象。

详情
AI中文摘要

利用点云中空间信息进行3D空间推理的3D大语言模型(LLMs)引起了广泛关注。尽管取得了一些有希望的结果,但点云相对于其他模态的优势仍不明确。此外,现有的3D基准不足以公平评估多模态大语言模型理解空间概念的能力。为了解决这些挑战,我们引入了ScanReQA,一个涵盖文本、视觉和点云模态的3D空间推理基准。然后,我们评估了文本、2D和3D大语言模型在该基准上的性能,以比较不同模态在理解空间概念方面的有效性。此外,我们分析了使用点云的3D大语言模型背后的推理机制。我们的发现表明:1)二元空间推理对当前的3D大语言模型仍然具有挑战性;2)基于点云和视觉模态的多模态大语言模型展现出比大语言模型更强的空间推理能力;3)3D大语言模型表现出类似于2D大语言模型中的注意力下沉现象,这损害了空间推理。我们认为这些结论有助于3D大语言模型的下一步发展,并为其他模态的基础模型提供见解。我们在项目页面发布了数据集和代码:https://github.com/EmbodiedCity/ScanReQA.code。

英文摘要

3D Large Language Models (LLMs) that leverage spatial information in point clouds for 3D spatial reasoning have attracted great attention. Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs (MLLMs) to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing text, vision, and point cloud modalities. We then evaluate the performance of text, 2D, and 3D LLMs on the benchmark to compare the effectiveness of different modalities in understanding spatial concepts. Furthermore, we analyze the reasoning mechanisms behind 3D LLMs using point clouds. Our findings reveal that: 1) binary spatial reasoning remains challenging for current 3D LLMs, 2) MLLMs based on point cloud and visual modalities demonstrate stronger spatial reasoning capabilities than LLMs, and 3) 3D LLMs exhibit the attention sink phenomenon similar to that in 2D LLMs, impairing spatial reasoning. We think these conclusions can inform the next steps in 3D LLM development and also offer insights for foundation models in other modalities. We release the datasets and code on the project page: https://github.com/EmbodiedCity/ScanReQA.code.

2501.04144 2026-05-28 cs.CV cs.GR

Chirpy3D: Part-Aware Multi-View Diffusion for Creative Fine-Grained Object Generation

Chirpy3D: 面向创意细粒度物体生成的部件感知多视角扩散

Kam Woh Ng, Jing Yang, Jia Wei Sii, Chee Seng Chan, Jiankang Deng, Yi-Zhe Song, Tao Xiang, Xiatian Zhu

AI总结 提出Chirpy3D,一种部件感知多视角扩散框架,从无姿态2D图像中学习层次化部件潜在空间,实现部件级交换、插值和零样本组合,无需3D数据或手动标注。

Comments 20 pages. Code at https://github.com/kamwoh/chirpy3d

详情
AI中文摘要

理解并生成物体的细粒度结构——例如具有物种特异性喙、翅膀和尾巴的鸟类——是计算机视觉中长期存在的挑战。我们提出Chirpy3D,一种部件感知多视角扩散框架,它从无姿态的2D图像中学习层次化部件潜在空间,仅使用现成的2D部件分割掩码作为空间指导——无需任何3D数据、相机姿态或手动部件标注。该潜在空间支持直观的部件级交换、插值和零样本组合。自监督特征一致性损失进一步促进跨视角的结构对齐,即使在混合或未见过的部件组合下也能实现连贯生成。我们的核心贡献是可控制的部件感知潜在空间和多视角扩散模型。通过任何可微分渲染器(如NeRF)支持下游3D生成,但这与主框架正交,使Chirpy3D成为在缺乏结构化3D数据时进行创意物体生成的灵活基础。代码已发布在https://github.com/kamwoh/chirpy3d。

英文摘要

Understanding and generating the fine-grained structure of objects -- such as birds with species-specific beaks, wings, and tails -- is a long-standing challenge in computer vision. We propose Chirpy3D, a part-aware multi-view diffusion framework that learns a hierarchical part latent space from unposed 2D images, using only off-the-shelf 2D part segmentation masks as spatial guidance -- without requiring any 3D data, camera poses, or manual part annotations. This latent space enables intuitive part-level swapping, interpolation, and zero-shot composition. A self-supervised feature consistency loss further encourages structural alignment across views, allowing coherent generation even with hybrid or unseen part combinations. Our core contribution is the controllable part-aware latent space and multi-view diffusion model. Downstream 3D generation is supported via any differentiable renderer such as NeRF but is orthogonal to the main framework, making Chirpy3D a flexible foundation for creative object generation in the absence of structured 3D data. Code is released at https://github.com/kamwoh/chirpy3d.

2503.22655 2026-05-28 cs.AI cs.CV cs.MM

Text-Only Data Synthesis for Vision Language Model Training

仅文本数据合成用于视觉语言模型训练

Xiaomin Yu, Wenjie Zhang, Ziyue Qiao, Chengwei Qin, Hui Xiong

AI总结 提出一个跨集成的三阶段多模态数据合成框架,仅从文本生成高质量多模态训练数据,用于视觉语言模型的预训练和指令微调。

详情
AI中文摘要

训练视觉语言模型(VLM)通常需要大规模、高质量的图像-文本对,但收集或合成此类数据成本高昂。相比之下,文本数据丰富且廉价,这引发了一个问题:能否仅从文本中合成高质量的多模态训练数据?为解决这一问题,我们提出了一个跨集成的三阶段多模态数据合成框架,生成了两个数据集:Unicorn-1.2M和Unicorn-471K-Instruction。在第一阶段:多样化字幕数据合成,我们通过使用大语言模型(LLM)扩展稀疏字幕种子,构建了120万语义多样的高质量字幕。在第二阶段:指令微调数据生成,我们进一步将47.1万个字幕处理为多轮指令微调任务,以支持复杂推理。最后,在第三阶段:模态表示迁移,这些文本字幕表示被转换为视觉表示,从而产生多样化的合成图像表示。这一三阶段过程使我们能够构建用于预训练的Unicorn-1.2M和用于指令微调的Unicorn-471K-Instruction,而无需依赖真实图像。通过消除对真实图像的依赖,同时保持数据质量和多样性,我们的框架为VLM训练提供了一种成本效益高且可扩展的解决方案。

英文摘要

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training.

2503.18893 2026-05-28 cs.CL cs.LG

xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction

xKV:通过对齐奇异向量提取的跨层KV缓存压缩

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Hung-Yueh Chiang, Yash Akhauri, Xilai Dai, Huiqiang Jiang, Yucheng Li, Luis Ceze, Kai-Chiang Wu, Mohamed S. Abdelfattah

AI总结 提出xKV,一种通过跨层共享低秩子空间压缩KV缓存的后训练方法,实现高达8倍压缩且保持长上下文任务精度,并引入选择性重建实现端到端加速。

Comments ICML 2026

详情
AI中文摘要

长上下文大型语言模型(LLMs)支持强大的应用,但由于键值状态(KV-Cache)导致高内存成本。最近的研究尝试跨层共享KV-Cache,但这些方法要么需要昂贵的预训练,要么依赖于实践中通常有限的逐token跨层余弦相似度。我们通过中心核对齐(CKA)表明,KV-Cache的主要奇异向量在层间对齐良好。受此观察启发,我们提出xKV,一种后训练压缩方法,将分组层的KV-Cache联合分解为共享的低秩子空间,大幅减少KV-Cache内存。在广泛使用的LLMs上,xKV实现了高达8倍的KV-Cache压缩,同时在长上下文任务和多轮设置中保持准确性。为进一步提高效率,我们在解码时引入选择性重建(SR)。结合SR,xKV相比全注意力基线实现了高达4.23倍的端到端加速,并在相似精度水平下以30%更高的吞吐量超越了显著基线。总体而言,xKV提供了一种即插即用的方法,用于减少长上下文LLM推理的内存和延迟。我们的代码公开于:https://github.com/abdelfattah-lab/xKV。

英文摘要

Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either require expensive pretraining or rely on per-token cross-layer cosine similarity that is often limited in practice. We show, via Centered Kernel Alignment (CKA), that the dominant singular vectors of KV-Cache are well aligned across layers. Motivated by this observation, we propose xKV, a post-training compression method that jointly factorizes grouped-layer KV-Cache into a shared low-rank subspace, substantially reducing KV-Cache memory. Across widely used LLMs, xKV achieves up to 8x KV-Cache compression while preserving accuracy on long-context tasks and in multi-turn settings. To further improve efficiency, we introduce Selective Reconstruction (SR) at decode time. Combined with SR, xKV achieves up to 4.23x end-to-end speedup over the full attention baseline, and surpasses notable baselines with 30% higher throughput under a similar accuracy level. Overall, xKV provides a plug-and-play approach to reduce both memory and latency for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
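
The idea of factorizing grouped-layer caches into a shared low-rank subspace can be sketched with a plain truncated SVD. This is a minimal illustration under stated assumptions, not the xKV implementation: the per-layer caches are synthetic matrices that share a token-side subspace by construction, they are concatenated along the feature axis, and the helper `shared_lowrank` is a hypothetical name.

```python
import numpy as np

def shared_lowrank(caches, rank):
    """Jointly factorize per-layer caches (each tokens x dim) at a given rank.

    Returns a shared (tokens x rank) factor and one (rank x dim) factor per layer.
    """
    stacked = np.concatenate(caches, axis=1)          # tokens x (layers * dim)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    shared = u[:, :rank] * s[:rank]                   # shared token-side factor
    dims = np.cumsum([c.shape[1] for c in caches])[:-1]
    per_layer = np.split(vt[:rank], dims, axis=1)     # layer-specific factors
    return shared, per_layer

rng = np.random.default_rng(0)
tokens, dim, true_rank = 64, 16, 4
basis = rng.standard_normal((tokens, true_rank))      # subspace shared by both layers
layer_caches = [basis @ rng.standard_normal((true_rank, dim)) for _ in range(2)]

shared, per_layer = shared_lowrank(layer_caches, rank=true_rank)
recon = [shared @ p for p in per_layer]
max_err = max(float(np.abs(r - c).max()) for r, c in zip(recon, layer_caches))
stored = shared.size + sum(p.size for p in per_layer)   # 64*4 + 2*4*16 values
original = sum(c.size for c in layer_caches)            # 2*64*16 values
```

Because the synthetic layers share an exact rank-4 subspace, reconstruction here is lossless up to floating-point error; real caches are only approximately low-rank, so the achievable compression depends on how well the layers align.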

2503.11477 2026-05-28 cs.AI

Heterogeneous Causal Discovery of Repeated Undesirable Health Outcomes

重复不良健康结果的异质因果发现

Shishir Adhikari, Guido Muscioni, Mark Shapiro, Plamen Petrov, Elena Zheleva

AI总结 提出一个集成因果结构学习算法与异质因果效应估计的端到端框架,用于从观测数据中发现稳健且可解释的因果假设,并在两个大规模医疗应用中验证其有效性。

详情
AI中文摘要

理解触发或预防患者亚群中不良健康结果的因素对于设计针对性干预措施至关重要。虽然随机对照试验和专家主导的患者访谈是识别这些因素的标准方法,但它们可能耗时或不可行。因果发现通过从观测数据中生成因果假设提供了传统方法的替代方案,但其实际效用受到强假设或不可检验假设的限制。本文提出了一种新颖的端到端框架,该框架独特地集成了因果结构学习算法集成与异质因果效应估计。通过聚合多个算法的结果,该框架识别出在不同建模假设下仍然存在的稳健因果关系,同时揭示这些效应如何在特定患者情境中变化。所提出的异质因果发现框架提高了稳健性,并为实践者提供了一组优先排序的、可操作的、临床可解释的假设。我们通过两个大规模医疗应用展示了该框架的有效性:使用保险索赔和电子健康记录数据集,识别糖尿病患者重复急诊就诊和ICU患者再入院的驱动因素和抑制因素。我们在两种情境下的结果均将慢性病管理和护理协调确定为关键干预措施,同时揭示干预效果取决于特定的患者层面修饰因素。我们采用多层验证策略,包括通过模拟恢复真实情况、与临床文献对齐、由临床专家验证以及在现代医疗系统中使用外部数据集进行可移植性验证,以展示该框架的实际效用。

英文摘要

Understanding the factors that trigger or prevent undesirable health outcomes across patient subpopulations is essential for designing targeted interventions. While randomized controlled trials and expert-led patient interviews are standard methods for identifying these factors, they can be time-consuming or infeasible. Causal discovery offers an alternative to conventional approaches by generating cause-and-effect hypotheses from observational data, yet its practical utility is limited by strong or untestable assumptions. This work presents a novel, end-to-end framework that uniquely integrates an ensemble of causal structure learning (CSL) algorithms with heterogeneous causal effect estimation. By aggregating results across multiple algorithms, the framework identifies robust causal relationships that persist under different modeling assumptions while simultaneously revealing how these effects vary across specific patient contexts. The proposed heterogeneous causal discovery framework improves robustness and provides practitioners with a prioritized set of actionable, clinically interpretable hypotheses. We demonstrate the framework's effectiveness through two large-scale healthcare applications: identifying drivers and inhibitors of repeat emergency department visits among diabetic patients and hospital readmissions among ICU patients, using insurance claims and electronic health record datasets. Our results, across both settings, identify chronic disease management and care coordination as key interventions, while revealing that intervention effectiveness depends on specific patient-level modifiers. We employ a multi-layered validation strategy, including ground-truth recovery via simulations, alignment with clinical literature, validation by expert clinicians, and portability in modern healthcare systems using an external dataset, to demonstrate the framework's practical utility.
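
Aggregating results across several structure learning algorithms can be pictured as edge voting. The abstract does not state the aggregation rule, so the majority-style threshold below is an assumption, and the function name `robust_edges`, the `min_fraction` parameter, and the variable names in the example are hypothetical.

```python
from collections import Counter

def robust_edges(edge_sets, min_fraction=0.5):
    """Keep directed edges proposed by at least min_fraction of the algorithms.

    edge_sets: one collection of (cause, effect) pairs per algorithm.
    Returns a dict mapping each retained edge to its support fraction.
    """
    counts = Counter(edge for edges in edge_sets for edge in set(edges))
    needed = min_fraction * len(edge_sets)
    return {edge: n / len(edge_sets) for edge, n in counts.items() if n >= needed}

# Outputs of three hypothetical structure learning algorithms.
algo_outputs = [
    [("care_coordination", "readmission"), ("age", "readmission")],
    [("care_coordination", "readmission")],
    [("care_coordination", "readmission"), ("insurance_type", "readmission")],
]
kept = robust_edges(algo_outputs, min_fraction=2 / 3)
```

Only the edge found by every algorithm survives the two-thirds threshold in this toy input; edges supported by a single algorithm are dropped as not robust to modeling assumptions.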

2503.04863 2026-05-28 cs.CV cs.AI

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Manboformer: 通过时空注意力机制学习高斯表示

Ziyue Zhao, Qining Qi, Jianfa Ma

AI总结 针对自动驾驶3D语义占用预测中高斯表示性能不足的问题,提出利用时空自注意力机制优化GaussianFormer,以提升模型性能。

Comments After careful self-check, we found several unnoticed deficiencies and incomplete discussions in this manuscript. To ensure the rigor and accuracy of academic results, we decide to withdraw this preprint. A refined, complete, and rigorous version will be submitted soon

详情
AI中文摘要

与基于体素的网格预测相比,在自动驾驶的3D语义占用预测领域,GaussianFormer提出使用3D高斯来描述场景,基于对象的稀疏3D语义高斯是另一种内存需求更低的方案。每个3D高斯函数表示一个灵活的兴趣区域及其语义特征,通过注意力机制迭代优化。实验中发现,该方法所需的高斯函数数量大于原始密集网格网络的查询分辨率,导致性能受损。因此,我们考虑通过利用未使用的时序信息来优化GaussianFormer。我们从先前的网格占用网络中学习时空自注意力机制,并将其改进应用于GaussianFormer。实验使用NuScenes数据集进行,目前正在进行中。

英文摘要

For 3D semantic occupancy prediction in autonomous driving, GaussianFormer proposed describing scenes with sparse, object-based 3D semantic Gaussians, an alternative to voxel-based grid prediction with lower memory requirements. Each 3D Gaussian represents a flexible region of interest and its semantic features, which are iteratively refined by an attention mechanism. In our experiments, we found that the number of Gaussians required by this method is larger than the query resolution of the original dense grid network, which impairs performance. We therefore consider optimizing GaussianFormer by exploiting the temporal information it leaves unused. We adopt the spatial-temporal self-attention mechanism from previous grid-based occupancy networks and adapt it to GaussianFormer. Experiments on the NuScenes dataset are currently underway.

2502.12468 2026-05-28 cs.LG cs.AI

MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

MCTS-Judge:LLM作为裁判在代码正确性评估中的测试时扩展

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti

AI总结 提出MCTS-Judge框架,利用蒙特卡洛树搜索在测试时进行多视角分解评估,显著提升LLM作为裁判在代码正确性评估中的准确性和效率。

详情
AI中文摘要

LLM作为裁判的范式在评估生成内容方面显示出潜力,但在推理密集型场景(如编程)中缺乏可靠性。受近期推理模型进展和扩展定律变化的启发,我们率先将测试时计算引入LLM作为裁判,提出MCTS-Judge,一种资源高效的、系统2思维框架,用于代码正确性评估。MCTS-Judge利用蒙特卡洛树搜索(MCTS)将问题分解为更简单的、多视角的评估。通过结合基于当前轨迹中历史动作的自我评估和基于先前rollout的树的上置信界(UCT)的节点选择策略,MCTS-Judge平衡了全局优化和当前轨迹的细化。我们进一步设计了一种高精度的、单元测试级别的奖励机制,以鼓励大语言模型(LLM)进行逐行分析。在三个基准和五个LLM上的大量实验证明了MCTS-Judge的有效性,它将基础模型的准确率从41%提升到80%,同时比o1系列模型少使用3倍的token。进一步的评估验证了其推理轨迹在逻辑、分析、全面性和整体质量上的优越性,同时揭示了LLM作为裁判范式的测试时扩展定律。

英文摘要

The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the Large Language Model (LLM) to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model's accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.
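
The Upper Confidence Bound for Trees term mentioned in the abstract can be written out as a short selection routine. This sketch shows only the standard UCT rule; the self-assessment component that MCTS-Judge combines with it is omitted, and the child names, dictionary keys, and exploration constant `c` are hypothetical.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """children: list of dicts with hypothetical keys 'name', 'reward', 'visits'."""
    return max(children, key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits))

children = [
    {"name": "check_edge_cases", "reward": 6.0, "visits": 10},
    {"name": "trace_line_by_line", "reward": 2.0, "visits": 2},
    {"name": "verify_io_format", "reward": 0.0, "visits": 0},
]
first = select_child(children, parent_visits=12)["name"]
visited_only = select_child(children[:2], parent_visits=12)["name"]
```

The unvisited child is selected first; among visited children, the less-explored one wins here because both its mean reward and its exploration bonus are higher.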

2412.08052 2026-05-28 cs.LG stat.ML

CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation

CANDOR: 反事实注释的双重稳健离策略评估

Aishwarya Mandyam, Shengpu Tang, Jiayu Yao, Jenna Wiens, Barbara E. Engelhardt

AI总结 提出基于双重稳健框架的离策略评估方法,通过仅在奖励模型组件中融入反事实注释,在理论保证和实验上优于其他策略。

Comments 11 pages, published in the conference proceedings of the Conference on Health Inference and Learning (2026)

详情
AI中文摘要

离策略评估(OPE)对于将上下文赌博机(contextual bandit)算法应用于高风险决策环境(如医疗保健)至关重要,因为新治疗策略在部署前必须进行评估。不幸的是,OPE技术本质上受到可用数据广度的限制,这可能不足以评估新策略的性能。最近的工作尝试通过添加专家注释的反事实样本来改善数据集覆盖。然而,此类注释通常不完美,可能导致比不使用任何注释更差的估计器性能。为了更好地利用不完美注释,我们提出了一族基于双重稳健(DR)框架的OPE估计器,该框架将重要性采样(IS)与奖励模型(直接方法,DM)相结合以获得更好的统计保证。我们研究了三种融入反事实注释的方式。在温和假设下,我们证明仅在DM组件中使用注释能产生最理想的理论结果。在多个医疗保健任务(包括真实世界电子健康记录(EHR)数据)上的实验表明,该策略在错误指定的奖励模型和不准确的注释下最为稳健。通过解决不完美注释带来的挑战,这项工作拓宽了OPE方法的适用性,并促进了医疗保健中决策策略的更安全部署。

英文摘要

Off-policy evaluation (OPE) is critical for applying contextual bandit algorithms to high-stakes decision-making settings such as healthcare, where new treatment policies must be evaluated prior to deployment. Unfortunately, OPE techniques are inherently limited by the breadth of the available data, which may not be sufficient to evaluate the performance of a new policy. Recent work attempts to improve dataset coverage by adding expert-annotated counterfactual samples. However, such annotations are often imperfect and can lead to worse estimator performance than using no annotations at all. To better leverage imperfect annotations, we propose a family of OPE estimators grounded in the doubly robust (DR) framework, which combines importance sampling (IS) with a reward model (direct method, DM) for better statistical guarantees. We study three ways of incorporating counterfactual annotations. Under mild assumptions, we prove that using annotations within just the DM component yields the most desirable theoretical results. Experiments on multiple healthcare tasks, including real-world electronic health records (EHR) data, show that this strategy is most robust under misspecified reward models and inaccurate annotations. By addressing the challenges posed by imperfect annotations, this work broadens the applicability of OPE methods and facilitates safer deployment of decision-making policies in healthcare.
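
The doubly robust combination of importance sampling and a reward model can be sketched for a toy contextual bandit. This is the textbook DR estimator, not the CANDOR estimators themselves; in the DM-only strategy the abstract favors, counterfactual annotations would presumably enter through the fitted `reward_model`, which is an assumption here. All function and variable names are hypothetical.

```python
def doubly_robust_value(logged, target_policy, behavior_policy, reward_model, actions):
    """Standard doubly robust estimate of a target policy's value.

    logged: list of (context, action, reward) tuples from the behavior policy.
    target_policy / behavior_policy: functions (context, action) -> probability.
    reward_model: function (context, action) -> predicted reward (the DM part).
    """
    total = 0.0
    for x, a, r in logged:
        dm = sum(target_policy(x, b) * reward_model(x, b) for b in actions)
        weight = target_policy(x, a) / behavior_policy(x, a)   # IS ratio
        total += dm + weight * (r - reward_model(x, a))         # DM + IS correction
    return total / len(logged)

actions = [0, 1]
behavior = lambda x, a: 0.5                       # uniform logging policy
target = lambda x, a: 1.0 if a == 1 else 0.0      # always choose action 1
true_reward = lambda x, a: float(a)               # toy ground truth
logged = [(0, 0, 0.0), (0, 1, 1.0), (1, 0, 0.0), (1, 1, 1.0)]

perfect_model_estimate = doubly_robust_value(logged, target, behavior, true_reward, actions)
wrong_model_estimate = doubly_robust_value(logged, target, behavior, lambda x, a: 0.3, actions)
```

In this toy setup both estimates recover the true value of 1.0: with a perfect reward model the correction term vanishes, and with a misspecified constant model the importance-weighted residual compensates because the behavior propensities are known.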

2410.10241 2026-05-28 cs.LG cs.AI stat.ML

Revisiting Graph Autoencoders as Implicit Contrastive Learners

重新审视图自编码器作为隐式对比学习器

Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Zulun Zhu, Liang Chen

AI总结 本文通过对比学习视角重新审视图自编码器,揭示其隐式对比学习本质,并强调对比视图设计的关键作用,提出非对称子图视图作为重要设计维度。

Comments KDD 2026 research track. Code available at https://github.com/EdisonLeeeee/lrGAE

详情
AI中文摘要

图自编码器(GAEs)和图对比学习(GCL)是图上自监督表示学习的两种主要范式,但它们通常被孤立研究并被视为根本不同的方法。在这项工作中,我们通过对比学习的视角重新审视GAEs,并表明基于结构和基于特征的GAEs都可以概念化为隐式图对比学习器。这一视角揭示了许多现有GAEs的主要区别在于对比视图的构建方式,而非学习目标或架构。基于这一见解,我们引入了一个统一公式,强调对比视图设计是GAEs中一个核心且先前较少探索的维度。特别是,我们识别出由子图视图不匹配产生的非对称对比视图,作为先前GAE研究中一个重要但未充分探索的设计轴。我们在统一框架内形式化这一见解,并在代表性图学习任务上进行系统实验,以检验其对性能和效率的影响。我们的结果表明,将GAEs解释为隐式对比学习器能更清晰地理解现有模型,并为设计有效且可扩展的图自编码器提供实用指导。

英文摘要

Graph autoencoders (GAEs) and graph contrastive learning (GCL) are two major paradigms for self-supervised representation learning on graphs, yet they are often studied in isolation and treated as fundamentally different approaches. In this work, we revisit GAEs through the lens of contrastive learning and show that both structure-based and feature-based GAEs can be conceptualized as implicit graph contrastive learners. This perspective reveals that many existing GAEs differ primarily in how contrastive views are constructed, rather than in their learning objectives or architectures. Building on this insight, we introduce a unified formulation that highlights contrastive view design as a central and previously less explored dimension in GAEs. In particular, we identify asymmetric contrastive views, arising from mismatches in subgraph views, as an important yet underexplored design axis in prior GAE research. We formalize this insight within a unified framework and conduct systematic experiments on representative graph learning tasks to examine its impact on performance and efficiency. Our results show that interpreting GAEs as implicit contrastive learners offers a clearer understanding of existing models and provides practical guidance for designing effective and scalable graph autoencoders.

2410.04498 2026-05-28 cs.LG

AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

AdaMemento: 自适应记忆辅助策略优化用于强化学习

Renye Yan, Yaozhong Gan, You Wu, Junliang Xing, Ling Liangn, Yeshang Zhu, Yimao Cai

AI总结 针对稀疏奖励强化学习问题,提出AdaMemento框架,通过记忆反思模块利用正负经验、细粒度内在动机引导探索以及集成学习自适应协调利用与探索,显著提升性能。

详情
AI中文摘要

在强化学习的稀疏奖励场景中,记忆机制通过像人类一样反思过去经验,为策略优化提供了有前景的捷径。然而,当前基于记忆的强化学习方法仅仅存储和重用高价值策略,缺乏对多样化过去经验的深入精炼和过滤,从而限制了记忆的能力。在本文中,我们提出了AdaMemento,一种自适应记忆增强的强化学习框架。我们不仅记忆正面的过去经验,还设计了一个记忆反思模块,通过学习基于实时状态预测已知局部最优策略,同时利用正面和负面经验。为了有效收集信息丰富的轨迹用于记忆,我们进一步引入了一种细粒度的内在动机范式,其中相似状态中的细微差别可以被精确区分以引导探索。然后,通过集成学习自适应地协调对过去经验的利用和新策略的探索,以逼近全局最优。此外,我们从理论上证明了我们新的内在动机和集成机制的优势。通过59个定量和可视化实验,我们确认AdaMemento能够区分细微状态以更好地探索,并有效利用记忆中的过去经验,相比之前的方法取得了显著改进。

英文摘要

In sparse-reward scenarios of reinforcement learning (RL), the memory mechanism provides promising shortcuts to policy optimization by reflecting on past experiences as humans do. However, current memory-based RL methods simply store and reuse high-value policies, lacking deeper refinement and filtering of diverse past experiences and hence limiting the capability of memory. In this paper, we propose AdaMemento, an adaptive memory-enhanced RL framework. Instead of just memorizing positive past experiences, we design a memory-reflection module that exploits both positive and negative experiences by learning to predict known locally optimal policies based on real-time states. To effectively gather informative trajectories for the memory, we further introduce a fine-grained intrinsic motivation paradigm, where nuances in similar states can be precisely distinguished to guide exploration. The exploitation of past experiences and exploration of new policies are then adaptively coordinated by ensemble learning to approach the global optimum. Furthermore, we theoretically prove the superiority of our new intrinsic motivation and ensemble mechanism. From 59 quantitative and visualization experiments, we confirm that AdaMemento can distinguish subtle states for better exploration and effectively exploit past experiences in memory, achieving significant improvement over previous methods.

2407.21075 2026-05-28 cs.AI cs.CL cs.LG

Apple Intelligence Foundation Language Models

Apple Intelligence 基础语言模型

Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Daniel Parilla, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren

AI总结 本文介绍了为 Apple Intelligence 功能开发的基础语言模型,包括一个约30亿参数的设备端高效运行模型和一个用于私有云计算的服务器端大模型,并描述了其架构、训练数据、优化过程和评估结果。

详情
AI中文摘要

我们介绍了为支持 Apple Intelligence 功能而开发的基础语言模型,包括一个约30亿参数的模型,旨在设备上高效运行,以及一个用于私有云计算的大型服务器端语言模型。这些模型旨在高效、准确且负责任地执行广泛的任务。本报告描述了模型架构、用于训练模型的数据、训练过程、模型如何针对推理进行优化以及评估结果。我们强调了对负责任人工智能的关注,以及这些原则如何贯穿于模型开发的整个过程。

英文摘要

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

2405.09689 2026-05-28 cs.LG cs.AI cs.SC

Generalized Holographic Reduced Representations

广义全息约简表示

Calvin Yeung, Zhuowen Zou, SungHeon Jeong, Wenjun Huang, Nathaniel D Bastian, Mohsen Imani

AI总结 提出广义全息约简表示(GHRR),通过灵活的非交换绑定操作改进超维计算对复杂组合结构的编码能力,并在语言建模任务中验证其可替代注意力机制并提升性能。

详情
AI中文摘要

超维计算(HDC)是一种计算和数据高效的范式,在连接主义和符号主义人工智能方法之间架起桥梁。然而,HDC的简单性给编码复杂组合结构带来了挑战,尤其是在其绑定操作中。为了解决这个问题,我们提出了广义全息约简表示(GHRR),它是傅里叶全息约简表示(FHRR)的扩展,FHRR是一种特定的HDC实现。GHRR引入了一种灵活的非交换绑定操作,能够改进复杂数据结构的编码,同时保留HDC的鲁棒性和透明性等理想特性。在这项工作中,我们介绍了GHRR框架,证明了其理论性质及其对HDC性质的遵循,探索了其核和绑定特性,并通过实证实验展示了其灵活的非交换性以及对组合结构增强的解码准确性。我们还证明了GHRR中的绑定比其他HDC变体更具表现力;特别地,我们展示了GHRR中的绑定可以实现一种注意力机制。我们通过在Transformer中将其注意力机制替换为GHRR等价物并在语言建模任务上进行测试来验证这一点,结果显示与普通Transformer相比性能有所提升。

英文摘要

Hyperdimensional Computing (HDC) is a computationally and data-efficient paradigm that acts as a bridge between connectionist and symbolic approaches to artificial intelligence (AI). However, HDC's simplicity poses challenges for encoding complex compositional structures, especially in its binding operation. To address this, we propose Generalized Holographic Reduced Representations (GHRR), an extension of Fourier Holographic Reduced Representations (FHRR), a specific HDC implementation. GHRR introduces a flexible, non-commutative binding operation, enabling improved encoding of complex data structures while preserving HDC's desirable properties of robustness and transparency. In this work, we introduce the GHRR framework, prove its theoretical properties and its adherence to HDC properties, explore its kernel and binding characteristics, and perform empirical experiments showcasing its flexible non-commutativity and enhanced decoding accuracy for compositional structures. We also demonstrate that binding in GHRR is more expressive than that in other HDC variants; in particular, we show that binding in GHRR can implement a kind of attention mechanism. We verify this by replacing the attention mechanism in a transformer with its GHRR equivalent and testing it on a language modeling task, showing improved performance compared to a vanilla transformer.
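
The contrast between commutative and non-commutative binding can be shown numerically. In FHRR, hypervectors are unit complex phasors and binding is element-wise multiplication. The matrix-valued variant below, where each component is a small unitary matrix and binding is a component-wise matrix product, is one way to obtain non-commutativity and is offered as an assumption-laden sketch rather than the paper's exact construction; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fhrr(dim):
    """FHRR hypervector: unit-modulus complex phasors."""
    return np.exp(1j * rng.uniform(-np.pi, np.pi, dim))

def random_unitary_hv(dim, m):
    """Matrix-valued hypervector: dim independent m x m unitary matrices (via QR)."""
    z = rng.standard_normal((dim, m, m)) + 1j * rng.standard_normal((dim, m, m))
    return np.stack([np.linalg.qr(zi)[0] for zi in z])

# FHRR: binding is element-wise multiplication, hence commutative.
a, b = random_fhrr(512), random_fhrr(512)
fhrr_commutes = np.allclose(a * b, b * a)
fhrr_recovered = np.allclose(np.conj(a) * (a * b), b)   # unbind with the conjugate

# Matrix-valued variant: component-wise matrix products do not commute in general.
A, B = random_unitary_hv(512, 2), random_unitary_hv(512, 2)
bound_ab = A @ B
bound_ba = B @ A
matrix_commutes = np.allclose(bound_ab, bound_ba)
A_inv = np.conj(np.swapaxes(A, -1, -2))                  # conjugate transpose
matrix_recovered = np.allclose(A_inv @ bound_ab, B)
```

Both variants allow exact unbinding, but only the matrix-valued one distinguishes the order of its operands, which is the property that lets a binding encode ordered or nested structure.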