arXivDaily: arXiv daily academic digest, updated Monday through Friday
2505.13350 2026-05-18 cs.RO

Approximating Global Contact-Implicit MPC via Sampling and Local Complementarity

Sharanya Venkatesh, Bibit Bianchini, Alp Aydinoglu, William Yang, Michael Posa

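Affiliations GRASP Laboratory at the University of Pennsylvania; Boston Dynamics; Amazon Robotics

AI summary To achieve general-purpose dexterous manipulation, robots need to plan and execute contact-rich motions quickly. Existing model-based controllers cannot globally optimize over the exponential number of possible contact sequences in real time, and contact-implicit control methods, although they use simpler models, only approximate locally, which limits exploration of the contact space. This paper proposes a new approach that combines local complementarity-based control with global sampling: at every control cycle it first samples a contact-free stage, then runs contact-rich local model predictive control from each sample. The result is a globally informed contact-implicit controller that performs precise non-prehensile manipulation of non-convex objects in real time.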

Comments S.V. and B.B. contributed equally to this work. Accepted to RA-L 2025; presented at ICRA 2026. Project page: https://approximating-global-ci-mpc.github.io

Journal ref IEEE Robotics and Automation Letters, volume 10, number 11, pages 12117-12124, September 2025

Abstract

To achieve general-purpose dexterous manipulation, robots must rapidly devise and execute contact-rich behaviors. Existing model-based controllers are incapable of globally optimizing in real-time over the exponential number of possible contact sequences. Instead, recent progress in contact-implicit control has leveraged simpler models that, while still hybrid, make local approximations. However, the use of local models inherently limits the controller to only exploit nearby interactions, potentially requiring intervention to richly explore the space of possible contacts. We present a novel approach which leverages the strengths of local complementarity-based control in combination with low-dimensional, but global, sampling of possible end-effector locations. Our key insight is to consider a contact-free stage preceding a contact-rich stage at every control loop. Our algorithm, in parallel, samples end effector locations to which the contact-free stage can move the robot, then considers the cost predicted by contact-rich MPC local to each sampled location. The result is a globally-informed, contact-implicit controller capable of real-time dexterous manipulation. We demonstrate our controller on precise, non-prehensile manipulation of non-convex objects using a Franka Panda arm. Project page: https://approximating-global-ci-mpc.github.io
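The sample-then-evaluate step described in the abstract can be illustrated with a minimal sketch. It is not the authors' controller: `choose_end_effector_target`, `toy_cost`, and the choice to keep the current location as a candidate are all assumptions made for illustration, and the real method evaluates samples in parallel with a contact-rich MPC cost rather than a toy distance.

```python
def choose_end_effector_target(current, samples, local_mpc_cost):
    """Return the candidate location with the lowest predicted local cost."""
    # Assumption: the current location stays in the candidate set, so the
    # controller can keep acting locally when no sample looks better.
    candidates = [current] + list(samples)
    # Stand-in for "the cost predicted by contact-rich MPC local to each sample".
    costs = [local_mpc_cost(p) for p in candidates]
    best = min(range(len(candidates)), key=lambda i: costs[i])
    return candidates[best], costs[best]


def toy_cost(p, goal=(1.0, 0.0)):
    """Hypothetical cost: squared distance to a goal point (not an MPC cost)."""
    return (p[0] - goal[0]) ** 2 + (p[1] - goal[1]) ** 2


target, cost = choose_end_effector_target(
    (0.0, 0.0), [(0.9, 0.1), (-1.0, 0.0)], toy_cost
)
```

In this toy setting the sample closest to the goal is selected; a contact-free motion would then move the end effector there before the local controller takes over.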

2505.12601 2026-05-18 cs.LG

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

Yang Li

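Affiliations Independent researcher

AI summary As large language models (LLMs) grow in scale and specialization, efficiently selecting the most suitable model for a given input has become a key problem. This paper revisits LLM routing and finds that a carefully tuned k-nearest neighbors (kNN) approach not only performs well across diverse tasks but even outperforms state-of-the-art learned routers. The study introduces a suite of standardized routing benchmarks and the first multimodal routing dataset, and shows that the locality of model performance in embedding space gives non-parametric methods an advantage in sample complexity, challenging the current trend toward complex architectures.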

Abstract

As large language models (LLMs) grow in scale and specialization, routing--selecting the best model for a given input--has become essential for efficient and effective deployment. While recent methods rely on complex learned routing strategies, their dependence on disparate training data and evaluation setups makes comparison and generalization difficult. In this work, we revisit LLM routing through the lens of simplicity. We show that a well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. To support systematic evaluation, we introduce a suite of standardized routing benchmarks spanning instruction-following, question-answering, and reasoning tasks, as well as the first multi-modal routing dataset involving visual inputs. Our findings reveal that the locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches. This challenges the prevailing trend toward sophisticated architectures and highlights the importance of thoroughly evaluating simple baselines before investing in complex solutions. To support reproducibility and further exploration, we will release all benchmarks and code upon publication.
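A kNN router of the kind the abstract describes can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names, the use of cosine similarity, and the toy embeddings and scores are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_route(query_emb, train_embs, train_scores, k=5):
    """Route to the model with the best mean score among the k nearest queries.

    train_embs:   embeddings of past queries with known outcomes
    train_scores: train_scores[i][m] is model m's observed quality on query i
    """
    ranked = sorted(
        range(len(train_embs)),
        key=lambda i: cosine(query_emb, train_embs[i]),
        reverse=True,
    )
    nearest = ranked[:k]
    n_models = len(train_scores[0])
    # Average each model's score over the neighborhood.
    mean_scores = [
        sum(train_scores[i][m] for i in nearest) / len(nearest)
        for m in range(n_models)
    ]
    best = max(range(n_models), key=lambda m: mean_scores[m])
    return best, mean_scores


# Toy data: model 0 does well near the first axis, model 1 near the second.
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_scores = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

The router has no trained parameters; its only tunable choices are the embedding, the similarity measure, and `k`, which is what makes it a natural baseline for learned routers.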

2505.07322 2026-05-18 cs.CV

RealRep: Generalized SDR-to-HDR Conversion via Attribute-Disentangled Representation Learning

Li Xu, Siqi Wang, Kepeng Xu, Gang He, Lin Zhang, Weiran Wang, Yu-Wing Tai

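Affiliations Xidian University; Dartmouth College

AI summary This paper proposes RealRep, a generalized SDR-to-HDR conversion framework that improves robustness to diverse real-world SDR content by learning disentangled luminance and chrominance attributes. Its core components are disentangled representation learning, a degradation-aware negative exemplar generation strategy, and DDACMNet, a lightweight two-stage mapping network that dynamically adjusts the mapping according to degradation conditions. Experiments show that RealRep outperforms existing methods in both generalization and the perceptual fidelity of HDR color reconstruction.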

Comments Published at AAAI'26 (Oral): The Annual AAAI Conference on Artificial Intelligence

Abstract

High-Dynamic-Range Wide-Color-Gamut (HDR-WCG) technology is becoming increasingly widespread, driving a growing need for converting Standard Dynamic Range (SDR) content to HDR. Existing methods primarily rely on fixed tone mapping operators, which struggle to handle the diverse appearances and degradations commonly present in real-world SDR content. To address this limitation, we propose a generalized SDR-to-HDR framework that enhances robustness by learning attribute-disentangled representations. Central to our approach is Realistic Attribute-Disentangled Representation Learning (RealRep), which explicitly disentangles luminance and chrominance components to capture intrinsic content variations across different SDR distributions. Furthermore, we design a Luma-/Chroma-aware negative exemplar generation strategy that constructs degradation-sensitive contrastive pairs, effectively modeling tone discrepancies across SDR styles. Building on these attribute-level priors, we introduce the Degradation-Domain Aware Controlled Mapping Network (DDACMNet), a lightweight, two-stage framework that performs adaptive hierarchical mapping guided by a control-aware normalization mechanism. DDACMNet dynamically modulates the mapping process via degradation-conditioned features, enabling robust adaptation across diverse degradation domains. Extensive experiments demonstrate that RealRep consistently outperforms state-of-the-art methods in both generalization and perceptually faithful HDR color gamut reconstruction.

2505.06982 2026-05-18 cs.CV

Decentralized LoRA augmented transformer with multi-scale feature learning for secured eye diagnosis

Md. Naimur Asif Borno, Md Sakib Hossain Shovon, MD Hanif Sikder, Iffat Firozy Rimi, Tahani Jaser Alahmadi, Mohammad Ali Moni

Affiliations Research Assistant, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Mechatronics Engineering, Rajshahi University of Engineering & Technology, Rajshahi 6204, Bangladesh; Researcher, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Department of Computer Science, American International University Bangladesh, Dhaka 1216, Bangladesh; Department of Computer Science, University of South Asia-Bangladesh, Dhaka 1216, Bangladesh; Department of Computer Science Engineering, Daffodil International University, Dhaka, Bangladesh; Department of Information Systems, College of Computer Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia; Faculty of Health, Medicine Behavioural Sciences, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Cyber Futures Institute, Charles Sturt University, 308 Queen St, Bathurst NSW, Australia

AI summary This paper proposes a decentralized eye-disease diagnosis framework based on the Data-efficient Image Transformer (DeiT), aiming to address data imbalance, privacy protection, spatial feature diversity, and clinical interpretability in ophthalmic medical imaging. The method combines multi-scale feature learning, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning to improve computational efficiency, data privacy, and diagnostic performance. Experiments show that the framework outperforms traditional convolutional neural networks and existing transformer models on benchmark datasets and provides interpretable diagnostic evidence through Grad-CAM++, laying a foundation for secure, scalable ophthalmic AI diagnosis systems.

Comments Published at Knowledge-Based Systems

Abstract

Accurate and privacy-preserving diagnosis of ophthalmic diseases remains a critical challenge in medical imaging, particularly given the limitations of existing deep learning models in handling data imbalance, data privacy concerns, spatial feature diversity, and clinical interpretability. This paper proposes a novel Data-efficient Image Transformer (DeiT) based framework that integrates context-aware multi-scale patch embedding, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning to address these challenges in a unified manner. The proposed model effectively captures both local and global retinal features by leveraging multi-scale patch representations with local and global attention mechanisms. LoRA integration enhances computational efficiency by reducing the number of trainable parameters, while federated learning ensures secure, decentralized training without compromising data privacy. A knowledge distillation strategy further improves generalization in data-scarce settings. Comprehensive evaluations on two benchmark datasets, OCTDL and the Eye Disease Image Dataset, demonstrate that the proposed framework consistently outperforms both traditional CNNs and state-of-the-art transformer architectures across key metrics including AUC, F1 score, and precision. Furthermore, Grad-CAM++ visualizations provide interpretable insights into model predictions, supporting clinical trust. This work establishes a strong foundation for scalable, secure, and explainable AI applications in ophthalmic diagnostics.

2504.21850 2026-05-18 cs.CV

Visual Compositional Tuning

Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Esin Tureci, Olga Russakovsky

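Affiliations Princeton University; Meta AI

AI summary This paper studies how sample complexity affects informativeness in visual instruction tuning (VIT) datasets and proposes COMPACT, a synthetic data generation method that combines multiple atomic visual capabilities in a single training example to significantly improve data efficiency. Experiments show that COMPACT matches or exceeds full-data performance while using 90% less training data, and performs strongly on multiple vision-language benchmarks. The method offers a scalable way to improve training efficiency on vision-language tasks.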

Comments Project website: https://princetonvisualai.github.io/compact/

Abstract

Visual instruction tuning (VIT) datasets have grown rapidly in scale, yet the informativeness of individual training samples has largely been overlooked. Recent dataset selection methods have shown that a small fraction of such datasets enriched with informative samples can lead to efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of sample complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Compositional Tuning), a compositional VIT data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective VIT. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Furthermore, training on the COMPACT data outperforms training on the full-scale VIT data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe to improve on vision-language tasks.

2504.09544 2026-05-18 cs.LG cs.CE cs.CV

Integrating chemical structures as treatments improves representations of microscopy images for morphological profiling

Yemin Yu, Emre Hayir, Neil Tenenholtz, Lester Mackey, Ying Wei, David Alvarez-Melis, Ava P. Amini, Alex X. Lu

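Affiliations Department of Computer Science, City University of Hong Kong; Microsoft Research; Department of Computer Science, Zhejiang University

AI summary This study proposes MICON, a framework that incorporates chemical structure information into self-supervised pre-training to improve representations of high-throughput microscopy images for more accurate morphological profiling. The authors argue that modeling compound structures as "treatments" that induce changes in cell phenotypes significantly outperforms classical hand-crafted features and existing deep learning methods. Experiments show that representation learning with chemical information performs better at identifying drug effects across experimental replicates and data sources, pointing to a new direction for representation learning on multimodal microscopy screening data.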

Comments 24 pages

Abstract

Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structures during self-supervised pre-training could improve learned representations of images from high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides small, but consistent improvements in performance and that modeling compounds specifically as treatments outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.

2504.08300 2026-05-18 cs.CL cs.AI

Large Language Models Could Be Rote Learners

Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin

Affiliations College of Computer Science and Technology, Zhejiang University; State Key Laboratory of Transvascular Implantation Devices and TIDRI; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence; School of Data Science of Engineering, East China Normal University; Second Affiliated Hospital and Liangzhu Laboratory, Zhejiang University School of Medicine; Alibaba Group

AI summary This paper examines whether the benchmark performance of large language models (LLMs) is affected by training-data contamination and argues that current benchmark-based evaluation may overestimate models' true capabilities. The authors propose TrinEval, a new evaluation framework that reformulates multiple-choice questions to reduce reliance on memorization and so assess genuine capability more accurately. Experiments show that mainstream LLMs rely on rote memorization, rather than genuine understanding and reasoning, for about 19.6% of knowledge points across multiple datasets.

Comments Work in Progress

Abstract

Benchmark-based evaluation, e.g., multiple-choice questions (MCQs) and open-ended questions (OEQs), is widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. When pre-exposed to the testing benchmark during training, less capable LLMs have been found to achieve inflated performance, thereby yielding erroneous results in LLM evaluation. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle and expose genuine capability acquisition from superficial memorization in LLM evaluation. Following this, firstly, by analyzing model performance under different memorization conditions of MCQs, we uncover a counterintuitive trend: LLMs perform worse on memorized benchmarks than on non-memorized ones, indicating the coexistence of two learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative knowledge-centric trinity format, reducing memorization while preserving inherent knowledge, enabling the evaluation of genuine capability in the presence of memorization. Extensive experiments validate the effectiveness and robustness of TrinEval in reformulating benchmarks, and the evaluation results further reveal that mainstream LLMs rely on rote memorization for an average of 19.6% of knowledge points across the MMLU and the GSM8K dataset.

2504.05451 2026-05-18 cs.CV

ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

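Affiliations UT Austin; Meta AI; Stanford University; Northeastern University

AI summary ViewBridge is a framework for learning view-invariant activity representations, designed to handle the extreme viewpoint changes found in in-the-wild videos. It uses knowledge distillation to preserve action semantics, combined with a curriculum learning strategy that gradually increases viewpoint difficulty for smooth adaptation. Experiments show that ViewBridge outperforms existing methods on two tasks across multiple datasets.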

Abstract

Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view-occlusions. We introduce a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, at inference time, the input is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks -- temporal keystep grounding and fine-grained keystep recognition -- we outperform SOTA approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/ .

2503.16589 2026-05-18 cs.LG cs.ET math.ST stat.TH

A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: Avoiding Unreliable Conclusions

Moslem Noori, Elisabetta Valiante, Thomas Van Vaerenbergh, Masoud Mohseni, Ignacio Rozada

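Affiliations 1QB Information Technologies (1QBit); Hewlett Packard Labs, Hewlett Packard Enterprise

AI summary This paper addresses performance evaluation of stochastic optimizers and proposes a statistical analysis to avoid unreliable conclusions caused by poor experiment design. It analyzes the confidence intervals of common performance metrics and their relationship to the number of repeats, and derives a lower bound on the number of repeats needed to guarantee a given metric accuracy. Based on this bound, the authors propose an algorithm that adaptively adjusts the number of repeats to improve the accuracy and reliability of evaluation. Experimental results validate the approach for benchmarking and hyperparameter tuning.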

Journal ref Physical Review Applied 25, no. 3 (2026): 034081

Abstract

A key trait of stochastic optimizers is that multiple runs of the same optimizer in attempting to solve the same problem can produce different results. As a result, their performance is evaluated over several repeats, or runs, on the problem. However, the accuracy of the estimated performance metrics depends on the number of runs and should be studied using statistical tools. We present a statistical analysis of the common metrics, and develop guidelines for experiment design to measure the optimizer's performance using these metrics to a high level of confidence and accuracy. To this end, we first discuss the confidence interval of the metrics and how they are related to the number of runs of an experiment. We then derive a lower bound on the number of repeats in order to guarantee achieving a given accuracy in the metrics. Using this bound, we propose an algorithm to adaptively adjust the number of repeats needed to ensure the accuracy of the evaluated metric. Our simulation results demonstrate the utility of our analysis and how it allows us to conduct reliable benchmarking as well as hyperparameter tuning and prevent us from drawing premature conclusions regarding the performance of stochastic optimizers.
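To make the link between accuracy and number of runs concrete, the sketch below uses Hoeffding's inequality, a standard textbook bound, as a stand-in; it is not the bound derived in the paper. It assumes the metric is the mean of independent per-run outcomes bounded in [0, 1], for example a success probability. The function names are hypothetical.

```python
import math


def required_runs(eps, delta):
    """Runs needed so that |estimate - true mean| <= eps with prob >= 1 - delta.

    Hoeffding's inequality for outcomes in [0, 1]:
        n >= ln(2 / delta) / (2 * eps**2)
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def hoeffding_halfwidth(n, delta):
    """Half-width of the (1 - delta) confidence interval after n runs."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


n_runs = required_runs(eps=0.05, delta=0.05)
halfwidth = hoeffding_halfwidth(n_runs, delta=0.05)
```

Because the required number of runs scales as 1/eps², halving the target error quadruples the number of repeats, which is why conclusions drawn from a handful of runs can be unreliable.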

2503.07518 2026-05-18 cs.CL cs.AI cs.LG

TokenButler: Token Importance is Predictable

Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Sameh Gobriel, Nilesh Jain, Mohamed S. Abdelfattah

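Affiliations Cornell University; Intel Labs

AI summary Large language models rely on the key-value cache (KV-Cache) to store token history during decoding, but as the cache grows it becomes a memory and computation bottleneck. To address this, the paper proposes TokenButler, a high-granularity, query-aware token importance predictor that dynamically selects critical tokens under a fixed budget while preserving the full KV cache. The method learns to predict low-dimensional importance queries and combines them with a projection of the cached keys for cheap scoring. Experiments show strong performance on long-context tasks and a significant inference speedup.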

Abstract

Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck: prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily dependent on the input query. Existing methods either risk quality by evicting tokens permanently or retain the full KV-Cache but rely on retrieving chunks of tokens, and many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. TokenButler predicts low-dimensional importance queries at a fixed depth stride, and combines them with a learned projection of the real KV-cache keys to score tokens cheaply, enabling dynamic per-token selection under a fixed budget while preserving the full KV cache. We train TokenButler by distilling the model's masked causal attention distributions, optimizing a lightweight predictor with minimal parameter overhead. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy where existing methods fail. Furthermore, TokenButler achieves competitive or superior performance on long-context benchmarks (RULER, LongBench), up to ≈1.6× on-GPU speedup using our proposed prediction interval with neighbor fetching, which amortizes predictor cost while maintaining accuracy within ≈1.1%, and up to 7.6× reduction in latency compared to Dense Attention with CPU offloading. Code is available: https://github.com/abdelfattah-lab/TokenButler
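The scoring-and-selection step can be illustrated with a small sketch. It only shows the general idea of scoring cached tokens with a low-dimensional query and keeping the top ones under a budget; the function name, the dot-product score, and the toy vectors are assumptions, and the predictor that would produce `importance_query` and `projected_keys` is not modeled.

```python
def select_tokens(importance_query, projected_keys, budget):
    """Pick the `budget` highest-scoring token positions for this decoding step.

    importance_query: low-dimensional query vector (assumed given)
    projected_keys:   one low-dimensional projected key per cached token
    Nothing is evicted: the full cache stays available for later steps.
    """
    # Cheap proxy score: dot product in the low-dimensional space.
    scores = [
        sum(q * k for q, k in zip(importance_query, key))
        for key in projected_keys
    ]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return positions in cache order so they can be gathered directly.
    return sorted(top[:budget]), scores


selected, scores = select_tokens(
    importance_query=[1.0, 0.0],
    projected_keys=[[0.1, 5.0], [2.0, 0.0], [0.5, 0.0], [3.0, 1.0]],
    budget=2,
)
```

Because selection is recomputed for every query, a token skipped at one step can still be chosen at a later one, which is the difference from permanent eviction.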

2503.02597 2026-05-18 cs.CV cs.AI

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

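Affiliations Sony Group Corporation, Tokyo, Japan

AI summary Recent multimodal large language models (MLLMs) have made significant progress in understanding and reasoning over multimodal information, but alignment between the vision and language modalities remains a key challenge. Working at the level of model architecture, this paper proposes modality-mutual attention (MMA), which extends causal attention into cross-modal mutual attention so that image tokens can attend to text tokens, improving the model's understanding of its inputs. The method achieves superior performance on multiple multimodal understanding benchmarks without adding parameters, and is designed to be generic and scalable.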

Comments ICML 2026. Code is available at https://github.com/sony/aki

Abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem, we present an MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance in 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.
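One plausible reading of "unlocking" causal attention is a change to the attention mask, sketched below. This is an assumption-laden illustration rather than the paper's exact design: it keeps the usual causal mask and additionally lets image positions attend to text positions that come later, while text positions and image-to-image attention stay causal. The function name and modality labels are hypothetical.

```python
def modality_mutual_mask(modalities):
    """Build a boolean attention mask; mask[i][j] is True if i may attend to j.

    modalities: per-position labels, each "image" or "text".
    """
    n = len(modalities)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i  # standard decoder-only rule
            # Unlocked entries: an image token may also see text tokens,
            # including text that appears after it in the sequence.
            image_to_text = modalities[i] == "image" and modalities[j] == "text"
            mask[i][j] = causal or image_to_text
    return mask


mask = modality_mutual_mask(["image", "image", "text", "text"])
```

Since only the mask changes, a design of this kind adds no parameters, which is consistent with the claim in the abstract.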

2502.12187 2026-05-18 cs.CL cs.FL cs.LG math.ST stat.ML stat.TH

Hallucinations are inevitable but can be made statistically negligible

Atsushi Suzuki, Yulan He, Feng Tian, Zhongyuan Wang

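Affiliations Department of Mathematics; The University of Hong Kong; Department of Informatics; King's College London; Division of Natural and Applied Sciences; Duke Kunshan University; School of Computer Science

AI summary This paper discusses the inevitability of "hallucinations" in language models, that is, the generation of nonfactual content. Although prior work has shown from a computability-theoretic perspective that any language model will hallucinate on an infinite set of inputs, this paper argues from a probabilistic perspective that hallucinations can be significantly reduced in a statistical sense, provided the quality and quantity of training data are sufficient. The authors note that, while the computability-theoretic result is of theoretical interest, the probabilistic result better matches practical needs and offers a new theoretical basis for mitigating hallucinations.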

Abstract

Hallucinations, a phenomenon where a language model (LM) generates nonfactual content, pose a significant challenge to the practical deployment of LMs. While many empirical methods have been proposed to mitigate hallucinations, recent studies established a computability-theoretic result showing that any LM will inevitably generate hallucinations on an infinite set of inputs, regardless of the quality and quantity of training datasets and the choice of the language model architecture and training and inference algorithms. Although the computability-theoretic result may seem pessimistic, its significance in practical viewpoints has remained unclear. This paper claims that those "innate" inevitability results from computability theory and diagonal argument, in principle, cannot explain practical issues of LLMs. We demonstrate this claim by presenting a positive theoretical result from a probabilistic perspective. Specifically, we prove that hallucinations can be made statistically negligible, provided that the quality and quantity of the training data are sufficient. Interestingly, our positive result coexists with the computability-theoretic result, implying that while hallucinations on an infinite set of inputs cannot be entirely eliminated, their probability can always be reduced by improving algorithms and training data. By evaluating the two seemingly contradictory results through the lens of information theory, we argue that our probability-theoretic positive result better reflects practical considerations than the computability-theoretic negative result.

2501.19128 2026-05-18 cs.LG cs.AI

Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Wenyun Li, Wenjie Huang, Chen Sun

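Affiliations Department of Mathematics, The University of Hong Kong (HKU); Department of Data and Systems Engineering, HKU; Musketeers Foundation Institute of Data Science, HKU

AI summary In reinforcement learning, sparse reward signals make it difficult to learn a reward function. This paper proposes a semi-supervised approach that combines non-zero-reward transitions with data augmentation and uses the large number of zero-reward transitions to learn trajectory representations, improving reward shaping. Experiments show that the method outperforms supervised approaches on Atari and robotic manipulation tasks. In sparse-reward environments in particular, its peak score reaches up to twice that of supervised methods.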

Abstract

In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach in this paper performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the Semi-Supervised Learning (SSL) technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines. The proposed double entropy data augmentation enhances performance, showcasing a 15.8% increase in best score over other augmentation methods.

2501.17116 2026-05-18 cs.LG cs.CL

Optimizing Large Language Model Training Using FP4 Quantization

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

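Affiliations University of Science and Technology of China; Microsoft Research Asia; Microsoft SIGMA Team

AI summary As the computational demands of training large language models (LLMs) keep growing, improving training efficiency has become a key problem. This paper proposes the first FP4-quantized training framework for LLMs. A differentiable quantization estimator and an outlier clamping and compensation strategy address the large quantization errors and limited representational capacity of FP4, while mixed-precision training and vector-wise quantization keep training stable. Experiments show that the framework achieves accuracy close to BF16 and FP8 while efficiently supporting the training of very large models.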

Journal ref Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:62937-62957, 2025

Abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
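To show why FP4 has limited representational capacity, the sketch below rounds one vector to a 4-bit floating-point grid using a single scale per vector. It assumes the E2M1 format, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, and plain absmax scaling with round-to-nearest; the abstract does not specify these details, and the differentiable estimator and outlier clamping from the paper are not modeled. Function names are hypothetical.

```python
import math

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_vector_fp4(vec):
    """Vector-wise absmax quantization: one scale shared by the whole vector."""
    absmax = max(abs(x) for x in vec)
    if absmax == 0.0:
        return [0.0] * len(vec), 1.0
    # Map the largest magnitude in the vector onto the largest FP4 value.
    scale = E2M1_MAGNITUDES[-1] / absmax
    codes = []
    for x in vec:
        scaled = abs(x) * scale
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - scaled))
        codes.append(math.copysign(nearest, x))
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values from FP4 codes and the shared scale."""
    return [c / scale for c in codes]


codes, scale = quantize_vector_fp4([0.5, -2.0, 0.1, 1.0])
restored = dequantize(codes, scale)
```

In this example 0.1 is restored as roughly 0.167, a large relative error for a small value sharing a scale with a much larger one. A single outlier would stretch the scale further, which motivates the outlier handling the abstract mentions.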

2412.02271 2026-05-18 cs.CL

The MediaSpin Dataset: Post-Publication News Headline Edits Annotated for Media Bias

Preetika Verma, Kokil Jaidka

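Affiliations Carnegie Mellon University; National University of Singapore; NUS Centre for Trusted Internet and Community

AI summary This paper introduces MediaSpin, a large-scale language resource that records how major news outlets modify headlines after publication, together with the companion MediaSpin-in-the-Wild dataset for analyzing social media engagement with the revised headlines. The dataset contains 78,910 headline pairs annotated for 13 types of media bias, covering both subjective and objective forms, with annotation performed by an expert-validated large language model pipeline. The study demonstrates applications in cross-national analysis, bias classification, and social media behavior analysis, revealing regional framing asymmetries, measurable linguistic markers, and higher engagement with biased content.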

Comments 8 pages, 3 figures, 8 tables. Accepted at AAAI ICWSM 2026. We updated the paper title from "MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News Headlines" to "The MediaSpin Dataset: Post-Publication News Headline Edits Annotated for Media Bias".

Abstract

We present MediaSpin, a large-scale language resource capturing how major news outlets modify headlines after publication, and MediaSpin-in-the-Wild, a complementary dataset linking these revised headlines to their downstream engagement on social media. The increasing editability of online news headlines offers new opportunities to study linguistic framing and bias through the lens of editorial revisions. The dataset contains 78,910 headline pairs annotated for 13 types of media bias, grounded in established media-bias taxonomies, covering both subjective (e.g., sensationalism, spin) and objective (e.g., omission, slant) forms, with annotation conducted through a human-supervised large-language-model pipeline with expert validation and quality control. We describe the annotation schema and demonstrate three downstream applications: (1) cross-national analysis of how country references are added or removed during editing, (2) transformer-based bias classification at both binary and fine-grained levels, and (3) behavioral analysis of biased headlines on X (Twitter) using 180,786 news-related tweets from 819 consenting users. The results reveal regional asymmetries in representational framing, measurable linguistic markers, and consistently higher engagement with biased content. MediaSpin and MediaSpin-in-the-Wild together provide a reproducible benchmark for bias detection and the study of editorial and behavioral dynamics in contemporary media ecosystems.

2410.01990 2026-05-18 cs.LG cs.CE

Deep Learning Alternatives of the Kolmogorov Superposition Theorem

Leonardo Ferreira Guilhoto, Paris Perdikaris

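Affiliations Graduate Group in Applied Mathematics and Computational Science; University of Pennsylvania; Department of Mechanical Engineering & Applied Mechanics

AI summary This paper explores alternative formulations of the Kolmogorov Superposition Theorem (KST) as a foundation for neural network design. The original KST is mathematically elegant but poses practical challenges because it offers limited insight into the structure of the inner and outer functions and introduces a large number of unknown variables. The authors propose ActNet, a scalable deep learning model that overcomes many drawbacks of the original formulation, and evaluate it within the Physics-Informed Neural Networks (PINNs) framework. Results show that ActNet outperforms KST-based Kolmogorov-Arnold Networks on tasks such as simulating partial differential equations and is competitive with traditional multilayer perceptrons.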

Journal ref Guilhoto, Leonardo Ferreira, and Paris Perdikaris. "Deep Learning Alternatives Of The Kolmogorov Superposition Theorem." The Thirteenth International Conference on Learning Representations (ICLR 2025)

Abstract

This paper explores alternative formulations of the Kolmogorov Superposition Theorem (KST) as a foundation for neural network design. The original KST formulation, while mathematically elegant, presents practical challenges due to its limited insight into the structure of inner and outer functions and the large number of unknown variables it introduces. Kolmogorov-Arnold Networks (KANs) leverage KST for function approximation, but they have faced scrutiny due to mixed results compared to traditional multilayer perceptrons (MLPs) and practical limitations imposed by the original KST formulation. To address these issues, we introduce ActNet, a scalable deep learning model that builds on the KST and overcomes many of the drawbacks of Kolmogorov's original formulation. We evaluate ActNet in the context of Physics-Informed Neural Networks (PINNs), a framework well-suited for leveraging KST's strengths in low-dimensional function approximation, particularly for simulating partial differential equations (PDEs). In this challenging setting, where models must learn latent functions without direct measurements, ActNet consistently outperforms KANs across multiple benchmarks and is competitive against the current best MLP-based approaches. These results present ActNet as a promising new direction for KST-based deep learning applications, particularly in scientific computing and PDE simulation tasks.

2409.11022 2026-05-18 cs.CL cs.AI

DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

Hanjun Luo, Yingbin Jin, Xinfeng Li, Xuecheng Liu, Ruizhe Chen, Tong Shang, Kun Wang, Qingsong Wen, Zuozhu Liu

Affiliations New York University Abu Dhabi; Zhejiang University; The Hong Kong Polytechnic University; Nanyang Technological University; University of Electronic Science and Technology of China; Texas A&M University; Squirrel AI

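AI summary As large language models (LLMs) are increasingly applied to named entity recognition (NER), existing datasets no longer meet the needs of LLM-based methods in corpus selection and design logic. This paper presents DynamicNER, a dynamic, multilingual, fine-grained NER dataset designed for LLMs, in which the same entity can take different entity types in different contexts. It covers 8 languages and 155 entity types across a wide range of domains. The paper also proposes CascadeNER, which uses a two-stage strategy and lightweight LLMs for more efficient fine-grained recognition. Experiments show that DynamicNER is an effective evaluation benchmark for LLM-based NER.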

Comments This paper is accepted by EMNLP 2025 Main Conference

Abstract

Advances in Large Language Models (LLMs) have spurred growing interest in their application to Named Entity Recognition (NER). However, existing datasets are primarily designed for traditional machine learning methods and are inadequate for LLM-based methods in terms of corpus selection and overall dataset design logic. Moreover, the fixed and relatively coarse-grained entity categorization prevalent in existing datasets fails to adequately assess the superior generalization and contextual understanding capabilities of LLM-based methods, thereby hindering a comprehensive demonstration of their broad application prospects. To address these limitations, we propose DynamicNER, the first NER dataset designed for LLM-based methods with dynamic categorization, which introduces various entity types and entity type lists for the same entity in different contexts to better leverage the generalization ability of LLM-based NER. The dataset is also multilingual and multi-granular, covering 8 languages and 155 entity types, with corpora spanning a diverse range of domains. Furthermore, we introduce CascadeNER, a novel NER method based on a two-stage strategy and lightweight LLMs, achieving higher accuracy on fine-grained tasks while requiring fewer computational resources. Experiments show that DynamicNER serves as a robust and effective benchmark for LLM-based NER methods. We also analyze traditional methods and LLM-based methods on our dataset. Our code and dataset are openly available at https://github.com/Astarojth/DynamicNER.

2409.03897 2026-05-18 cs.LG cs.DC

On the Convergence Rates of Federated Q-Learning across Heterogeneous Environments

Leo Muxing Wang, Pengkun Yang, Lili Su

Affiliations Northeastern University; Tsinghua University

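AI summary This paper studies the convergence rate of federated Q-learning across heterogeneous environments, examining how the communication frequency and the number of agents affect convergence speed when multiple agents cooperatively learn an optimal Q-function. Increasing the number of agents yields a linear speed-up, but a longer communication interval causes significant performance degradation, and this effect is fundamental. The paper also reveals a two-phase pattern in convergence and proposes adjusting the step size to speed up overall convergence.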

Abstract

Large-scale multi-agent systems are often deployed across wide geographic areas, where agents interact with heterogeneous environments. There is an emerging interest in understanding the role of heterogeneity in the performance of the federated versions of classic reinforcement learning algorithms. In this paper, we study synchronous federated Q-learning, which aims to learn an optimal Q-function by having $K$ agents average their local Q-estimates every $E$ iterations. We observe an interesting phenomenon in the convergence speeds in terms of $K$ and $E$. Similar to the homogeneous environment settings, there is a linear speed-up with respect to $K$ in reducing the errors that arise from sampling randomness. Yet, in sharp contrast to the homogeneous settings, $E>1$ leads to significant performance degradation. Specifically, we provide a fine-grained characterization of the error evolution in the presence of environmental heterogeneity, which decays to zero as the number of iterations $T$ increases. The slow convergence of having $E>1$ turns out to be fundamental rather than an artifact of our analysis. We prove that, for a wide range of stepsizes, the $\ell_{\infty}$ norm of the error cannot decay faster than $\Theta(E/T)$. In addition, our experiments demonstrate that the convergence exhibits an interesting two-phase phenomenon. For any given stepsize, there is a sharp phase transition in the convergence: the error decays rapidly in the beginning yet later bounces up and stabilizes. Provided that the phase-transition time can be estimated, choosing different stepsizes for the two phases leads to faster overall convergence.
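
The averaging scheme described in the abstract can be sketched as follows. This is a minimal tabular illustration under stated assumptions, not the authors' code: the function name `federated_q_learning`, the per-agent transition kernels (the source of heterogeneity here), the shared reward table, and all parameter values are hypothetical choices made for the example.

```python
import random


def federated_q_learning(kernels, rewards, gamma=0.9, eta=0.1, E=5, T=200, seed=0):
    """Synchronous federated Q-learning sketch (illustrative assumptions only).

    kernels[k][s][a] is agent k's list of next-state probabilities.
    rewards[s][a] is a reward table assumed to be shared by all agents.
    """
    rng = random.Random(seed)
    K, S, A = len(kernels), len(rewards), len(rewards[0])
    # One local Q-table per agent, all initialised to zero.
    Q = [[[0.0] * A for _ in range(S)] for _ in range(K)]
    for t in range(1, T + 1):
        for k in range(K):
            new_q = [row[:] for row in Q[k]]
            # Synchronous update: every (state, action) pair is refreshed
            # using one sample from agent k's own environment.
            for s in range(S):
                for a in range(A):
                    s_next = rng.choices(range(S), weights=kernels[k][s][a])[0]
                    target = rewards[s][a] + gamma * max(Q[k][s_next])
                    new_q[s][a] = (1 - eta) * Q[k][s][a] + eta * target
            Q[k] = new_q
        # Every E iterations the agents average their local estimates.
        if t % E == 0:
            avg = [[sum(Q[k][s][a] for k in range(K)) / K for a in range(A)]
                   for s in range(S)]
            Q = [[row[:] for row in avg] for _ in range(K)]
    return Q
```

Setting `E=1` averages after every iteration; larger `E` lets the local tables drift toward each agent's own environment between synchronisations, which is the regime the abstract identifies as slower to converge.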

2408.07331 2026-05-18 cs.LG

RSEA-MVGNN: Multi-View Graph Neural Network with Reliable Structural Enhancement and Aggregation

Junyu Chen, Long Shi, Badong Chen

Affiliations Financial Intelligence and Financial Engineering Key Laboratory of Sichuan Province, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics; Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University

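AI summary The paper proposes RSEA-MVGNN, a multi-view graph neural network that fuses multi-view graph data whose views have different graph structure features. The method estimates the uncertainty of each view with subjective logic and uses a de-correlation algorithm for reliable structural enhancement, which increases feature diversity. The model also evaluates view quality from each view's belief and uncertainty so that high-quality views dominate GNN aggregation. Experiments show that the method outperforms existing state-of-the-art methods on several real-world datasets.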

Journal ref Information Fusion 121 (2025) 103143

Abstract

Graph Neural Networks (GNNs) have exhibited remarkable efficacy in learning from multi-view graph data. In the framework of multi-view graph neural networks, a critical challenge lies in effectively combining diverse views, where each view has distinct graph structure features (GSFs). Existing approaches to this challenge primarily focus on two aspects: 1) prioritizing the most important GSFs, 2) utilizing GNNs for feature aggregation. However, prioritizing the most important GSFs can lead to limited feature diversity, and existing GNN-based aggregation strategies treat all views equally without considering view quality. To address these issues, we propose a novel Multi-View Graph Neural Network with Reliable Structural Enhancement and Aggregation (RSEA-MVGNN). First, we estimate view-specific uncertainty using subjective logic. Based on this uncertainty, we design reliable structural enhancement via a feature de-correlation algorithm. This approach enables each enhancement to focus on different GSFs, thereby achieving diverse feature representation in the enhanced structure. Second, the model learns view-specific beliefs and uncertainty as opinions, which are used to evaluate view quality. Based on these opinions, the model enables high-quality views to dominate GNN aggregation, thereby facilitating representation learning. Experiments on five real-world datasets demonstrate that RSEA-MVGNN outperforms several state-of-the-art GNN-based methods.

2407.02039 2026-05-18 cs.CL

Prompt Stability Scoring for Text Annotation with Large Language Models

Christopher Barrie, Elli Palaiologou, Petter Törnberg

Affiliations Department of Sociology, New York University; Independent Researcher; Institute for Logic, Language, and Computation, University of Amsterdam

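AI summary As large language models are increasingly used for text annotation, the reproducibility of model outputs may be affected by small changes in prompt design. This paper proposes a general framework for assessing prompt stability that adapts intra- and inter-coder reliability scoring, defines the Prompt Stability Score (PSS), and provides a corresponding Python package. Experiments on several datasets demonstrate the approach, and the paper offers practical recommendations for applied researchers on improving annotation stability.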

Comments 39 pages, 5 figures

Abstract

Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad hoc and task-specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package, promptstability, for its estimation. Using six different datasets and twelve outcomes, we classify ~3.1m rows of data and ~300m input tokens to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.

2406.18944 2026-05-18 cs.CV cs.AI cs.CR

Rethinking and Red-Teaming Protective Perturbation in Personalized Diffusion Models

Yixin Liu, Ruoxi Chen, Xun Chen, Lichao Sun

Affiliations Lehigh University; Lehigh University, Computer Science Engineering, Bethlehem, PA, USA; Independent Researcher; Independent Researcher, Fremont, California, USA

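AI summary Personalized diffusion models (PDMs) excel at generating images of specific subjects from small amounts of data, but they are highly sensitive to minor adversarial perturbations, so performance degrades significantly when they are fine-tuned on corrupted data. This paper analyzes the fine-tuning process of PDMs through the lens of shortcut learning, shows that adversarial perturbations induce a latent semantic misalignment in the CLIP embedding space, and proposes a systematic countermeasure framework consisting of image purification and contrastive decoupling learning that improves robustness and generalization.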

Comments Code is available at https://github.com/liuyixin-louis/DiffShortcut

Abstract

Personalized diffusion models (PDMs) have become prominent for adapting pre-trained text-to-image models to generate images of specific subjects using minimal training data. However, PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. These vulnerabilities are exploited to create protective perturbations that prevent unauthorized image generation. Existing purification methods attempt to red-team the protective perturbation to break the protection but often over-purify images, resulting in information loss. In this work, we conduct an in-depth analysis of the fine-tuning process of PDMs through the lens of shortcut learning. We hypothesize and empirically demonstrate that adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space. This misalignment causes the model to erroneously associate noisy patterns with unique identifiers during fine-tuning, resulting in poor generalization. Based on these insights, we propose a systematic red-teaming framework that includes data purification and contrastive decoupling learning. We first employ off-the-shelf image restoration techniques to realign images with their original semantic content in latent space. Then, we introduce contrastive decoupling learning with noise tokens to decouple the learning of personalized concepts from spurious noise patterns. Our study not only uncovers shortcut learning vulnerabilities in PDMs but also provides a thorough evaluation framework for developing stronger protection. Our extensive evaluation demonstrates its advantages over existing purification methods and its robustness against adaptive perturbations.

2404.03099 2026-05-18 cs.LG cs.AI cs.CE cs.IT math.IT stat.ML

Composite Bayesian Optimization In Function Spaces Using NEON -- Neural Epistemic Operator Networks

Leonardo Ferreira Guilhoto, Paris Perdikaris

Affiliations Graduate Group in Applied Mathematics and Computational Science; University of Pennsylvania; Department of Mechanical Engineering and Applied Mechanics

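AI summary This paper proposes NEON, a neural network architecture for making predictions with uncertainty in infinite-dimensional function spaces using far fewer parameters than deep ensembles of comparable performance. The study focuses on composite Bayesian optimization, i.e., optimizing a function composed of an unknown map and a known function, and experiments show that NEON achieves leading optimization performance in several scenarios while substantially reducing model complexity.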

Journal ref Guilhoto, Leonardo Ferreira, and Paris Perdikaris. "Composite Bayesian optimization in function spaces using NEON - Neural Epistemic Operator Networks." Scientific Reports 14.1 (2024): 29199

Abstract

Operator learning is a rising field of scientific computing where inputs or outputs of a machine learning model are functions defined in infinite-dimensional spaces. In this paper, we introduce NEON (Neural Epistemic Operator Networks), an architecture for generating predictions with uncertainty using a single operator network backbone, which has orders of magnitude fewer trainable parameters than deep ensembles of comparable performance. We showcase the utility of this method for sequential decision-making by examining the problem of composite Bayesian Optimization (BO), where we aim to optimize a function $f=g\circ h$, in which $h:X\to C(\mathcal{Y},\mathbb{R}^{d_s})$ is an unknown map which outputs elements of a function space, and $g: C(\mathcal{Y},\mathbb{R}^{d_s})\to \mathbb{R}$ is a known and cheap-to-compute functional. By comparing our approach to other state-of-the-art methods on toy and real-world scenarios, we demonstrate that NEON achieves state-of-the-art performance while requiring orders of magnitude fewer trainable parameters.

2403.13805 2026-05-18 cs.CV cs.AI cs.LG

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Affiliations Shanghai Jiao Tong University; Shanghai AI Laboratory; The Chinese University of Hong Kong; MThreads, Inc.; Nanyang Technological University

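AI summary This paper proposes RAR, a method for improving multimodal large language models (MLLMs) on fine-grained and few-shot visual recognition. RAR combines CLIP's multimodal retrieval ability with the rich knowledge of MLLMs: it builds a multimodal retriever that extends the model beyond its context window and, at inference time, retrieves relevant category information for the MLLM to rank and predict from. The method addresses the drop in MLLM performance when the number of categories is large and achieves significant gains on several fine-grained and zero-shot recognition benchmarks.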

Comments Project: https://github.com/Liuziyu77/RAR

Abstract

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders its precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines as the number of categories increases, primarily due to growing complexity and the constraints of a limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We first establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
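
The retrieve-then-rank flow in the abstract can be sketched in a few lines. This is a hedged illustration rather than the RAR implementation: the embeddings are plain lists standing in for CLIP features, `rank_fn` is a hypothetical callable standing in for the MLLM ranking step, and every name below is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, memory, k):
    """Return the k category names whose stored embedding is closest to the query."""
    scored = sorted(memory.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]


def retrieve_and_rank(query, memory, k, rank_fn):
    """Retrieve k candidates from memory, then let rank_fn reorder them.

    rank_fn is a placeholder for a model call; here it is any function
    mapping a candidate list to a reordered list.
    """
    candidates = retrieve_top_k(query, memory, k)
    return rank_fn(candidates)
```

The point of the split is that only the `k` retrieved names need to fit in the ranker's context, regardless of how many categories the memory holds.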

2402.10380 2026-05-18 cs.LG cs.AI cs.CL

Subgraph-level Universal Prompt Tuning

Junhyun Lee, Wooseong Yang, Jaewoo Kang

Affiliations Korea University; University of Illinois at Chicago

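AI summary Adapting graph neural networks pre-trained with different strategies remains a challenge. This paper proposes Subgraph-level Universal Prompt Tuning (SUPT), which assigns prompt features at the subgraph level, preserving the method's universality while greatly reducing the number of tuned parameters. Experiments show that SUPT performs well on a variety of downstream tasks, with an average improvement of more than 6.6% in few-shot scenarios.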

Journal ref Information Sciences 749 (2026) 123516

Abstract

In the evolving landscape of machine learning, the adaptation of pre-trained models through prompt tuning has become increasingly prominent. This trend is particularly observable in the graph domain, where diverse pre-training strategies present unique challenges in developing effective prompt-based tuning methods for graph neural networks. Previous approaches have been limited, focusing on specialized prompting functions tailored to models with edge prediction pre-training tasks. These methods, however, suffer from a lack of generalizability across different pre-training strategies. Recently, a simple prompt tuning method has been designed for any pre-training strategy, functioning within the input graph's feature space. This allows it to theoretically emulate any type of prompting function, thereby significantly increasing its versatility for a range of downstream applications. Nevertheless, the capacity of such simple prompts to fully grasp the complex contexts found in graphs remains an open question, necessitating further investigation. Addressing this challenge, our work introduces the Subgraph-level Universal Prompt Tuning (SUPT) approach, focusing on the detailed context within subgraphs. In SUPT, prompt features are assigned at the subgraph level, preserving the method's universal capability. SUPT requires far fewer tuning parameters than fine-tuning-based methods while outperforming them in 42 out of 45 full-shot scenario experiments, with an average improvement of over 2.5%. In few-shot scenarios, it excels in 41 out of 45 experiments, achieving an average performance increase of more than 6.6%.

2311.03658 2026-05-18 cs.CL cs.AI cs.LG stat.ML

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, Victor Veitch

Affiliations University of Chicago

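AI summary This paper examines the "linear representation hypothesis", the idea that high-level concepts are represented as linear directions in a representation space. It gives two formalizations of "linear representation", one in the output (word) space and one in the input (sentence) space. By introducing a causal inner product, the authors obtain a non-Euclidean inner product structure that unifies the notions of linear representation and can be used to build probes and steering vectors. Experiments show that linear representations of concepts do exist in large language models and that the choice of inner product plays a fundamental role in interpreting and controlling the model.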

Comments Accepted for a presentation at ICML 2024 and an oral presentation at NeurIPS 2023 Workshop on Causal Representation Learning. Code is available at https://github.com/KihoPark/linear_rep_geometry

Journal ref In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

Abstract

Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.

2212.12130 2026-05-18 cs.CV

Learning to Detect and Segment for Open Vocabulary Object Detection

Tao Wang, Nan Li

Affiliations Sichuan University; University of California San Diego

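AI summary This work addresses detection and segmentation in open vocabulary object detection and proposes CondHead, a dynamic network design that improves generalization to novel categories. The core idea is to conditionally parameterize the network heads on semantic embeddings so that the model learns class-specific knowledge, yielding more accurate box regression and segmentation prediction. The method significantly improves existing open vocabulary detectors with very little computational overhead.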

Comments We apologize that author Nan Li was not on the published version due to the CVPR 2023 policy that authors cannot be added after the abstract deadline

Abstract

Open vocabulary object detection has been greatly advanced by the recent development of vision-language pretrained models, which help recognize novel objects with only semantic categories. Prior works mainly focus on transferring knowledge to object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize box regression and mask segmentation to the open vocabulary setting. The core idea is to conditionally parameterize the network heads on semantic embeddings, so that the model is guided with class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads, the dynamically aggregated head and the dynamically generated head. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such a conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to state-of-the-art open vocabulary object detection methods with very minor overhead, e.g., it surpasses a RegionClip model by 3.0 detection AP on novel categories, with only 1.1% more computation.

1911.05467 2026-05-18 cs.LG cs.NA math.NA

ChebNet: Efficient and Stable Constructions of Deep Neural Networks with Rectified Power Units via Chebyshev Approximations

Shanshan Tang, Bo Li, Haijun Yu

Affiliations Software Development Center, Industrial and Commercial Bank of China; Hisilicon Semiconductor and Component Business Dept. (2012 Labs), Huawei Technologies Co., Ltd; NCMIS & LSEC, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Beijing; School of Mathematical Sciences, University of Chinese Academy of Sciences

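AI summary This paper proposes ChebNet, an efficient and stable construction of deep neural networks based on Chebyshev polynomial approximation, for approximating smooth functions. Compared with RePU networks built from power series approximations, ChebNet uses a hierarchical Chebyshev approximation structure in the frequency domain to obtain a more stable and computationally efficient construction. Experiments show that ChebNet matches the approximation quality of the power series approach, is more stable, and can be fine-tuned to obtain better results, offering a practical way to approximate smooth functions efficiently.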

Comments 6 figures, 3 tables, to appear in Communications in Mathematics and Statistics

Journal ref Communications in Mathematics and Statistics, 2024

Abstract

In a previous study [B. Li, S. Tang and H. Yu, Commun. Comput. Phy. 27(2):379-411, 2020], it is shown that deep neural networks built with rectified power units (RePU) as activation functions can give better approximations for sufficiently smooth functions than those built with rectified linear units, by converting polynomial approximations using power series into deep neural networks with optimal complexity and no approximation error. However, in practice, power series approximations are not easy to obtain due to the associated stability issue. In this paper, we propose a new and more stable way to construct RePU deep neural networks based on Chebyshev polynomial approximations. By using a hierarchical structure of Chebyshev polynomial approximation in the frequency domain, we obtain an efficient and stable deep neural network construction, which we call ChebNet. The approximation of smooth functions by ChebNets is no worse than the approximation by deep RePU nets using power series. At the same time, ChebNets are much more stable. Numerical results show that the constructed ChebNets can be further fine-tuned to obtain much better results than those obtained by tuning deep RePU nets constructed by the power series approach. As spectral accuracy is hard to obtain by direct training of deep neural networks, ChebNets provide a practical way to obtain it, and they are expected to be useful in real applications that require efficient approximations of smooth functions.
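
As background for the Chebyshev approximations the abstract builds on, the following sketch computes Chebyshev interpolation coefficients on [-1, 1] and evaluates the series with the Clenshaw recurrence. It illustrates only the polynomial approximation step, not the paper's conversion into a RePU network; the function names and the choice of 12 nodes are assumptions made for the example.

```python
import math


def chebyshev_coefficients(f, n):
    """Coefficients c_0..c_{n-1} of the degree n-1 Chebyshev interpolant of f.

    Uses the n Chebyshev nodes x_k = cos(pi * (k + 0.5) / n) on [-1, 1].
    The interpolant is c_0 / 2 + sum_{j >= 1} c_j * T_j(x).
    """
    coeffs = []
    for j in range(n):
        total = 0.0
        for k in range(n):
            theta = math.pi * (k + 0.5) / n
            total += f(math.cos(theta)) * math.cos(j * theta)
        coeffs.append(2.0 * total / n)
    return coeffs


def clenshaw(coeffs, x):
    """Evaluate the Chebyshev series at x with the Clenshaw recurrence."""
    b1, b2 = 0.0, 0.0
    # Run the recurrence from the highest coefficient down to c_1.
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * x * b1 - b2 + c, b1
    return x * b1 - b2 + 0.5 * coeffs[0]
```

For a smooth function such as `math.exp`, a dozen coefficients already give an error far below single precision, which is the kind of spectral accuracy the abstract refers to.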

2605.16255 2026-05-18 cs.DC cs.AI

Designing Datacenter Power Delivery Hierarchies for the AI Era

Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare, Chaojie Zhang, Ricardo Bianchini

Affiliations Stanford University; Microsoft Azure Research

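AI summary This paper studies the challenges of designing datacenter power delivery hierarchies in the AI era and presents an evaluation framework that combines throughput, power, and cost metrics to analyze how multi-resource stranding affects deployable capacity, capital expenditure, and performance.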

Abstract

Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1MW per deployment by 2027. This poses a major challenge for datacenter power delivery designers. As power densities increase, a datacenter designed for a different target density may strand power, i.e., may be unable to use all the power that its delivery hierarchy has provisioned. Designs must remain efficient over long datacenter lifetimes and multiple hardware generations. Power utilization is particularly important as grid power capacity is a scarce resource in the AI era. Designing an efficient power delivery hierarchy for the long run is difficult because rack placement feasibility, workload impact, and cost depend jointly on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix. Moreover, these factors evolve over time, have inter-dependencies across multiple resource dimensions, and generally do not lend themselves to closed-form analysis. To address this challenge, we develop a framework for evaluating datacenter power delivery designs using throughput, power, and cost metrics over realistic arrival, oversubscription, and decommissioning sequences. The framework combines projection models for GPU, compute, and storage deployments with operational factors grounded in production data from Microsoft Azure. Our results show that multi-resource stranding materially changes deployable capacity, effective capital expenditure, and delivered performance, and quantify how rising density from rack- and pod-scale AI systems shapes these outcomes. For AI datacenter design, the relevant planning objective is not installed megawatts, but deployable capacity over time.

2605.16245 2026-05-18 cs.CY cs.AI cs.CL cs.LG cs.SI

AI-Mediated Communication Can Steer Collective Opinion

Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter

Affiliations Hasso Plattner Institute; Oxford Internet Institute, University of Oxford; Weizenbaum Institute

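AI summary This paper studies how AI that mediates human-to-human communication affects collective opinion formation. Through empirical and theoretical analyses, it shows how directional biases introduced by AI can be amplified through the network and shift collective opinion, and it investigates whether platforms can control such biases.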

Abstract

Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions and shape individuals' opinions during human-AI interactions, less attention has been paid to its influence on collective opinion formation when mediating human-to-human communication. We address this gap via a combination of empirical and theoretical analyses. We show empirically that LLMs from multiple popular families introduce directional biases when instructed to edit human-written texts on contested topics, for example, nudging texts in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users on a social network, transforming the opinions they express and perceive. By analytically characterizing the equilibrium of this model and performing simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion in their direction. In light of these findings, we investigate whether such biases are controllable by online platforms. We audit the "Explain this post" feature on X and find evidence of pro-life bias in Grok's outputs on abortion-related content, which we trace back to specific design choices. We conclude with a discussion of the broader implications of our findings in relation to ongoing legislative efforts in the European Union.
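
To make the amplification claim concrete, here is a toy sketch of opinion dynamics with a mediating bias. It is not the paper's model, whose equations are not given in this listing. It assumes a Friedkin-Johnsen-style update in which each user mixes their initial opinion with a weighted average of neighbours' expressed opinions, and a hypothetical mediator adds a constant `bias` to every expressed opinion. All names and values are illustrative.

```python
def mediated_opinion_dynamics(initial, weights, bias, stubbornness=0.2, steps=500):
    """Iterate a toy opinion model with a biased mediator (illustrative only).

    initial:      list of initial opinions, one per user
    weights:      row-stochastic matrix; weights[i][j] is how much i listens to j
    bias:         constant shift the mediator adds to each expressed opinion
    stubbornness: weight each user keeps on their own initial opinion
    """
    n = len(initial)
    x = list(initial)
    for _ in range(steps):
        # The mediator transforms what every user expresses.
        expressed = [xi + bias for xi in x]
        x = [
            stubbornness * initial[i]
            + (1 - stubbornness) * sum(weights[i][j] * expressed[j] for j in range(n))
            for i in range(n)
        ]
    return x
```

In this toy setting with uniform weights, the equilibrium mean opinion moves by `(1 - stubbornness) * bias / stubbornness`, so a per-message shift of 0.05 with `stubbornness=0.2` moves the mean by 0.2, four times the per-message bias.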

2605.16230 2026-05-18 cond-mat.mtrl-sci cs.LG

Universal Magnetic Structure Prediction from Atomic Coordinates with Near-Experimental Accuracy

Abhijatmedhi Chotrattanapituk, Ryotaro Okabe, Eunbi Rha, Mariya Al-Hinai, Eugene Jiang, Daniel Pajerowski, Yongqiang Cheng, Joshua J. Turner, Mingda Li

Affiliations Quantum Measurement Group, MIT, Cambridge, MA 02139, USA; Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139, USA; Department of Chemistry, MIT, Cambridge, MA 02139, USA; Department of Nuclear Science and Engineering, MIT, Cambridge, MA 02139, USA; Department of Physics, MIT, Cambridge, MA 02139, USA; Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; SLAC National Accelerator Laboratory, Menlo Park, CA 94025, USA

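AI summary This paper presents the magnetic structure network (MSN), which predicts magnetic structures directly from atomic crystal structures. It uses the primitive modulated structure representation (PMSR) to encode modulated structures in a unified way, achieving high-accuracy magnetic structure prediction and offering a new approach to magnetic materials discovery.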

Comments 9 pages, 3 figures

Abstract

Magnetic order is a fundamental property of materials, governing collective behavior and enabling a broad range of functionalities. Yet magnetic structure remains difficult to determine: experiments are costly and specialized, while first-principles methods often struggle with the non-collinear and incommensurate orders found in real materials. Here we introduce the magnetic structure network (MSN), an E(3)-equivariant graph neural network that predicts both collinear and non-collinear magnetic structures directly from atomic crystal structures, trained on experimentally determined structures from MAGNDATA. By proposing the primitive modulated structure representation (PMSR), we are able to encode commensurate and incommensurate structures in a unified way without symmetry assumptions. The model achieves strong performance across all modulation components and reconstructs experimental magnetic structures with high fidelity. Our approach provides a scalable framework for rapid magnetic structure prediction and opens a route to data-driven discovery of magnetic materials.