arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.05676 2026-05-08 cs.CL cs.AI

Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning

Bing Wang, Ximing Li, Changchun Li, Jinjin Chi, Gang Niu, Masashi Sugiyama

Affiliations College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University; RIKEN Center for Advanced Intelligence Project; Graduate School of Frontier Sciences, University of Tokyo

AI summary The paper shows empirically that existing methods still suffer from cross-task interference and proposes BADIT, which decomposes LLM parameters into orthogonal high-singular-value LoRA experts and maintains their orthogonality through spherical clustering. Experiments show that it outperforms existing methods in multi-task instruct-tuning.

Comments Accepted by ICML 2026. 25 pages, 13 figures. Code: https://github.com/wangbing1416/BADIT

English abstract

Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, caused by conflicting gradients over parameters shared among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that cross-task interference persists in existing solutions because many parameters are still shared across tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co-activated, and that co-activated parameters naturally organize into base groups. This motivates the analogy that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, BADIT decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT outperforms SOTA methods and mitigates cross-task interference.
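
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of generic spherical k-means (clustering unit vectors by cosine similarity), the kind of routine that could group rank-1 directions into experts. The function names, the toy vectors, and the first-k initialization are assumptions made for illustration, not BADIT's implementation.

```python
import math


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def cosine(u, c):
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, c))


def spherical_kmeans(vectors, k, iters=10):
    """Cluster directions on the unit sphere by cosine similarity."""
    unit = [normalize(v) for v in vectors]
    # Assumption: deterministic init from the first k vectors.
    centroids = [list(u) for u in unit[:k]]
    labels = [0] * len(unit)
    for _ in range(iters):
        # Assign each direction to the most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(u, centroids[c]))
                  for u in unit]
        # Re-estimate each centroid as the normalized mean of its members.
        for c in range(k):
            members = [u for u, lab in zip(unit, labels) if lab == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


# Hypothetical rank-1 directions: two roughly orthogonal groups.
labels, centroids = spherical_kmeans(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]], k=2)
```

Because the centroids are renormalized after every update, only the direction of each component matters, which is why cosine-based clustering suits singular vectors whose scale is carried separately by the singular values.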

2605.05668 2026-05-08 cs.AI cs.CV

Large Vision-Language Models Get Lost in Attention

Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

Affiliations School of Cyberspace Security, Beijing University of Posts and Telecommunications; State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications; Nanyang Technological University, Singapore; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

AI summary The study reveals the differing roles of attention and feed-forward networks in large vision-language models, proposes a unified framework grounded in information theory and geometry, and finds that the attention modules are redundant.

Comments 25 pages, 10 figures. Accepted by ICML 2026

English abstract

Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively "get lost in attention" rather than efficiently leveraging visual context.

2605.05664 2026-05-08 cs.CV

Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes

Yiyang Shen, Yin Yang, Kun Zhou, Tianjia Shao

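Affiliations State Key Lab of CAD&CG, Zhejiang University; University of Utah; Hangzhou Research Institute of Holographic and AI Technology

AI summary The paper proposes S2C-3D, a framework that reconstructs complete, high-fidelity scenes from six to eight images by combining a specialized diffusion model, view-consistency conditioned sampling, and camera trajectory planning, improving 3D reconstruction quality and robustness.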

Journal ref SIGGRAPH 2026 Conf. Proc

English abstract

We introduce S2C-3D, a novel sparse-view 3D reconstruction framework for high-fidelity and complete scene reconstruction from as few as six to eight images. Our framework features three components: a specialized diffusion model for scene-specific image restoration, a training-free view-consistency conditioned sampling process in the diffusion model for refined Gaussian optimization, and a camera trajectory planning scheme to ensure comprehensive scene coverage. The specialized diffusion model is developed by finetuning a pretrained architecture on the input views and their corresponding degraded counterparts. The adaptation to the scene distribution allows the model to repair Gaussian renderings while effectively eliminating domain gaps. Meanwhile, the trajectory planning scheme optimizes scene coverage by connecting each newly sampled camera to its two nearest neighbors. By iteratively constructing paths and retaining only those that significantly enhance visibility, the scheme establishes a trajectory that covers the entire scene. To address multi-view conflicts, the view-consistency conditioned sampling process quantifies the consistency between neighboring repaired images. This information is injected as a condition into the sampling process of the frozen diffusion model, facilitating the generation of view-consistent images without additional training. Consequently, our approach produces high-fidelity 3D Gaussians that are robust to artifacts. Experimental results demonstrate that S2C-3D outperforms state-of-the-art methods, constructing high-quality scenes that are free from missing regions, blurring, or other artifacts with very sparse inputs. The source code and data are available at https://gapszju.github.io/S2C-3D.
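
The step of connecting each newly sampled camera to its two nearest neighbors can be sketched roughly as follows. This assumes cameras are compared by Euclidean distance between 3D positions, which the abstract does not state; the function name and coordinates are hypothetical, and the visibility-based path filtering is not modeled.

```python
import math


def two_nearest_neighbors(new_cam, cameras):
    """Return the indices of the two existing cameras closest to new_cam."""
    order = sorted(range(len(cameras)),
                   key=lambda i: math.dist(new_cam, cameras[i]))
    return order[:2]


# Hypothetical camera positions (x, y, z).
cameras = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 2.0, 0.0)]
neighbors = two_nearest_neighbors((0.9, 0.1, 0.0), cameras)
# Candidate path segments from the new camera to each selected neighbor.
edges = [((0.9, 0.1, 0.0), cameras[i]) for i in neighbors]
```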

2605.05662 2026-05-08 cs.CL cs.AI

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

Dasol Choi, Eugenia Kim, Jaewon Noh, Sang Seo, Eunmi Kim, Myunggyo Oh, Yunjin Park, Brigitta Jesica Kartono, Josef Pichlmeier, Helena Berndt, Sai Krishna Mendu, Glenn Johannes Tungka, Özlem Gökçe, Suresh Gehlot, Katherine Pratt, Amanda Minnich, Haon Park

Affiliations AIM Intelligence; Microsoft; Korea AISI; KT Corporation; BMW Group; Coinbase; Technical University of Munich; Ankara University; Cyril Amarchand Mangaldas; Seoul National University

AI summary XL-SafetyBench evaluates LLM safety and cultural sensitivity with 5,500 test cases across 10 country-language pairs. It finds that jailbreak robustness and cultural awareness are not coupled in frontier models, and that local models exhibit a near-linear ASR-NSR trade-off.

English abstract

Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench, a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and dual independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness do not show a coupled relationship among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.
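
For readers who want to see how a figure such as r = -0.81 is obtained, here is a small sketch that assumes r denotes the Pearson correlation between per-model ASR and NSR (the abstract does not name the coefficient). The per-model scores below are made up for illustration and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-model scores, NOT taken from the benchmark.
asr = [0.10, 0.20, 0.30, 0.40]
nsr = [0.90, 0.70, 0.80, 0.40]
r = pearson_r(asr, nsr)  # negative: higher ASR goes with lower NSR
```

A value close to -1 means the points lie near a downward-sloping line, which is the sense in which the trade-off is described as near-linear.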

2605.05660 2026-05-08 cs.LG math.OC

Distributionally Robust Multi-Objective Optimization

Yufeng Yang, Fangning Zhuo, Ziyi Chen, Heng Huang, Yi Zhou

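Affiliations Department of Computer Science; Texas A&M University; University of Maryland

AI summary The paper introduces distributionally robust multi-objective optimization (DR-MOO) and develops multi-gradient descent algorithms (MGDA) that reach ε-Pareto-stationary points in the nonconvex setting, improving the sample complexity to O(ε^{-4}).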

Comments 47 pages

English abstract

Multi-objective optimization (MOO) has received growing attention in applications that require learning under multiple criteria. However, the existing MOO formulations do not explicitly account for distributional shifts in the data. We introduce distributionally robust multi-objective optimization (DR-MOO), which minimizes multiple objectives under their respective worst-case distributions. We propose Pareto-type solution concepts for DR-MOO and develop multi-gradient descent algorithms (MGDA) with provable guarantees. Leveraging a Lagrangian dual reformulation, we first design a double-loop MGDA that uses an inner loop to estimate dual variables and achieves a total sample complexity $\mathcal{O}(ε^{-12})$ for reaching an $ε$-Pareto-stationary point. To further improve efficiency, we incorporate gradient clipping to handle generalized-smooth and biased gradient estimates, removing the need for double sampling. This yields a single-loop double-clip MGDA with substantially improved sample complexity $\mathcal{O}(ε^{-4})$. Our theory applies to the nonconvex setting and does not require bounded objectives or gradients. Experiments demonstrate that our methods are competitive with state-of-the-art MGDA baselines.
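
To make the MGDA building blocks concrete, the sketch below shows the standard two-objective case: the minimum-norm convex combination of two gradients has a closed form, and norm clipping rescales a gradient that is too large. This is a generic illustration, not the paper's double-loop or double-clip algorithm; the function names and the clipping threshold are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def clip_by_norm(g, max_norm):
    """Rescale g so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(dot(g, g))
    if norm <= max_norm:
        return list(g)
    return [x * max_norm / norm for x in g]


def min_norm_direction(g1, g2):
    """Minimum-norm point of the segment {a*g1 + (1-a)*g2 : 0 <= a <= 1}."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: nothing to balance
        return list(g1)
    # Setting d/da ||g2 + a*(g1 - g2)||^2 = 0 gives a = (g2 - g1).g2 / ||g1 - g2||^2.
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = min(1.0, max(0.0, alpha))  # stay inside the segment
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


# Hypothetical gradients of two objectives at the current iterate.
g1 = clip_by_norm([3.0, 4.0], max_norm=1.0)
g2 = clip_by_norm([0.0, 1.0], max_norm=1.0)
direction = min_norm_direction(g1, g2)  # common descent direction
```

When the resulting direction is the zero vector, no step decreases both objectives to first order, which is the usual notion of Pareto stationarity.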

2605.05659 2026-05-08 cs.LG

Structural Correspondence and Universal Approximation in Diagonal plus Low-Rank Neural Networks

Ying Chen, Aoxi Li, Jihun Kim, Javad Lavaei

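Affiliations Department of Industrial Engineering & Operations Research

AI summary The paper studies the limits of neural networks constrained strictly to low-rank manifolds, proposes a Structural Correspondence framework, proves that adding a small number of sparse diagonal elements to low-rank layers suffices for universal approximation, and shows that multiplicative depth offers better parameter-to-expressivity scaling than additive width.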

Comments 27 pages, 6 figures

English abstract

The massive computational costs of scaling modern deep learning architectures have driven the widespread use of parameter-efficient low-rank structures, such as LoRA and low-rank factorization. However, theoretical guarantees for their expressive power are less explored, often relying on restrictive priors like a pretrained base matrix, ReLU activations or non-verifiable singularity conditions. We first investigate the limits of neural networks constrained strictly to low-rank manifolds without pretrained dense priors. We demonstrate a theoretical paradox: while purely rank-1 layers can exactly interpolate arbitrary scalar datasets, they collapse for function approximation. To overcome this bottleneck without surrendering parameter efficiency, we introduce a unified Structural Correspondence framework. We prove that augmenting low-rank layers with only a minimal sparse diagonal component, namely a Diagonal plus Low-Rank (DLoR) structure, is sufficient to reach Universal Approximation. We show that any full-rank transformation can be exactly reconstructed using these DLoR components by trading off network width (additive decomposition) or depth (multiplicative decomposition). By tracking asymptotic Taylor remainders, we prove that DLoR neural networks fully restore the Universal Approximation Theorem for general activation functions. Finally, we establish that multiplicative depth provides superior parameter-to-expressivity scaling compared to additive width. Our results show that dense matrices and specific activation functions are not topological prerequisites for universal expressivity.
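
As a rough picture of why a diagonal plus low-rank layer is parameter-efficient, the sketch below applies W = diag(d) + U Vᵀ to a vector without ever forming the dense n × n matrix. It assumes a fully populated diagonal and plain Python lists; the paper's sparse-diagonal construction and its proofs are not reproduced, and the function name is hypothetical.

```python
def dlor_matvec(d, U, V, x):
    """Compute (diag(d) + U V^T) x with U and V given as n x r nested lists."""
    n, r = len(d), len(U[0])
    # Project x onto the r low-rank directions: t = V^T x.
    t = [sum(V[i][k] * x[i] for i in range(n)) for k in range(r)]
    # Diagonal part plus low-rank correction, row by row.
    return [d[i] * x[i] + sum(U[i][k] * t[k] for k in range(r))
            for i in range(n)]


# Toy example with n = 2 and rank r = 1: W = [[1, 1], [0, 2]].
y = dlor_matvec(d=[1.0, 2.0], U=[[1.0], [0.0]], V=[[0.0], [1.0]], x=[3.0, 4.0])
```

The layer stores n + 2nr numbers and the product costs O(nr) operations, compared with n² for a dense matrix.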

2605.05657 2026-05-08 cs.AI cs.MA

Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

Abhijit Talluri, Pujith Anne, Bhagavan Choudary Pendiyala, Raghavendra Chilukuri

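Affiliations Independent Researcher

AI summary The paper proposes the RGAO architecture, which selects the orchestration topology of multi-agent code generation systems by extracting a structural complexity vector from the code. By combining complexity-conditioned LLM routing with a formal resource algebra, it obtains provable budget conservation, lowers the misrouting rate, and improves code retrieval efficiency.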

Comments 30 pages, 9 figures. NeurIPS 2026 Evaluations and Datasets Track submission, under review.

English abstract

Multi-agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We present Retrieval-Guided Adaptive Orchestration (RGAO), an architecture that closes this loop by extracting a structural complexity vector from a hierarchical code index before selecting the orchestration topology. RGAO operates within Code-Agent, a multi-agent framework whose sub-agents are governed by formal contracts with six-dimensional budget vectors. Our headline contribution is the composition of two previously separate lines of work -- complexity-conditioned LLM routing and formal resource algebras -- yielding a property neither admits alone: provable budget conservation under retrieval-conditioned dynamic topology selection. Concretely, we contribute: (1) a complexity-conditioned topology router that reduces proxy-measured misrouting from 30.1% to 8.2%; (2) a budget algebra with a structural-induction conservation theorem; and (3) a hierarchical code retrieval engine. Empirical evaluation demonstrates sub-millisecond DAG construction and linear tree-index scalability.

2605.05653 2026-05-08 cs.CL

Negative Before Positive: Asymmetric Valence Processing in Large Language Models

Sohan Venkatesh

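Affiliations Manipal Institute of Technology Bengaluru

AI summary The study reveals asymmetric processing of positive and negative emotional valence in large language models. Activation patching and steering show that negative valence is processed in early layers while positive valence peaks in mid-to-late layers, and that valence can be steered.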

English abstract

Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete target for interpretability-based oversight.
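
Steering along a direction is commonly implemented by adding a scaled vector to a layer's hidden state. The sketch below assumes the direction is a difference of mean activations between good-news and bad-news prompts, which is a common choice but is not confirmed by the abstract; the activations, names, and scale are hypothetical toy values.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_direction(pos_acts, neg_acts):
    """Difference of means: points from negative toward positive examples."""
    pos_mean, neg_mean = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha times the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical 2-dimensional activations at one layer.
good_news = [[1.0, 1.0], [3.0, 1.0]]
bad_news = [[0.0, 0.0], [0.0, 2.0]]
direction = steering_direction(good_news, bad_news)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

In a real model the same addition would be applied to the residual stream at the identified layers, typically through a forward hook.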

2605.05646 2026-05-08 cs.CV

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

Panqi Yang, Haodong Jing, Jiahao Chao, Tingyan Xiang, Li Lin, Yao Hu, Yang Luo, Yongqiang Ma

Affiliations State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center of Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiao Tong University

AI summary MUSE resolves manifold misalignment in visual tokenization through topological orthogonality, balancing high-fidelity pixel reconstruction with semantic abstraction and improving generation quality and linear probing performance.

Comments 21 pages. Accepted by ICML 2026 main track

English abstract

Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.

2605.05643 2026-05-08 cs.AI cs.IR

Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG

Jiarui Zhong, Hong Cai Chen

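Affiliations School of Automation, Southeast University

AI summary The paper proposes TGS-RAG, a framework that fuses text and graph structure through a bidirectional mechanism, addressing the information-island problem in conventional RAG and improving multi-hop reasoning performance.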

Comments 12 pages, 3 figures

English abstract

Retrieval-Augmented Generation (RAG) has become a core paradigm for enhancing factual grounding and multi-hop reasoning in Large Language Models (LLMs). Traditional text-based RAG often retrieves logically irrelevant pseudo-evidence, while graph-based RAG is frequently hindered by search-time pruning, which may discard potentially valid reasoning paths. Existing hybrid approaches primarily adopt simple evidence concatenation or unidirectional enhancement, which fails to address the fundamental "Information Island" problem caused by asymmetric reasoning flows between unstructured text and structured graphs. We propose TGS-RAG, a unified framework for Text-Graph Synergistic enhancement. TGS-RAG introduces a bidirectional mechanism: (i) a Graph-to-Text channel that employs a Global Voting strategy from visited graph nodes to re-rank and refine textual evidence, filtering out semantic noise; and (ii) a Text-to-Graph channel that utilizes the Memory-based Orphan Entity Bridging algorithm. This algorithm utilizes textual cues to proactively resurrect valid but previously pruned reasoning paths from the search history without additional database overhead. Experimental results on multiple multi-hop reasoning benchmarks demonstrate that TGS-RAG significantly outperforms state-of-the-art baselines, achieving a superior balance between retrieval precision and computational efficiency.

2605.05640 2026-05-08 cs.CV

AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries

Zhen Zhang, Yuhang Yang, Yunxiang Jiang, Yuhuan Lu, Haifeng Lu, Zheng Lian, Runhao Zeng, Xiping Hu

Affiliations Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, Lanzhou University; Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University; Department of Computer and Information Engineering, Khalifa University; Artificial Intelligence Research Institute, Shenzhen MSU-BIT University; Department of Electrical and Computer Engineering, The University of Hong Kong

AI summary The paper proposes the VQAU task, which requires models to localize affective moments in long videos, predict emotion categories, and generate evidence-grounded rationales. It constructs the VQAU-Bench benchmark and proposes the AffectSeek framework, which achieves vague-query-driven affective understanding through step-by-step reasoning.

English abstract

Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-clipped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study Vague-Query-driven video Affective Understanding (VQAU), a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct VQAU-Bench, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose AffectSeek, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.

2605.05638 2026-05-08 cs.LG

Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning

Brett Barkley, Preston Culbertson, David Fridovich-Keil

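Affiliations Department of Electrical and Computer Engineering; The University of Texas at Austin; Department of Computer Science; Cornell University; Department of Aerospace Engineering and Engineering Mechanics

AI summary The paper studies label-free out-of-distribution detection without fine-tuning by scaling pretrained representations, shows how the geometry of frozen representations affects detection performance, and demonstrates across tasks that better representation quality improves detection.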

English abstract

Models trained with deep learning often fail to signal when inputs fall outside their training data manifold, leading to unreliable predictions under distribution shift. Prior work suggests that effective out-of-distribution (OOD) detection often requires class-conditional modeling or specialized models obtained through supervised fine-tuning. We revisit this assumption in modern pretrained models and show that their frozen representations already encode sufficient geometric structure for accurate label-free OOD detection. Across 59 backbone-task pairings spanning vision and language, we compare two complementary label-free detectors: a global Mahalanobis estimator fit on unlabeled latent representations, and ReSCOPED, a lightweight, diffusion-based typicality estimator operating on the same features at a local level. Despite their different detection mechanisms, representation scaling reveals a consistent regime-dependent pattern: both local and global detectors' absolute performance improves with better representation quality, and performance gaps between the two detectors disappear across both language and vision tasks as representations scale. These results suggest that label-free OOD detection depends strongly on the geometry exposed by frozen pretrained backbones, reducing the importance of detector choice as backbone scale increases and enabling efficient deployment directly on frozen models.
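
A global Mahalanobis detector of the kind mentioned above can be sketched in a few lines: fit one mean and covariance to unlabeled in-distribution features, then score a new feature by its squared Mahalanobis distance. The example assumes 2-dimensional features so the covariance inverse has a closed form; the feature values and function names are hypothetical, and ReSCOPED is not modeled.

```python
def fit_gaussian(feats):
    """Sample mean and covariance of a list of 2-D feature vectors."""
    n = len(feats)
    mean = [sum(f[i] for f in feats) / n for i in range(2)]
    cov = [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in feats) / (n - 1)
            for j in range(2)] for i in range(2)]
    return mean, cov


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x; larger means more atypical."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - mean[0], x[1] - mean[1]]
    return sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))


# Hypothetical frozen-backbone features of unlabeled in-distribution inputs.
train_feats = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = fit_gaussian(train_feats)
score_typical = mahalanobis_sq((1.0, 1.0), mean, cov)
score_shifted = mahalanobis_sq((5.0, 5.0), mean, cov)
```

Thresholding the score yields a label-free detector, since no class labels are needed to fit the mean or the covariance.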

2605.05636 2026-05-08 cs.CV cs.GR

Learning a Delighting Prior for Facial Appearance Capture in the Wild

Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu

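Affiliations School of Software; BNRist, Tsinghua University; ShanghaiTech University; Deemos Technology Co., Ltd.

AI summary The paper proposes training a powerful delighting network as a prior to constrain optimization, integrates heterogeneous data sources to achieve high-quality reflectance estimation, and open-sources the model and dataset to advance high-quality facial capture research.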

Comments ACM Transactions on Graphics (Proc. of SIGGRAPH), 2026. Code: https://github.com/yxuhan/OpenDelight Project Page: https://yxuhan.github.io/OpenDelight/index.html

English abstract

High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose to shift the paradigm to training a powerful delighting network as a prior to constrain the optimization. We leverage the OLAT dataset and the rendered Light Stage scans for training, and propose Dataset Latent Modulation (DLM) to seamlessly integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, enabling the emergence of a delighting prior that outperforms existing proprietary models. This powerful delighting prior enables a simple and automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs, outperforming prior art by a large margin. Furthermore, we leverage our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.

2605.05627 2026-05-08 cs.CV cs.AI cs.LG cs.RO

Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan, Anthony Deschênes, François Pomerleau, Philippe Giguère

Affiliations Northern Robotics Laboratory, Université Laval, Québec, QC, G1V 0A6, Canada; Département d'informatique et de génie logiciel, Université Laval, Québec, QC, G1V 0A6, Canada

AI summary Using the Gen4Regen dataset and the Nano Banana Pro model to generate high-quality images with semantic masks, the paper addresses data scarcity and class imbalance in forest regeneration species segmentation, improving the F1 score by more than 15 percentage points.

Comments 36 pages, 17 figures

English abstract

Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine-grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo-interpretation for high-resolution, millimetre-level aerial imagery. Importantly, we leverage the large-scale vision-language Nano Banana Pro model to simultaneously generate high-fidelity images and their corresponding pixel-aligned semantic masks from prompts. We introduce WilDReF-Q-V2, an expansion of a natural forest dataset with 13,977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2,101 pairs of synthetic images and semantic masks. Our methodology integrates real-world data with AI-generated images, highlighting that AI-generated data is highly complementary to real-world data, with unified training yielding an F1 score improvement of over 15 percentage points compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt-generated data significantly improve performance for underrepresented species, some of which saw per-species F1 score gains of up to 30 percentage points. We conclude that vision-language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab-ulaval.github.io/gen4regen.

2605.05626 2026-05-08 cs.CL cs.AI

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

When2Speak: 一个多党对话中大型语言模型的时间参与和发言时机的数据集

Vihaan Nama, Shreya Mendi, Zian Ye, Brinnae Bent

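Affiliations * Pratt School of Engineering; Duke University

AI summary This paper presents When2Speak, a dataset built with a four-stage generation pipeline for learning when to speak in group interactions, and applies reinforcement learning to make models' participation in multi-party conversations more accurate and natural.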

Comments Currently under review. Dataset can be found: https://huggingface.co/datasets/duke-trust-lab/When2Speak

Details
AI-generated Chinese abstract

大型语言模型(LLMs)在生成上下文合适响应方面表现出色,但在多党对话中决定何时发言同样关键。我们引入When2Speak,一个基于现实的合成数据集和四阶段生成流程,用于学习群体互动中的干预时机。数据集包含超过215,000个示例,源自16,000次包含2-6名发言者的对话,涵盖多样化的对话风格、语气和参与者动态,并在每个回合显式建模SPEAK vs. SILENT决策。我们的流程结合了现实世界的基础、结构化增强、受控转录合成和可微调的监督,并完全开源以支持可重复性和适应特定领域的对话规范。在多个模型家族中,对When2Speak的监督微调(SFT)显著优于零样本基线(例如,平均宏F1增加超过4B+参数模型的60%,最大增加为120%)。然而,SFT训练的模型仍系统性地过于保守,错过近一半的必要干预,如通过遗漏干预率(MIR)显示,平均为0.50,即使在较大模型规模下也存在。为了解决这一限制,我们应用了具有不对称奖励塑造的强化学习,将MIR降低到0.186-0.218,并将召回率从0.479提高到0.78-0.81。我们的发现表明,时间参与是对话智能的一个独特且可训练的维度,而基于现实的合成数据为使LLM更自然和适当地参与多党互动提供了有效且可扩展的途径。

English abstract

Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions as seen through the Missed Intervention Rate (MIR), which was on average 0.50 and is noticed even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.
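The reported figures are consistent with reading the Missed Intervention Rate as the complement of recall on the SPEAK class (for example, recall of 0.78-0.81 against an MIR of 0.186-0.218). The sketch below computes the three metrics under that assumption; the helper name `turn_metrics` and the label strings are illustrative and are not taken from the dataset or its code.

```python
def turn_metrics(gold, pred):
    """Per-turn metrics for SPEAK/SILENT decisions given as equal-length label lists."""
    def recall_and_f1(label):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return recall, f1

    speak_recall, speak_f1 = recall_and_f1("SPEAK")
    _, silent_f1 = recall_and_f1("SILENT")
    return {
        "recall": speak_recall,
        # Assumed definition: share of warranted SPEAK turns where the model stayed silent.
        "mir": 1.0 - speak_recall,
        "macro_f1": (speak_f1 + silent_f1) / 2,
    }
```

An over-conservative model shows up here as high precision on SPEAK combined with a high `mir`, which is the pattern the abstract describes for SFT-only models.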

2605.05623 2026-05-08 cs.LG

Region-adaptable retrieval of coastal biogeochemical parameters from near-surface hyperspectral remote sensing reflectance using physics-aware meta-learning

基于物理感知元学习的沿海生物地球化学参数近地表高光谱遥感反射率区域适应性检索

Yiqing Guo, Nagur R. C. Cherukuru, Eric A. Lehmann, S. L. Kesav Unnithan, Tim J. Malthus, Gemma Kerrisk, Xiubin Qi, Faisal Islam, Tisham Dhar, Mark J. Doubell

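Affiliations * Queensland Department of the Environment, Tourism, Science and Innovation; South Australia Research and Development Institute, Aquatic Sciences

AI summary This paper proposes a two-stage physics-aware meta-learning framework that pretrains a region-agnostic base model on synthetic data and then fine-tunes it with local samples, enabling region-adaptable retrieval of biogeochemical parameters from hyperspectral remote sensing reflectance.

Details
AI-generated Chinese abstract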

高光谱原位传感在获取水体生物地球化学(BGC)参数,如总悬浮固体、溶解有机碳和总叶绿素a方面显示出潜力,可用于低成本监测沿海水质。然而,将此类检索算法推广到不同水体仍具挑战性,因为远程传感反射率(Rrs)与BGC参数之间的关系因地区环境条件和生物地球化学差异而显著变化。在本研究中,我们提出了一种两阶段的物理感知元学习框架,用于从近地表Rrs观测中检索沿海BGC参数。第一阶段利用生物光学正向模型生成一个基于具有广泛代表性的澳大利亚沿海水体原位生物光学光谱库的大合成数据集。然后,该数据集用于预训练一个区域无关的基础模型,通过元学习使其学习基本的物理关系。第二阶段,预训练的基础模型通过局部样本进行微调以适应特定区域。我们从澳大利亚沿海五个地理上不同的地点收集了原位高光谱Rrs和BGC测量数据。实验结果表明:(1)BGC参数及其对应的高光谱Rrs特征在实验地点之间表现出明显的区域差异;(2)合成数据集在物理上合理,并且在参数分布和参数间相关性方面与真实世界样本紧密一致;(3)所提出的方法在BGC检索中优于五个基准模型;(4)原位测量和模型预测的BGC参数时间序列在幅度和时间动态上表现出良好的一致性。

English abstract

Hyperspectral in situ sensing has shown promise in retrieving aquatic biogeochemical (BGC) parameters, such as total suspended solids, dissolved organic carbon, and total chlorophyll-a, for cost-effective monitoring of coastal water quality. However, generalising such retrieval algorithms across water bodies remains challenging, as the relationship between remote sensing reflectance (Rrs) and BGC parameters can vary considerably from one region to another due to regional distinctions in environmental conditions and biogeochemistry that lead to different BGC ranges and bio-optical properties. In this study, we propose a two-stage physics-aware meta-learning framework for retrieving coastal BGC parameters from near-surface Rrs observations. In the first stage, a bio-optical forward model is used to generate a large synthetic dataset based on an in situ bio-optical spectral library with broad representativeness of Australian coastal waters. This dataset is then used to pretrain a region-agnostic base model with meta-learning, allowing the model to learn fundamental physical relationships. In the second stage, the pretrained base model is fine-tuned for specific regions with local samples. We collected in situ hyperspectral Rrs and BGC measurements from five geographically distinct sites in Australian coastal waters. Our experimental results suggest: (1) the BGC parameters and their corresponding hyperspectral Rrs signatures exhibited clear regional distinctions among the experimental sites; (2) the synthetic dataset was physically plausible and closely aligned with real-world samples in both parameter distributions and inter-parameter correlations; (3) the proposed approach outperformed five benchmark models in BGC retrieval; and (4) time series of in situ measured and model-predicted BGC parameters showed good agreement in both magnitude and temporal dynamics.

2605.05616 2026-05-08 cs.CV cs.LG

RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

RAM-H1200:手部放射影像中类风湿性关节炎的统一评估与数据集

Songxiao Yang, Haolin Wang, Yao Fu, Junmu Peng, Lin Fan, Hongruixuan Chen, Jian Song, Masayuki Ikebe, Shinya Takamaeda-Yamazaki, Masatoshi Okutomi, Tamotsu Kamishima, Yafei Ou

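Affiliations * Institute of Science Tokyo; Hokkaido University; Southwest Jiaotong University; RIKEN; The University of Tokyo

AI summary This paper presents RAM-H1200, a dataset of 1,200 hand radiographs with multi-level annotations for evaluating whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity.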

Comments 50 pages, 24 figures, 25 tables

Details
AI-generated Chinese abstract

类风湿性关节炎(RA)评估需多级分析和建模解剖结构及细微局部病理变化。然而现有公开资源不支持统一多级分析,常缺乏全手覆盖、细粒度标注和与临床评分系统的一致整合。特别是能够进行骨侵蚀(BE)定量分析的标注仍稀缺。RAM-H1200包含来自六个医疗中心的1200张手部放射影像,包含多级标注:(i)全手骨结构实例分割,(ii)像素级BE掩码,(iii)SvdH定义的感兴趣区域,以及(iv)BE和关节间隙狭窄(JSN)的关节级SvdH评分。其旨在评估模型能否联合捕捉手部放射影像中的解剖结构、局部侵蚀性病变和临床标准化RA严重程度。所提出的BE掩码首次通过提供显式空间监督实现超越粗粒度分类的BE分析。据我们所知,RAM-H1200是首个公开的大规模基准,同时支持全手骨结构实例分割、像素级BE勾勒以及临床基础的关节级SvdH评分。在基准任务中的结果表明,解剖建模比定量BE分析更成熟:全手骨分割表现强劲,而BE分割仍是一个主要开放挑战。通过统一解剖结构建模、定量病变分析和临床基础的SvdH评分,RAM-H1200提供了一个全面的RA分析单一基准。

English abstract

Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, often lacking full-hand coverage, fine-grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM-H1200 contains 1,200 hand radiographs collected from six medical centers, with multi-level annotations including (i) whole-hand bone structure instance segmentation, (ii) pixel-level BE masks, (iii) SvdH-defined joint regions of interest, and (iv) joint-level SvdH scores for both BE and joint space narrowing (JSN). It is designed to evaluate whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity from hand radiographs. The proposed BE masks enable, for the first time, quantitative BE analysis beyond coarse categorical grading by providing explicit spatial supervision for lesion extent and morphology. To our knowledge, RAM-H1200 is the first public large-scale benchmark that jointly supports whole-hand bone structure instance segmentation, pixel-level BE delineation, and clinically grounded joint-level SvdH scoring for both BE and JSN. Results across benchmark tasks show that anatomical modeling is substantially more mature than quantitative BE analysis: whole-hand bone segmentation achieves strong performance, whereas BE segmentation remains a major open challenge. By unifying anatomical structure modeling, quantitative lesion analysis, and clinically grounded SvdH scoring, RAM-H1200 provides a single benchmark for comprehensive RA analysis on hand radiographs.

2605.05609 2026-05-08 cs.LG econ.EM stat.ML

Optimal Contextual Pricing under Agnostic Non-Lipschitz Demand

在无偏非利普希茨需求下的最优上下文定价

Jianyu Xu, Yu-Xiang Wang

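Affiliations * Carnegie Mellon University; University of California San Diego

AI summary This paper studies contextual dynamic pricing with linear valuations and bounded-support agnostic noise, and proposes a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection to achieve optimal regret, improving on the previous upper bound.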

Comments 30 pages, 1 figure, 1 table

Details
AI-generated Chinese abstract

我们研究了具有线性估值和有界支持无偏噪声的上下文动态定价问题,其诱导的需求曲线可能非利普希茨,具有任意跳跃和原子。这些不连续性破坏了平滑需求定价算法所使用的跨上下文插值论证,而之前最好的方法仅实现了~O(T^{3/4})的 regret。我们提出保守降价重定向 UCB 定价算法,结合了随机参数估计、保守残差网格探测和基于置信度的一步重定向。我们的算法实现了~O(T^{2/3})最优 regret,与 Kleinberg 和 Leighton(2003)已知的下界相符,仅在对数因子上,优于 Xu 和 Wang(2022)之前的上界。在随机良好条件的上下文中,这填补了线性估值上下文定价在无偏非利普希茨噪声分布下的长期存在的 regret 间隙。

English abstract

We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only $\tilde O(T^{3/4})$ regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection. Our algorithm achieves $\tilde O(T^{2/3})$ optimal regret, matching the known lower bounds of Kleinberg and Leighton (2003) up to logarithmic factors and improving over the previous upper bound of Xu and Wang (2022). Under stochastic well-conditioned contexts, this closes the long-existing open regret gap in linear-valuation contextual pricing under agnostic non-Lipschitz noise distribution.
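For orientation, the setting can be written out as follows. The notation ($x_t$, $\theta^*$, $N_t$, $F$) is assumed here for illustration; the abstract itself fixes only the linear valuation, the bounded-support noise, and the two regret rates.

$$
v_t = x_t^\top \theta^* + N_t, \qquad N_t \sim F \ \text{with bounded support}, \qquad \text{sale at price } p_t \iff v_t \ge p_t
$$

$$
\mathrm{Regret}(T) = \sum_{t=1}^{T} \Big( \max_{p}\, p \,\Pr[v_t \ge p \mid x_t] \;-\; p_t \,\Pr[v_t \ge p_t \mid x_t] \Big)
$$

Because $F$ may have atoms, the map $p \mapsto \Pr[v_t \ge p \mid x_t]$ can jump, which is what rules out smoothness-based interpolation across contexts. Moving from $\tilde O(T^{3/4})$ to $\tilde O(T^{2/3})$ lowers the regret exponent by $1/12$.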

2605.05598 2026-05-08 cs.AI cs.HC

Prober.ai: Gated Inquiry-Based Feedback via LLM-Constrained Personas for Argumentative Writing Development

Prober.ai:基于LLM约束人设的门控探究式反馈用于论辩写作发展

Ran Bi, Shiyao Wei, Yuanyiyi Zhou

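Affiliations * Florida State University; New York University

AI summary Prober.ai constrains an LLM to structured JSON output so that it gives feedback only as inquiry-based questions, helping students develop argumentative writing; its two-phase interaction architecture supports students' own thinking through mandatory reflection.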

Comments Prototype awarded second place at the NYEdTech Hackathon (March 2026) https://www.nyedtechhackathon.com/2026-submissions

Details
AI-generated Chinese abstract

大语言模型(LLM)在教育中的普及反而削弱了其支持的认知过程。学生越来越多地将批判性思维委托给AI助手,导致认知债务和论辩推理能力下降。我们提出了Prober.ai,一个基于网页的写作环境,颠覆传统AI辅导范式:系统通过特定角色提示和结构化JSON输出模式约束LLM,仅生成针对论辩弱点的探究式问题。双阶段交互架构(挑战与解锁)通过强制学生反思来实现教学摩擦机制。系统设计基于托尔金的论辩理论、同伴反馈提问机制研究以及AI支持的写作教学证据。在纽约EdTech黑客松(2026年3月)中,36小时内开发了一个功能原型,获得第二名。我们描述了系统架构、约束LLM输出至教学对齐JSON模式的提示工程方法,并讨论了在写作教育中实现可扩展、认知保护的AI整合的含义。

English abstract

The proliferation of large language models (LLMs) in educational settings has paradoxically undermined the cognitive processes they purport to support. Students increasingly outsource critical thinking to AI assistants that generate polished text on demand, resulting in measurable cognitive debt and diminished argumentative reasoning skills. We present Prober.ai, a web-based writing environment that inverts the conventional AI-tutoring paradigm: rather than generating or rewriting student text, the system constrains an LLM (Gemini 3 Flash Preview) through persona-specific system prompts and structured JSON output schemas to produce only targeted, inquiry-based questions about argumentative weaknesses. A two-phase interaction architecture -- Challenge and Unlock -- implements a pedagogical friction mechanism whereby revision suggestions are gated behind mandatory student reflection. The system's design is grounded in Toulmin's argumentation theory, research on peer feedforward questioning mechanisms, and evidence on AI-supported feedback in writing instruction. A functional prototype was developed in 36 hours during the NY EdTech Hackathon (March 2026), where it was awarded second place. We describe the system architecture, the prompt engineering methodology for constraining LLM output to pedagogically aligned JSON schemas, and discuss implications for scalable, cognition-preserving AI integration in writing education.
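The two mechanisms described above, schema-constrained output and gated suggestions, can be sketched in a few lines. Everything named here is hypothetical: the field names `persona` and `questions`, the function names, and the `min_words` threshold are illustrative assumptions, not the system's actual schema or rules.

```python
import json

# Hypothetical schema: these field names are illustrative only.
ALLOWED_KEYS = {"persona", "questions"}

def parse_probe(raw: str) -> list[str]:
    """Accept only a JSON object whose payload is a non-empty list of questions."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != ALLOWED_KEYS:
        raise ValueError("unexpected fields in model output")
    questions = data["questions"]
    if not isinstance(questions, list) or not questions:
        raise ValueError("expected a non-empty list of questions")
    for q in questions:
        if not isinstance(q, str) or not q.strip().endswith("?"):
            raise ValueError("every item must be a question")
    return questions

def unlock_suggestions(reflection: str, suggestions: list[str], min_words: int = 20) -> list[str]:
    """Release revision suggestions only after a sufficiently long reflection (assumed rule)."""
    return suggestions if len(reflection.split()) >= min_words else []
```

Rejecting any unexpected key is a deliberate choice in this sketch: a response that carries rewritten text in an extra field fails validation instead of reaching the student.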

2605.05594 2026-05-08 cs.CL cs.CV cs.LG

The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

上下文的成本:减轻多模态检索增强生成中的文本偏差

Hoin Jung, Xiaoqian Wang

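Affiliations * Elmore Family School of Electrical and Computer Engineering

AI summary This paper examines the textual bias that arises when multimodal large language models are combined with retrieval-augmented generation, and proposes BAIR, which restores visual saliency and applies position-aware penalties to improve multimodal grounding and diagnostic reliability.

Details
AI-generated Chinese abstract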

尽管多模态大语言模型(MLLMs)越来越多地与检索增强生成(RAG)结合以减轻幻觉,但引入外部文档可能导致实例层面的严重失败模式。我们识别并正式化了“recorruption”现象,即即使引入完美准确的“ oracle”上下文,有能力的模型也会放弃最初正确的预测。通过分析内部注意力矩阵,我们发现recorruption由双重注意力崩溃驱动:(1)视觉盲性,表现为系统性抑制视觉注意力质量(M_vis)和锐度(S_vis);(2)结构性位置偏差,迫使模型优先考虑边界令牌而非语义相关性。我们的分析揭示了“成功幻觉”,表明许多看似正确的RAG结果仅仅是位置巧合,模型的文本复制偏差恰好与真实位置一致。为解决这些漏洞,我们提出Bottleneck Attention Intervention for Recovery(BAIR),一种无需参数、推理时的框架,可恢复视觉显著性并施加位置感知惩罚至文本干扰项。在医疗事实性、社会公平性和地理空间基准上,BAIR成功恢复多模态接地并提高诊断可靠性,无需模型重新训练或微调。

English abstract

While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
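The abstract does not define $M_{vis}$ precisely. One plausible reading, sketched here purely as an assumption, is the share of a query token's attention that falls on image tokens. The function name and input format are hypothetical, and the sharpness term $S_{vis}$ is left out because the abstract gives no indication of its form.

```python
def visual_attention_mass(attn_row, visual_positions):
    """Share of one query token's attention that lands on visual tokens.

    attn_row: non-negative attention weights over all key positions.
    visual_positions: indices of image tokens (an assumed input format).
    """
    total = sum(attn_row)
    visual = sum(attn_row[i] for i in visual_positions)
    return visual / total if total else 0.0
```

Under this reading, "visual blindness" would appear as the value dropping once retrieved text is appended to the prompt.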

2605.05593 2026-05-08 cs.AI

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

多模态大语言模型内部视觉表征的因果探测

Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang

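Affiliations * School of Computer Science, Shanghai Jiao Tong University; Ant Group

AI summary Using an activation-steering framework to intervene systematically on four visual concept categories, the study shows that entities are memorized locally whereas abstract concepts are distributed globally, finds that model depth is essential for encoding complex abstract concepts, and uses reverse steering to expose a compensatory mechanism between perception and generation.

Details
AI-generated Chinese abstract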

尽管多模态大语言模型(MLLMs)在多样任务中取得显著成功,但其内部机制如何编码和 grounding 不同的视觉概念仍不清晰。为此,我们提出基于激活引导的因果框架,主动探测和操控内部视觉表征。通过系统干预四个视觉概念类别,结果揭示了概念编码的分歧:实体表现出不同的局部记忆,而抽象概念则在网络中全局分布。关键发现是,增加模型深度对于编码分布和复杂抽象概念至关重要,而实体定位对规模变化具有显著不变性。进一步,反向引导揭示阻断显式输出会引发潜在激活的激增,暴露感知与生成之间的补偿机制。最后,将分析扩展到视觉推理,揭示尽管MLLMs成功识别几何关系,但它们仅将其视为静态视觉特征,未能触发抽象问题解决所需的过程执行。

English abstract

Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose a causal framework based on activation steering to actively probe and manipulate internal visual representations. Through systematic intervention across four visual concept categories, our results reveal a divergence in concept encoding: entities exhibit distinct localized memorization, whereas abstract concepts are globally distributed across the network. Critically, this divergence uncovers a mechanistic driver of scaling laws: increasing model depth is indispensable for encoding distributed and complex abstract concepts, whereas entity localization remains remarkably invariant to scale. Furthermore, reverse steering uncovers that blocking explicit output triggers a surge in latent activations, exposing a compensatory mechanism between perception and generation. Finally, extending our analysis to visual reasoning, we expose a disconnect between perception and reasoning: although MLLMs successfully recognize geometric relations, they treat them merely as static visual features, failing to trigger the procedural execution necessary for abstract problem-solving.

2605.05592 2026-05-08 cs.LG cs.IT math.IT

When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation

投票何时能帮助、伤害或改变进程?二元测试时间聚合的精确结构

Yi Liu

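Affiliations * York University

AI summary The study characterizes the exact structure of voting in binary test-time aggregation, shows that voting curves can be nonmonotone with infinitely many trend changes, and proves that the signed voting signature is equivalent to the complete odd-budget curve.

Details
AI-generated Chinese abstract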

多数投票是少数能提升固定随机预测器的黑箱干预之一:重复访问可能比更换高能力模型更便宜。经典固定能力理论使这种干预显得单调——超过多数阈值时更多投票有帮助,低于时则有害。我们证明这种图景本质上不完整。在德芬尼蒂表示下,交换性重复正确性下,投票由每个例子的正确性概率的潜在分布所主导。即使简单的潜在混合物也能生成显著不同的投票曲线,包括非单调行为和显式构造中的无限趋势变化。完整的潜在规律决定曲线,但曲线不决定规律。投票恢复的对象是一个带符号的投票签名:在二项式方差尺度上,它记录了超过而非低于多数阈值的潜在质量。主要定理证明了完整奇数预算曲线和此签名的等价性:曲线增量是带符号的豪斯多夫矩,完整曲线唯一恢复签名。这一观点解释了形状现象、分支对称非识别性、可实现性、变化性和端点速率。它还分离了估计阶段:直接每个例子的成功概率信息针对完整签名,而固定深度分组标签仅揭示有限前缀。

English abstract

Majority voting is one of the few black-box interventions that can improve a fixed stochastic predictor: repeated access can be cheaper than changing a high-capability model. Classical fixed-competence theory makes this intervention look monotone -- more votes help above the majority threshold and hurt below it. We show that this picture is fundamentally incomplete. Under the de Finetti representation for exchangeable repeated correctness, voting is governed by a latent distribution of per-example correctness probabilities. Even simple latent mixtures can generate sharply different voting curves, including nonmonotone behavior and, in an explicit construction, infinitely many trend changes. The full latent law determines the curve, but the curve does not determine the law. The exact object recovered by voting is a signed voting signature: at each binomial variance scale, it records excess latent mass above rather than below the majority threshold. Our main theorem proves that the complete odd-budget curve and this signature are equivalent: the curve increments are signed Hausdorff moments, and the full curve recovers the signature uniquely. This viewpoint explains shape phenomena, branch-symmetric nonidentifiability, realizability, variation, and endpoint rates. It also separates estimation regimes: direct per-example success-probability information targets the full signature, whereas fixed-depth grouped labels reveal only a finite prefix.
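A small numerical illustration of the nonmonotone behavior: when per-example correctness probabilities follow a latent mixture, the accuracy of an odd-budget majority vote is the mixture average of binomial tail probabilities. The two-point mixture below is an illustrative choice made here, not a construction from the paper.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent votes are correct), for odd n."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

def voting_curve(mixture, budgets):
    """mixture: list of (weight, p) pairs describing the latent distribution."""
    return {n: sum(w * majority_accuracy(p, n) for w, p in mixture) for n in budgets}

# Illustrative mixture: half the examples sit slightly above the majority
# threshold (p = 0.55), half sit well below it (p = 0.30).
curve = voting_curve([(0.5, 0.55), (0.5, 0.30)], budgets=[1, 3, 201])
```

With this mixture, accuracy falls from 0.425 at one vote to about 0.395 at three votes, because the below-threshold examples degrade quickly, and then climbs back above the single-vote level at 201 votes as the above-threshold examples slowly converge.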

2605.05590 2026-05-08 cs.CV

Uncertainty-Guided Edge Learning for Deep Image Regression in Remote Sensing

不确定性引导的边缘学习用于遥感中的深度图像回归

Anh Vu Nguyen, Dino Sejdinovic, Tat-Jun Chin

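Affiliations * Australian Institute for Machine Learning

AI summary This paper proposes an uncertainty-guided edge learning algorithm for deep image regression in remote sensing, using deep beta regression to speed up training on edge devices.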

Comments AI4Space @ CVPR 2026

Details
AI-generated Chinese abstract

边缘学习指的是在边缘平台上训练机器学习模型,通常使用新积累的数据。边缘设备的计算限制不仅影响模型优化,还影响当前模型对未标记数据的预测不确定性计算,这对于指导模型更新至关重要。本文研究了在遥感卫星上执行深度图像回归的边缘学习,其中深度网络由 onboard 计算机执行,从输入图像回归一个标量 y,例如 y 是表示云覆盖或土地利用的像素百分比。我们提出了一种不确定性引导的边缘学习(UGEL)算法,能够准确优先数据以加速 onboard 回归模型的训练收敛。UGEL 的基础是基于深度 beta 回归的预测不确定性计算,其中深度网络用于估计 beta 分布的参数,对于输入图像的目标 y 有高概率。与现有不确定性估计方法相比,深度 beta 回归可以在单次前向传递中计算,并允许更通用的预测分布。结果表明,UGEL 比主动或半监督学习具有更快的边缘学习收敛速度。代码和模型可在 https://github.com/anh-vunguyen/UGEL 上公开获取。

English abstract

Edge learning refers to training machine learning models deployed on edge platforms, typically using new data accumulated onboard. The computational limitations on edge devices affect not only model optimisation, but also calculation of the predictive uncertainty of the current model on the unlabelled data, which is vital for informing model updating. In this paper, we investigate edge learning in the context of performing deep image regression on a remote sensing satellite, where a deep network is executed by an onboard computer to regress a scalar $y$ from an input image, e.g., $y$ is the percentage of pixels indicating cloud coverage or land use. We propose an uncertainty-guided edge learning (UGEL) algorithm that can accurately prioritise the data to speed up training convergence of the on-board regression model. Underpinning UGEL is the calculation of predictive uncertainty based on deep beta regression, where a deep network is used to estimate the parameters of a beta distribution for which the target $y$ for an input image has a high likelihood. Compared to established methods for uncertainty estimation that are either too costly on edge devices (e.g., require many forward passes per sample) or make strict assumptions on the predictive distribution (e.g., Gaussian), deep beta regression is computable in a single forward pass and allows more general predictive distributions. Results show that UGEL delivers faster-converging edge learning than active or semi-supervised learning. Code and models are publicly available at https://github.com/anh-vunguyen/UGEL.
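The beta-regression idea can be sketched without a network: given the two parameters a model would output for an image, the training loss is the beta negative log-likelihood of the target fraction, and an uncertainty score comes from the same single forward pass. Using the beta variance as that score and ranking the most uncertain samples first are assumptions made here; the abstract does not state which statistic or ordering UGEL uses, and all function names are hypothetical.

```python
from math import lgamma, log

def beta_nll(y: float, alpha: float, beta: float) -> float:
    """Negative log-likelihood of y in (0, 1) under Beta(alpha, beta)."""
    log_norm = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return log_norm - (alpha - 1) * log(y) - (beta - 1) * log(1 - y)

def beta_variance(alpha: float, beta: float) -> float:
    """Variance of Beta(alpha, beta), used here as the uncertainty score (assumption)."""
    s = alpha + beta
    return alpha * beta / (s * s * (s + 1))

def prioritise(predictions, k):
    """predictions: list of (sample_id, alpha, beta); return the k most uncertain ids."""
    ranked = sorted(predictions, key=lambda t: beta_variance(t[1], t[2]), reverse=True)
    return [sample_id for sample_id, _, _ in ranked[:k]]
```

Unlike a Gaussian head, the beta distribution is supported on the unit interval, which matches targets such as a cloud-cover fraction.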

2605.05586 2026-05-08 cs.LG

AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling

AeroJEPA:学习语义潜在表示以实现可扩展的3D气动场建模

Francisco Giral, Abhijeet Vishwasrao, Andrea Arroyo Ramo, Mahmoud Golestanian, Federica Tonti, Adrian Lozano-Duran, Steven L. Brunton, Sergio Hoyas, Hector Gomez, Soledad Le Clainche, Ricardo Vinuesa

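Affiliations * Universidad Politécnica de Madrid; University of Michigan; Universitat Politècnica de València; Purdue University; Caltech; University of Washington

AI summary AeroJEPA uses a joint-embedding predictive architecture for scalable 3D aerodynamic field modeling in high-fidelity and large-scale settings, with a semantically organized latent space that makes the surrogate more efficient and more useful for design.

Details
AI-generated Chinese abstract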

气动替代模型日益被用于替代重复的高保真度CFD评估,但现有方法仍面临两个重要限制:难以扩展到现实3D气动学中出现的非常大的场,且很少产生对分析和设计直接有用的有效潜在表示。我们引入AeroJEPA,一种用于气动场建模的联合嵌入预测架构,解决了这两个问题。而不是直接从几何预测完整流场,AeroJEPA从几何和操作条件的上下文潜在表示预测流场的目标潜在表示,并可选地通过连续隐式解码器重建场。这种形式将潜在预测与场分辨率解耦,同时鼓励潜在空间进行语义组织。我们在两个互补数据集上评估AeroJEPA:HiLiftAeroML,该数据集在高保真度范围内测试方法,具有极大的边界层场;SuperWing,该数据集测试大规模泛化和潜在空间优化,覆盖广泛跨音速机翼家族。在这些基准测试中,AeroJEPA在连续替代气动场方面具有竞争力,能够自然扩展到高分辨率输出,并学习编码几何和气动量的上下文和预测潜在表示,这些量未直接用作监督。我们进一步表明,所得到的潜在空间支持受控插值、线性探测、概念向量算术和受约束的设计潜在优化实验。这些结果表明,预测潜在学习是可扩展且具有设计意义的气动替代建模的有前途方向。

English abstract

Aerodynamic surrogate models are increasingly used to replace repeated high-fidelity CFD evaluations in many-query design settings, but current approaches still face two important limitations: they often scale poorly to the very large fields arising in realistic 3D aerodynamics, and they rarely produce latent representations that are directly useful for analysis and design. We introduce AeroJEPA, a Joint-Embedding Predictive Architecture for aerodynamic field modeling that addresses both issues. Rather than predicting the full flow field directly from geometry, AeroJEPA predicts a target latent representation of the flow from a context latent representation of the geometry and operating conditions, and optionally reconstructs the field through a continuous implicit decoder. This formulation decouples latent prediction from field resolution while encouraging the latent space to organize semantically. We evaluate AeroJEPA on two complementary datasets: HiLiftAeroML, which stresses the method in a high-fidelity regime with extremely large boundary-layer fields, and SuperWing, which tests large-scale generalization and latent-space optimization over a broad family of transonic wings. Across these benchmarks, AeroJEPA is competitive as a continuous surrogate for aerodynamic fields, scales naturally to high-resolution outputs, and learns context and predicted latents that encode geometry and aerodynamic quantities not used directly as supervision. We further show that the resulting latent space supports controlled interpolation, linear probing, concept-vector arithmetic, and a constrained design latent-optimization experiment. These results suggest that predictive latent learning is a promising direction for scalable and design-meaningful aerodynamic surrogate modeling.

2605.05580 2026-05-08 cs.AI

AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading

AlphaCrafter:一个跨截面量化交易的全栈多智能体框架

Yishuo Yuan, Jiayi Sheng, Sirui Zeng, Jiaqi Wang, Jiaheng Liu

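Affiliations * Nanjing University

AI summary AlphaCrafter integrates factor discovery, regime-adaptive selection, and risk-constrained execution in a continuously adaptive factor-to-execution pipeline, yielding robust performance in cross-sectional quantitative trading.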

Comments Submitted to NeurIPS 2026. 26 pages, 8 figures

Details
AI-generated Chinese abstract

金融市场本质上是非平稳的,受宏观经济制度、微观结构摩擦和行为动态的复杂交互驱动。构建持续盈利的量化策略需要连续耦合因子发现、制度适应选择和风险约束执行。现有方法却在静态或孤立假设下优化这些组件。因子挖掘框架通常将alpha发现视为一次性搜索过程,隐含假设因子效用在市场制度中持续存在。以执行为导向的系统常采用角色扮演智能体架构模拟拟人化交易委员会,引入行为噪声而非系统理性。因此,一个统一的量化管道全自动化、理性驱动框架仍缺失。我们引入AlphaCrafter,一个全栈多智能体框架,通过连续适应的因子到执行管道弥合这一差距,设计用于跟踪并响应演变的市场条件而无需人工干预。AlphaCrafter通过三个专业化智能体运作:Miner持续扩展因子池通过LLM引导搜索,Screener评估现行市场条件以构建制度条件因子集合,Trader将这些集合转化为在显式风险约束下的量化策略。这些智能体共同形成一个闭环的跨截面交易系统,整体适应演变的市场动态。在CSI 300和S&P 500上的大量实验表明,AlphaCrafter在风险调整收益上持续优于最先进的基线,同时表现出最低的跨试验方差,证实了整合和适应的因子到执行设计产生稳健的交易性能。

English abstract

Financial markets are inherently non-stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor discovery, regime-adaptive selection, and risk-constrained execution. Prevailing approaches, however, optimize these components under static or isolated assumptions. Factor mining frameworks typically treat alpha discovery as a one-time search process, implicitly assuming that factor efficacy persists across market regimes. Execution-oriented systems often adopt role-playing agent architectures that simulate anthropomorphic trading committees, introducing behavioral noise rather than systematic rationality. Consequently, a fully automated, rationality-driven framework unifying a coherent quantitative pipeline remains absent. We introduce AlphaCrafter, a full-stack multi-agent framework that closes this gap through a continuously adaptive factor-to-execution pipeline, designed to track and respond to evolving market conditions without manual intervention. AlphaCrafter operates via three specialized agents: a Miner that continuously expands the factor pool via LLM-guided search, a Screener that assesses prevailing market conditions to construct regime-conditioned factor ensembles, and a Trader that translates these ensembles into quantitative strategies under explicit risk constraints. Together, these three agents form a closed-loop cross-sectional trading system that adapts holistically to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 demonstrate that AlphaCrafter consistently outperforms state-of-the-art baselines in risk-adjusted returns while exhibiting the lowest cross-trial variance, confirming that integrated and adaptive factor-to-execution design yields robust trading performance.

2605.05577 2026-05-08 cs.LG cs.AI

Accelerating LMO-Based Optimization via Implicit Gradient Transport

通过隐式梯度传输加速基于LMO的优化

Won-Jun Jang, Si-Hyeon Lee

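Affiliations * School of Electrical Engineering; Korea Advanced Institute of Science and Technology

AI summary This paper proposes LMO-IGT, which uses implicit gradient transport to accelerate LMO-based optimization, introduces a unified framework together with a new stationarity measure (RSF), and establishes its advantages in iteration complexity and computational cost.

Details
AI-generated Chinese abstract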

最近的优化器如Lion和Muon通过线性最小化Oracle(LMO)规范化梯度动量表现出色。尽管已探索方差减少以加速LMO方法,但通常因额外梯度评估导致计算开销大。同时,LMO方法的理论理解在无约束和有约束形式中仍不连贯。受这些限制启发,我们提出LMO-IGT,一种利用隐式梯度传输(IGT)的随机LMO方法。我们进一步引入统一的随机LMO优化框架,并引入新的站定性度量,即正则化支撑函数(RSF),在共同框架下连接梯度范数和Frank-Wolfe间隙概念。通过在运输点评估随机梯度,LMO-IGT加速收敛同时保留标准随机LMO的单梯度每迭代结构。分析表明,随机LMO的迭代复杂度为O(ε^{-4}),方差减少LMO为O(ε^{-3})但需额外梯度评估,而LMO-IGT为O(ε^{-3.5})仅需单次随机梯度每迭代。实验表明,LMO-IGT在无显著开销下优于随机LMO方法。其实例Muon-IGT在评估设置中表现最佳,证明IGT为现代LMO优化提供有效且实用的加速机制。

English abstract

Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs substantial computational overhead due to additional gradient evaluations. At the same time, the theoretical understanding of LMO-based methods remains fragmented across unconstrained and constrained formulations. Motivated by these limitations, we propose LMO-IGT, a new class of stochastic LMO-based methods leveraging implicit gradient transport (IGT). We further introduce a unified framework for stochastic LMO-based optimization together with a new stationarity measure, the regularized support function (RSF), which bridges gradient-norm and Frank--Wolfe-gap notions within a common framework. By evaluating stochastic gradients at transported points, LMO-IGT accelerates convergence while retaining the single-gradient-per-iteration structure of standard stochastic LMO. Our analysis establishes that stochastic LMO achieves an iteration complexity of $\mathcal{O}(\varepsilon^{-4})$, variance-reduced LMO achieves $\mathcal{O}(\varepsilon^{-3})$ at the cost of additional gradient evaluations, and LMO-IGT achieves $\mathcal{O}(\varepsilon^{-3.5})$ using only a single stochastic gradient per iteration. Empirically, LMO-IGT consistently improves over stochastic LMO counterparts with negligible overhead. Among its instantiations, Muon-IGT achieves the strongest overall performance across evaluated settings, demonstrating that IGT provides an effective and practical acceleration mechanism for modern LMO-based optimization.
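A toy sketch of the idea, not the paper's algorithm: the LMO over the unit $\ell_\infty$ ball returns the negative sign of its input, giving a sign-type step, and the momentum buffer is fed with a single stochastic gradient evaluated at an extrapolated point. The extrapolation factor $\gamma/(1-\gamma)$ follows the common form of implicit gradient transport and is an assumption here, as are the function names, the step size, and the noisy quadratic test problem.

```python
import random

def lmo_linf(m):
    """LMO over the unit l-infinity ball: argmin over ||s||_inf <= 1 of <m, s> is -sign(m)."""
    return [-1.0 if v > 0 else (1.0 if v < 0 else 0.0) for v in m]

def lmo_igt(grad, x0, steps=600, lr=0.01, gamma=0.9, seed=0):
    """Sign-type LMO steps with momentum built from gradients at transported points."""
    rng = random.Random(seed)
    x, x_prev = list(x0), list(x0)
    m = [0.0] * len(x0)
    shift = gamma / (1.0 - gamma)  # assumed IGT extrapolation factor
    for _ in range(steps):
        # Transported point: still only one stochastic gradient per iteration.
        z = [xi + shift * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(z, rng)
        m = [gamma * mi + (1.0 - gamma) * gi for mi, gi in zip(m, g)]
        x_prev = x
        x = [xi + lr * si for xi, si in zip(x, lmo_linf(m))]
    return x

def noisy_quadratic_grad(z, rng, sigma=0.1):
    """Stochastic gradient of f(x) = 0.5 * ||x||^2."""
    return [zi + rng.gauss(0.0, sigma) for zi in z]

def loss(x):
    return 0.5 * sum(v * v for v in x)
```

Calling `lmo_igt(noisy_quadratic_grad, [1.0, -2.0, 3.0])` drives the iterate from a loss of 7.0 to a small neighborhood of the origin. On a quadratic the extrapolation offsets the lag of the moving average, which is the intuition behind transporting past gradients.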

2605.05572 2026-05-08 cs.CV

Text-to-CAD Retrieval: a Strong Baseline

基于文本的CAD检索:一个强大的基线

Honghu Pan, Zibo Du, Daxiang Liu, Chengliang Liu, Xiaoling Luo

Affiliations * School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Faculty of Science and Technology, University of Macau(澳门大学科学与技术学院)

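AI summary This paper introduces the text-to-CAD retrieval task and proposes a multi-modal embedding framework for semantic retrieval of CAD models, laying a foundation for downstream CAD generation.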

Details
AI Chinese abstract

基于计算机辅助设计(CAD)模型的文本检索是重用遗留工业设计的关键但未被充分研究的任务。现有CAD仓库通常通过文件名或目录搜索,限制了设计检索的效率、可扩展性和准确性。本文正式引入文本到CAD检索作为新的跨模态检索任务,旨在根据自然语言查询从大规模数据库中检索语义相关CAD模型。利用Text2CAD数据集中的配对文本-CAD注释,我们建立了该任务的实用基准。为实现基于文本的检索,我们提出一个统一框架,从过程序列和几何点云中学习多模态CAD嵌入。具体来说,序列编码器捕捉CAD模型的构造逻辑,而点编码器提取显式几何特征。文本编码器用于学习文本查询的语义表示。在训练过程中,我们引入了一个新的特征解码器,通过交叉注意力与文本和点特征重建掩码序列特征,鼓励隐含的多模态对齐。在推理时,我们移除这个辅助解码器,以利用拼接的序列-点特征实现高效的检索。我们的框架为文本到CAD检索提供了强大的基线,并为下游CAD生成范式,如检索增强生成,奠定了基础。源代码将被发布。

English abstract

Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.
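The inference step described above (retrieval with concatenated sequence-point features) can be sketched as a cosine-similarity search. This is an illustrative sketch under stated assumptions, not the authors' implementation: the encoders are replaced by precomputed arrays, the text embedding is assumed to have the same dimension as the concatenated CAD feature, and the function names `build_cad_index` and `retrieve` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products equal cosine similarity.
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def build_cad_index(seq_feats, point_feats):
    # Concatenate per-model sequence and point features, then normalize.
    return l2_normalize(np.concatenate([seq_feats, point_feats], axis=1))

def retrieve(text_emb, cad_index, k=5):
    # Rank CAD models by cosine similarity to the text query embedding.
    q = l2_normalize(text_emb)
    scores = cad_index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    index = build_cad_index(rng.normal(size=(100, 16)), rng.normal(size=(100, 16)))
    ids, sims = retrieve(rng.normal(size=32), index, k=3)
    print(ids, sims)
```

Because the index is built once offline, query time reduces to one matrix-vector product, which is consistent with the abstract's point that dropping the auxiliary decoder keeps retrieval efficient.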

2605.05567 2026-05-08 cs.AI

Locality-aware Private Class Identification for Domain Adaptation with Extreme Label Shift

面向极端标签偏移下域适应的局部感知私有类识别

Chuan-Xian Ren, Cheng-Jun Guo, Hong Yan

Affiliations * School of Mathematics, Sun Yat-Sen University(中山大学数学学院) Department of Electrical Engineering, City University of Hong Kong(香港城市大学电气工程系)

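AI summary This paper proposes a locality-aware private class identification method, based on the local transportation and metric properties of optimal transport, to handle extreme label shift in domain adaptation; by separating shared-class samples from private-class samples, it reduces the difficulty of cross-domain classification.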

Details
AI Chinese abstract

域适应旨在将知识从有标注的源域迁移到分布不同的无标注目标域。在现实场景中,两个域的标签空间通常存在包含关系,即某些类只存在于其中一个域。这些不重叠的类被称为私有类。识别私有类样本并减轻其负面影响是文献中的关键问题。现有方法假设私有类的偏移足够大,可被视为异常值。然而,单个共享类内部的方差可能显著大于某个私有类与另一个共享类之间的差异,这对该假设构成了挑战。因此,私有类大大增加了跨域分类的难度。为了解决这些问题,基于最优传输(OT)的局部传输和度量性质,本文提出了一种局部感知的私有类识别方法,其形式为定义在传输质量上的评分函数。该方法的有效性在理论上得到证明,突显了该评分函数在区分共享类和私有类样本方面的强大能力。在此基础上,我们提出了一种可靠的基于最优传输的方法(ReOT),用于严重标签偏移下的域适应。ReOT在最小化分类风险的同时,学习已识别的共享类与私有类之间相互分离的聚类结构,有效避免共享-私有样本对的错误匹配,从而确保重要知识在类内可靠传输,以缓解类条件差异。此外,本文为极端标签偏移场景给出了目标风险的泛化上界,该上界可通过ReOT最小化。在基准数据集上的大量实验验证了ReOT的有效性。

English abstract

Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain with a different distribution. In real-world scenarios, the label spaces of the two domains often have an inclusion relationship, where some classes exist in only one of the domains. These non-overlapping classes are referred to as private classes. Identifying private class samples and mitigating their adverse effects is a key problem in the literature. Existing methods rely on the assumption that shifts in private classes are large enough for their samples to be treated as outliers. However, the variance within a single shared class can be significantly larger than the difference between a private class and another shared class, challenging this assumption. Consequently, private classes substantially increase the difficulty of cross-domain classification. To address these issues, based on local transportation and metric properties of optimal transport (OT), a locality-aware private class identification approach is proposed in the form of a score function on transport mass. The effectiveness of the proposed approach is theoretically proven, highlighting the score function's strong ability to distinguish between shared and private class samples. Building on this, we introduce a reliable OT-based method (ReOT) for domain adaptation under severe label shift. ReOT minimizes classification risk while learning a separated cluster structure between the identified shared and private classes, effectively avoiding mismatches between shared-private sample pairs and thus ensuring that important knowledge is reliably transported within classes to mitigate class-conditional discrepancy. Furthermore, a generalization upper bound on the target risk is provided for extreme label shift scenarios, which can be minimized by ReOT. Extensive experiments on benchmarks validate the effectiveness of ReOT.

2605.05566 2026-05-08 cs.AI cs.CL cs.LG

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

无意义帮助:提示空间扰动拓宽推理探索

Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang, Jiaxin Huang

Affiliations * Washington University in St. Louis(华盛顿大学圣路易斯分校)

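AI summary This paper proposes the LoPE framework, which improves LLM reasoning through task-irrelevant prompt-space perturbation; experiments show that it outperforms conventional resampling across models of different sizes.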

Details
AI Chinese abstract

可验证奖励的强化学习,特别是群体相对策略优化(GRPO),显著提升了大型语言模型(LLM)的推理能力。然而在复杂任务中,GRPO常面临『零优势问题』:当所有采样路径失败时,相对优势归零,导致模型失去有效训练信号,浪费训练数据和计算资源。虽然增加采样预算是常见解决方案,但静态采样策略限制了推理探索,限制了成功率。本文提出Lorem扰动探索(LoPE),一种简单有效的训练框架,以打破这一探索瓶颈。我们假设任务无关的提示空间扰动足以改变模型输出分布,解锁硬问题的正交推理路径。具体而言,LoPE在提示前随机拼接由Lorem Ipsum词汇(伪拉丁占位文本)构成的序列后重新采样。在17亿、40亿和70亿参数模型上的实验表明,LoPE在性能上显著优于使用原始提示的重采样方法。进一步分析显示,其他低困惑度的拉丁语随机序列也有效。我们的结果确立LoPE为扩展LLM强化学习探索的有力基线。

English abstract

Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
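The zero-advantage problem and the perturbation step can be illustrated with a short sketch. It is not the authors' code: the advantage uses the commonly cited group normalization (reward minus group mean, divided by group standard deviation), and the word list, prefix length, separator, and function names are assumptions chosen for illustration.

```python
import random
import statistics

# Small illustrative vocabulary (assumption); the paper's word list may differ.
LOREM_VOCAB = ("lorem ipsum dolor sit amet consectetur adipiscing elit "
               "sed do eiusmod tempor incididunt ut labore").split()

def group_relative_advantages(rewards, eps=1e-6):
    # Normalize each rollout's reward against its own sampling group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def lorem_prefix(prompt, n_words, rng):
    # Prepend a randomly assembled pseudo-Latin sequence to the prompt.
    words = [rng.choice(LOREM_VOCAB) for _ in range(n_words)]
    return " ".join(words) + "\n\n" + prompt

if __name__ == "__main__":
    # All rollouts fail: every advantage is zero, so there is no signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # One success in the group restores a non-zero training signal.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
    print(lorem_prefix("What is 2 + 2?", 8, random.Random(0)))
```

In a training loop, a query whose group yields all-zero advantages would be resampled from a prompt produced by `lorem_prefix` rather than from the original prompt; how the resulting rollouts enter the policy update is not specified in the abstract.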

2605.05556 2026-05-08 cs.CV

An extremely coarse feedback signal is sufficient for learning human-aligned visual representations

极粗的反馈信号足以学习与人类对齐的视觉表征

Yash Mehta, Michael F. Bonner

Affiliations * Department of Cognitive Science, Johns Hopkins University(约翰霍普金斯大学认知科学系)

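AI summary This study examines how coarse feedback signals affect the alignment of visual representations, finding that networks trained to distinguish only 8 categories match or even exceed the alignment of fine-grained models and align more closely with human perceptual similarity judgments.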

Comments 21 pages, 6 figures

Details
AI Chinese abstract

在视觉任务上训练的人工神经网络会形成与灵长类视觉系统相似的内部表征,这一发现指导了十年来的计算神经科学研究。构建与大脑对齐的模型的研究逐渐采用更细粒度的监督信号,从物体分类到最大化单张图像间区分度的对比自监督目标,但监督信号粒度对大脑对齐的作用在很大程度上仍未被研究。本文系统研究了学习信号的粗细如何影响与人类视觉的表征对齐。我们采用数据驱动的方法参数化地改变信号粒度:基于预训练嵌入的PCA分割,将训练图像集划分为不同数量的类别(2, 4, 8, 16, ..., 64)。我们在这些粗粒度分类任务上训练了涵盖卷积和Transformer架构的数百个神经网络,并将其表征与猕猴电生理记录和人类fMRI响应进行比较。我们发现,仅区分8个宽泛类别的网络所学到的表征,其神经对齐程度可匹配甚至超越区分1000类的模型。更引人注目的是,这些粗粒度训练的网络与人类感知相似性判断的对齐程度高于所有其他被评估的模型,包括使用细粒度监督或自监督训练的网络以及领先的大规模视觉模型。这些结果表明,类人的视觉表征可以从极粗的反馈中产生,重新界定了视觉学习可能需要的信号,并为构建更符合人类感知的AI系统开辟了道路。

English abstract

Artificial neural networks trained on visual tasks develop internal representations resembling those of the primate visual system, a discovery that has guided a decade of computational neuroscience. Research on building brain-aligned models has progressively embraced finer-grained supervisory signals, from object classification to contrastive self-supervised objectives that maximize distinctions among individual images, yet the role of supervisory signal granularity in brain alignment remains largely unexamined. Here we systematically investigate how the coarseness of a learning signal shapes representational alignment with human vision. We parametrically vary the level of signal granularity using a data-driven approach that partitions a set of training images into varied numbers of categories (2, 4, 8, 16, ..., 64) via PCA-based splits of pretrained embeddings. We train hundreds of neural networks across convolutional and transformer architectures on these coarse classification tasks and compare their representations to macaque electrophysiology recordings and human fMRI responses. We find that networks trained to distinguish as few as 8 broad categories learn representations that match or exceed the neural alignment of models distinguishing 1,000 classes. Even more strikingly, these coarsely trained networks align more closely with human perceptual similarity judgments than all other models evaluated, including networks trained with fine-grained supervision or self-supervision as well as leading large-scale vision models. These results demonstrate that human-like visual representations emerge from remarkably coarse feedback, reframing what learning signals vision may require and opening a path toward building AI systems that are more aligned with human perception.
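One plausible reading of "PCA-based splits of pretrained embeddings" is a recursive bisection: split each group at the median of its projection onto its first principal component, which yields 2, 4, 8, ... balanced categories. The sketch below illustrates that reading only; the median rule, the per-group PCA, and the function name `pca_split_labels` are assumptions, and the paper's exact procedure may differ.

```python
import numpy as np

def pca_split_labels(embeddings, depth):
    # Assign each embedding to one of 2**depth coarse categories by
    # repeatedly bisecting every current group along its first principal axis.
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(depth):
        new_labels = np.empty_like(labels)
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            x = embeddings[idx] - embeddings[idx].mean(axis=0)
            # First right singular vector of centered data = first principal axis.
            _, _, vt = np.linalg.svd(x, full_matrices=False)
            proj = x @ vt[0]
            # Median split by rank keeps the two halves balanced.
            order = np.argsort(proj, kind="stable")
            upper = np.zeros(len(idx), dtype=int)
            upper[order[len(idx) // 2:]] = 1
            new_labels[idx] = 2 * c + upper
        labels = new_labels
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(64, 5))  # stand-in for pretrained embeddings
    print(np.bincount(pca_split_labels(emb, depth=3)))
```

The resulting integer labels could then serve as targets for an ordinary classification loss, with `depth` controlling the granularity of the supervisory signal.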