arXivDaily: arXiv daily academic digest, updated Monday through Friday
2412.08610 2026-06-12 cs.GT cs.AI cs.CY version update

Competition and Diversity in Generative AI

Manish Raghavan

Affiliations: MIT Sloan School of Management & Department of Electrical Engineering and Computer Science

AI summary: Uses a game-theoretic model and experiments with the word game Scattergories to study how competition pushes generative AI models to diversify, mitigating homogenization, and how it affects social welfare.

Abstract

Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of content produced. The use of the same or similar AI models appears to lead to more homogeneous behavior. Our work begins with the observation that there is a force pushing in the opposite direction: competition. When producers compete with one another (e.g., for customers or attention), they are incentivized to create novel or unique content. We explore the impact competition has on both content diversity and overall social welfare. Through a formal game-theoretic model, we show that competitive markets select for diverse AI models, mitigating monoculture. We further show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to provide value in a competitive market. Our results highlight the importance of evaluating generative AI models across the breadth of their output distributions, particularly when they will be deployed in competitive environments. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for answers that are both correct and unique. Overall, our results suggest that homogenization due to generative AI is unlikely to persist in competitive markets, and instead, competition in downstream markets may drive diversification in AI model development.
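
The incentive described in this abstract can be illustrated with a toy scoring rule. The sketch below assumes the common Scattergories convention of one point per answer that is both correct and given by no other player; the abstract does not state the exact rule used in the experiments, and `score_round`, `valid`, and the sample answers are hypothetical names chosen for illustration.

```python
from collections import Counter

def score_round(answers, is_correct):
    """Score one round: a player earns 1 point only if the answer is correct
    and no other player gave the same (normalized) answer."""
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    return [
        1 if is_correct(a) and counts[a] == 1 else 0
        for a in normalized
    ]

# Hypothetical category: "fruits starting with A".
valid = {"apple", "apricot", "avocado"}

# Three players backed by the same model give the same answer.
same_model = score_round(["Apple", "apple", "apple"], lambda a: a in valid)
# Three players backed by diverse models give distinct correct answers.
diverse = score_round(["apple", "apricot", "avocado"], lambda a: a in valid)

print(same_model)  # [0, 0, 0]
print(diverse)     # [1, 1, 1]
```

Under this assumed rule, an answer that would score well in isolation earns nothing once every competitor produces it, which is the tension between benchmark performance and competitive value that the abstract points to.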

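2510.05430 2026-06-12 cs.RO version update

Active Semantic Perception

Huayi Tang, Pratik Chaudhari

Affiliations: General Robotics, Automation, Sensing and Perception (GRASP) Laboratory

AI summary: Proposes an active semantic perception method based on a compact multi-layer scene graph and large language models for efficient exploration of unknown environments; evaluated in simulation and on a real robot, where it outperforms existing methods.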

Abstract

We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows, etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample new plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. We develop a procedure to compute the information gain of a potential waypoint on this scene graph to enable sophisticated spatial reasoning: for example, of the two doors that lead out of the living room, one probably leads to the kitchen and the other to the bedroom. We evaluate our approach in realistic 3D indoor apartments in simulation and also on a Unitree Go 2 robot in the real world. Qualitative and quantitative analysis shows that our approach can pin down high-level and low-level semantic information in the environment more quickly and more accurately than existing approaches.

2503.06573 2026-06-12 cs.CL cs.AI version update

WildIFEval: Instruction Following in the Wild

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

Affiliations: The Hebrew University of Jerusalem; IBM Research

AI summary: Introduces WildIFEval, a dataset of 7K real user instructions with multiple constraints, for evaluating the instruction-following ability of LLMs; finds that all models still have substantial room for improvement.

Comments
Accepted to the 5th Workshop on Generation, Evaluation and Metrics (GEM) at ACL 2026
Abstract

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.

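2510.03896 2026-06-12 cs.CV cs.RO version update

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

Affiliations: arXiv.org; University of California, Berkeley; Stanford University

AI summary: Proposes the Generalizable Action Expert (GAE), which turns a VLM's high-level intent into continuous action trajectories through a sparse geometric interface. An Action Pre-training, Pointcloud Fine-tuning (APPF) scheme decouples action dynamics from geometry grounding, giving strong generalization across visual domains, viewpoints, and instructions.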

Abstract

Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

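2505.20076 2026-06-12 cs.LG version update

ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

Florian Eichin, Yupei Du, Philipp Mondorf, Maria Matveev, Barbara Plank, Michael A. Hedderich

Affiliations: University of Michigan

AI summary: Presents ExPLAIND, a framework that unifies attribution to model components, data, and training trajectory and supports explanations across granularities. It derives parameter-wise and step-wise influence scores by recasting AdamW-trained models as kernel machines via gradient path kernels, and is applied to a Transformer exhibiting Grokking and to EuroLLM pretraining, where it reveals a two-phase dynamic.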

Comments
Published at ICML 2026, code at this https URL
Abstract

Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation, and are often tied to a particular level of granularity along the local-to-global spectrum. This leads to explanations that lack a unified view and may miss key interactions. We present ExPLAIND, a theoretically grounded, unified framework that integrates model components, data, and training trajectory while supporting explanations across granularities. We generalize recent work on gradient path kernels, reformulating models trained by AdamW as kernel machines. From the resulting kernel feature maps, we derive novel parameter-wise and step-wise influence scores. We empirically validate the resulting decomposition of model behavior in several settings and apply ExPLAIND to two case studies. Our findings on a Transformer exhibiting Grokking support previously proposed learning phases, while refining the final phase as one in which outer layers align around a representation pipeline learned after memorization. For EuroLLM pretraining, ExPLAIND reveals a two-phase dynamic, with the first characterized by outer-layer MLP learning and the second by increased relative influence of intermediate attention layers. These results establish ExPLAIND as a unified framework for interpreting model behavior and training dynamics.

2509.22050 2026-06-12 cs.LG version update

BrainPro: Towards Large-scale Brain State-aware EEG Representation Learning

Yi Ding, Muyun Jiang, Weibang Jiang, Shuailei Zhang, Xinliang Zhou, Chenyu Liu, Shanglin Li, Yong Li, Cuntai Guan

Affiliations: Nanyang Technological University; Shanghai Jiao Tong University; Advanced Telecommunications Research Institute International; Southeast University

AI summary: Proposes BrainPro, which learns shared and state-specific representations through retrieval-based spatial alignment and a brain state-decoupling module, achieving state-of-the-art performance on nine public BCI datasets.

Comments
31 pages, 11 figures
Abstract

Electroencephalography (EEG) reflects underlying brain states, whose activities are distributed across brain regions and manifest as spatial patterns on the scalp. Learning these spatially structured, state-related patterns requires consistent spatial representations across datasets. However, existing EEG foundation models are typically based on self-attention, which does not preserve location-specific information and struggles to align signals recorded with different channel configurations. Moreover, brain states contain both shared and state-specific regional activity, suggesting that learning neurophysiologically plausible, state-aware representations can complement the shared representations targeted by current models and improve downstream decoding. To address these limitations, we propose BrainPro, a large EEG model that combines a retrieval-based spatial learning mechanism for cross-layout spatial alignment with a brain state-decoupling module that learns both shared and state-specific representations through parallel encoders and region-aware reconstruction. Pre-trained on a large EEG corpus, BrainPro achieves state-of-the-art performance across nine public BCI datasets spanning emotion, motor, speech, stress, mental disease, and attention tasks. Analyses of spatial filters, channel-drop robustness, and encoder contributions further validate the effectiveness of its spatial alignment and state-aware pathways. These results show that BrainPro achieves improved interpretability of learned spatial patterns and produces representations that benefit diverse EEG decoding tasks.

2509.21548 2026-06-12 cs.CY cs.CL version update

C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

Manjari Rudra, Daniel Magleby, Sujoy Sikdar

Affiliations: School of Computing, Binghamton University; Department of Political Science, Binghamton University

AI summary: Presents a pipeline for extracting question-answer pairs from hearing transcripts and builds a dataset of committee hearings from the 108th through 117th Congresses. The analysis shows that a questioner's party can be predicted from the questions alone, offering a framework for studying political discourse.

Abstract

Questions in political interviews and hearings serve strategic purposes beyond information gathering, including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th through 117th Congresses. Our analysis reveals systematic differences in questioning strategies across parties, showing that the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.

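2509.21398 2026-06-12 cs.CV eess.IV version update

Skeleton Sparsification and Densification Scale-Spaces

Julia Gierke, Pascal Peter

Affiliations: Mathematical Image Analysis Group, Saarland University; Department of Mathematics and Computer Science, Saarland University

AI summary: Introduces skeletonisation scale-spaces, which sparsify the medial axis to simplify shapes hierarchically, and adds densification for the inverse coarse-to-fine progression; applied to robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.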

Abstract

The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: they leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. By growing the skeleton successively instead of shrinking it, we allow inverse progression from coarse to fine scales. Densification scale-spaces can even reach beyond the original skeleton to produce overcomplete shape representations that are relevant to practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

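2509.19526 2026-06-12 cs.LG eess.SY version update

Metriplectic Conditional Flow Matching for Dissipative Dynamics

Ali Baheri, Lars Lindemann

Affiliations: Rochester Institute of Technology, Rochester, NY, USA; Automatic Control Laboratory, ETH Zürich, Switzerland

AI summary: Proposes metriplectic conditional flow matching (MCFM), which learns dissipative dynamics by building the conservative-dissipative split into both the vector field and a structure-preserving sampler, guaranteeing monotone energy decay and long-horizon stability.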

Abstract

Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure-preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. At inference, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous-time and discrete-time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.

2509.01630 2026-06-12 cs.LG cs.MA cs.RO eess.SY version update

DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization

Bingheng Wang, Yichao Gao, Tianchen Sun, Shanker Ajay, Lin Zhao

Affiliations: Department of Electrical and Computer Engineering, National University of Singapore

AI summary: Proposes DiffCoord, which jointly optimizes the coupled parameters of a truncated ADMM-DDP pipeline through end-to-end meta-learning, uses agent-wise neural networks for task adaptation, and scales to varying agent counts. Validated on a cooperative aerial transport system, it cuts per-agent gradient computation time by up to 70% relative to existing methods.

Abstract

Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.

2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS version update

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

Affiliations: University of California, San Diego

AI summary: Proposes GetNetUPAM, a hierarchical nested cross-validation framework that preserves ecological heterogeneity, and evaluates ARPA-N, a network with CBAM spatial attention, which generalizes robustly under high-noise, low-SNR conditions and cuts false positives per hour roughly tenfold in a region with no training support.

Comments
Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined
Abstract

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
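
The site-year partitioning described in this abstract can be sketched as a leave-one-block-out split. This is a minimal illustration under assumptions, not the authors' implementation: it assumes each outer fold holds out exactly one site-year block, omits the inner stratified folds, and uses hypothetical record fields (`site`, `year`, `label`) and a hypothetical helper name `site_year_outer_folds`.

```python
from collections import defaultdict

def site_year_outer_folds(records):
    """Yield (block, train_idx, test_idx), holding out one site-year block per fold."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[(rec["site"], rec["year"])].append(i)
    for block, test_idx in sorted(blocks.items()):
        held_out = set(test_idx)
        # Everything outside the held-out block is available for model development.
        train_idx = [i for i in range(len(records)) if i not in held_out]
        yield block, train_idx, test_idx

# Hypothetical recordings tagged with site, year, and a call/no-call label.
records = [
    {"site": "A", "year": 2014, "label": 1},
    {"site": "A", "year": 2014, "label": 0},
    {"site": "A", "year": 2015, "label": 1},
    {"site": "B", "year": 2014, "label": 0},
]

folds = list(site_year_outer_folds(records))
for block, train_idx, test_idx in folds:
    print(block, train_idx, test_idx)
```

Because no recording from the held-out site-year block appears in the training indices, a model cannot score well on that fold by memorizing noise or sensor artifacts specific to that block.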

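2508.14143 2026-06-12 cs.LG q-bio.NC version update

The Urysohn Machine: A Metric-Topological Model of Computation

Xin Li

Affiliations: University at Albany, State University of New York

AI summary: Introduces the Urysohn Machine, a model of classification-oriented computation built on metric separation, frontier structure, and contraction; Urysohn Triples and a layered construction yield measures of classification complexity and support reusable inference.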

Abstract

We introduce the Urysohn Machine, an effective model of classification-oriented computation in which metric separation, frontier structure, and contraction are explicit parts of the computational state. Its basic object is a Urysohn Triple: a support region, a target partition, and a separating classifier stored in a reusable Metric Library. The topological foundation is a constructive Urysohn Realization theorem for finite simplicial settings. It builds separators from dyadic ladders of nested polyhedral regions and equips their frontiers with a chain-level calculus: frontiers are cycles, and shells between levels have boundaries given by differences of frontiers. This construction yields two related complexity measures: decision-boundary width, the geometric measure of a single classifier's boundary, and Urysohn width, the total frontier mass represented by a library or realization. We prove an Amortized Separation Theorem showing that approximating a boundary of a given width to a given accuracy requires a number of simple basis triples proportional to boundary width and inversely proportional to resolution, under explicit boundary-footprint assumptions. We also introduce a contrastive separation operator whose graph-cut functional consistently estimates decision-boundary width from sampled metric data, while its Laplacian spectrum certifies class-component structure and conductance. Finally, we analyze the dynamic Urysohn ladder and prove four guarantees: separability under quotient collapse, stability of committed frontiers, bounded capacity under contraction, and scalability with quotient distance. Together, these results give a metric-topological account of classification complexity, amortized inference, and compositional reuse that preserves classical computability while exposing geometric structure hidden by purely symbolic descriptions.

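2508.04888 2026-06-12 cs.LG version update

Retrieval-Augmented Foundation Models for Water Level Prediction in the Everglades

Rahuul Rangaraj, Jimeng Shi, Rajendra Paudel, Giri Narasimhan, Yanzhao Wu

Affiliations: Florida International University; Everglades National Park

AI summary: For water-level forecasting in the Everglades, proposes a retrieval-augmented mechanism that retrieves historical hydrological episodes by statistical similarity or mutual information, improving the long-horizon forecasts of pretrained time-series foundation models, especially during extreme events.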

Abstract

Accurate water level forecasting in the Everglades is essential for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent time-series foundation models have shown strong performance on generic tasks (represented in their pre-training), their effectiveness in domain-specific applications remains insufficiently understood. In this work, we curate a domain-specific dataset for water-level forecasting in the Everglades and observe that the performance of current state-of-the-art models remains limited. To address this gap, we leverage a retrieval-augmented mechanism that retrieves analogous multivariate hydrological episodes from an external archive of historical observations to enrich the input context of those pre-trained models. We study two retrieval strategies, statistical similarity-based retrieval and mutual information-based retrieval, and analyze how incorporating retrieved historical contexts affects predictive performance. Extensive experiments show that retrieval augmentation consistently improves long-horizon water level forecasts and yields disproportionately larger gains during extreme events, which is particularly critical for environmental decision-making. Our study provides empirical evidence that analog-based retrieval can benefit pretrained time-series foundation models in environmental science, offering practical insights into their strengths, limitations, and failure modes when applied to hydrological forecasting in the Everglades. Although evaluated in the Everglades, the proposed framework is general and can be applied to other hydrological systems given time series data. The code and data have been made publicly available at this https URL.
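
The similarity-based retrieval mentioned in this abstract can be sketched as a nearest-neighbor search over historical windows. The sketch below is a simplification under assumptions: it handles a single variable rather than multivariate episodes, and it uses z-normalized Euclidean distance as one plausible "statistical similarity" measure, since the abstract does not name the measure used. The names `znorm`, `retrieve_analogs`, and the sample data are hypothetical.

```python
import math

def znorm(window):
    """Standardize a window to zero mean and unit variance (constant windows map to zeros)."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var) or 1.0
    return [(x - mean) / std for x in window]

def retrieve_analogs(query, archive, k=2):
    """Return indices of the k archive windows closest in shape to the query."""
    q = znorm(query)
    ranked = sorted(range(len(archive)), key=lambda i: math.dist(q, znorm(archive[i])))
    return ranked[:k]

# Hypothetical archive of past water-level windows.
archive = [
    [1.0, 2.0, 3.0, 4.0],      # steady rise
    [4.0, 3.0, 2.0, 1.0],      # steady fall
    [10.0, 20.0, 30.0, 40.0],  # steady rise at a different scale
    [2.0, 2.0, 5.0, 2.0],      # isolated spike
]
query = [0.5, 1.0, 1.5, 2.0]   # recent observations: steady rise

analogs = retrieve_analogs(query, archive)
print(sorted(analogs))  # indices of the two rising windows
```

In a retrieval-augmented setup of the kind the abstract describes, the retrieved windows would then be supplied alongside the recent observations as extra input context for the pretrained forecaster.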

2508.01656 2026-06-12 cs.CL cs.AI cs.CY cs.HC physics.soc-ph version update

Authorship Attribution in Multilingual Machine-Generated Texts

Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli

Affiliations: DIMES Department, University of Calabria; Kempelen Institute of Intelligent Technologies

AI summary: Introduces the multilingual authorship attribution problem and studies how well monolingual methods transfer across 18 languages and 8 generators, finding significant limitations.

Comments
Accepted at ACL 2026 - Main
Abstract

As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content has become increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA currently remains confined to a monolingual setting, with English being the most investigated language, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages, which cover multiple families and writing scripts, and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

2507.22791 2026-06-12 cs.CV 版本更新

Modality-Aware Feature Matching in Visual and Vision-Language Applications: A Comprehensive Survey

视觉与视觉-语言应用中的模态感知特征匹配:全面综述

Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin

发表机构 * School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics(江西财经大学计算机与人工智能学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) School of Computer Science and Informatics, Cardiff University(卡迪夫大学计算机科学与信息学院) School of Computing and Communications, Lancaster University(兰卡斯特大学计算机与通讯学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR)(新加坡资讯研究院,科技研究局(A*STAR)) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 综述基于模态的特征匹配,涵盖传统手工方法和现代深度学习方法,重点讨论跨RGB、深度、3D点云、LiDAR、医学图像及视觉-语言模态的进展,突出模态感知技术。

详情
Comments
CSUR
AI中文摘要

特征匹配是计算机视觉中的一项基础任务,对于图像检索、立体匹配、三维重建和SLAM等应用至关重要。本综述全面回顾了基于模态的特征匹配,探索了传统手工方法,并强调了当代深度学习方法在各种模态中的应用,包括RGB图像、深度图像、3D点云、LiDAR扫描、医学图像和视觉-语言交互。传统方法利用Harris角点等检测器和SIFT、ORB等描述符,在中等模态内变化下表现出鲁棒性,但在显著模态差距下表现不佳。当代基于深度学习的方法,例如基于CNN的SuperPoint和基于Transformer的LoFTR等无检测器策略,显著提高了跨模态的鲁棒性和适应性。我们重点介绍了模态感知的进展,例如用于深度图像的几何和深度特定描述符、用于3D点云的稀疏和密集学习方法、用于LiDAR扫描的注意力增强神经网络,以及用于复杂医学图像匹配的MIND描述符等专门解决方案。跨模态应用,特别是在医学图像配准和视觉-语言任务中,突显了特征匹配处理日益多样化数据交互的演变。

英文摘要

Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by the CNN-based SuperPoint and by detector-free strategies such as the transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.

2507.22028 2026-06-12 cs.CV cs.RO 版本更新

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

从看见到体验:通过强化学习扩展导航基础模型

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Coco Robotics(Coco机器人)

AI总结 提出S2E框架,结合离线视频预训练和模拟环境强化学习,通过锚点引导分布匹配和残差注意力模块,提升导航基础模型的交互性和安全性。

详情
Comments
27 pages, 20 figures, 9 tables, conference
AI中文摘要

基于大规模网络数据训练的导航基础模型使智能体能够跨不同环境和实体进行泛化。然而,这些仅基于离线数据训练的模型往往缺乏推理其行为后果或通过反事实理解进行适应的能力。因此,它们在现实世界城市导航中面临重大限制,其中交互性和安全行为(如避开障碍物和移动行人)至关重要。为解决这些挑战,我们引入了从看见到体验(S2E)学习框架,通过强化学习扩展导航基础模型的能力。S2E结合了离线视频预训练和强化学习后训练的优势。它保持了从大规模真实世界视频中获得的模型泛化能力,同时通过模拟环境中的强化学习增强了其交互性。具体而言,我们引入了两项创新:(1)用于离线预训练的锚点引导分布匹配策略,通过基于锚点的监督稳定学习并建模多样化的运动模式;(2)用于强化学习的残差注意力模块,从模拟环境中获得反应性行为,同时不抹除模型的预训练知识。此外,我们建立了一个全面的端到端评估基准NavBench-GS,该基准基于真实世界场景的光照逼真3D高斯溅射重建,并融入了物理交互。它可以系统评估导航基础模型的泛化能力和安全性。

英文摘要

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.

2507.20208 2026-06-12 cs.CL 版本更新

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

从基准到技能:LLM评估的低秩因子

Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

发表机构 * Bar-Ilan University(巴伊兰大学) OriginAI Data Science Institute Columbia University(哥伦比亚大学数据科学学院) Center for Data Science New York University(纽约大学数据科学中心)

AI总结 通过因子分析发现LLM基准性能矩阵本质低秩,揭示任务冗余,提出基于潜在技能空间的评估框架,用于识别冗余任务、用小任务子集建模新模型和按技能轮廓选模型。

详情
AI中文摘要

当前对大型语言模型(LLM)的评估严重依赖于不断增长的基准集合和聚合基准分数,然而这种比较实际捕捉了什么,以及这些分数揭示了模型的哪些底层能力,仍不清楚。在此,我们提出了一种新的LLM评估范式,通过询问基准性能是反映许多独立能力,还是依赖于少量共享维度。为了回答这个问题,我们将因子分析(FA)应用于LLM与基准的大规模性能矩阵(60×44),揭示了该矩阵的固有低秩结构。也就是说,少量潜在因子捕捉了完整任务空间中的大部分结构。这种低秩几何揭示了现有任务之间存在大量冗余,并解释了为什么许多基准似乎测量了重叠的能力。我们进一步表明,这些潜在因子对应于连贯的、类似技能的LLM行为维度。利用这个潜在技能空间,我们为LLM评估和下游用户提供了三个实用工具:(i)识别冗余任务,(ii)使用少量任务子集对新模型进行画像,以及(iii)选择与所需技能轮廓一致的模型。我们的方法为单一聚合分数的事实标准提供了一个可靠的替代方案,并建立了一个可解释且实用的框架,用于理解和基准测试LLM的核心能力。

英文摘要

Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation, by asking whether benchmark performance reflects many independent abilities, or rather relies on a small number of shared dimensions. To answer this, we apply Factor Analysis (FA) to a massive performance matrix of LLMs versus benchmarks (60×44), revealing an intrinsically low-rank structure of that matrix. That is, a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to be measuring overlapping abilities. We further show that these latent factors correspond to coherent, skill-like dimensions of LLM behavior. Leveraging this latent skill-space, we deliver three practical tools for LLM evaluation and downstream users: (i) identifying redundant tasks, (ii) profiling new models using a small subset of tasks, and (iii) selecting models aligned with desired skill profiles. Our method provides a solid alternative to the de-facto standard of a single aggregate score, and establishes an interpretable and practical framework for understanding and benchmarking LLM core capabilities.
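A rough way to see what "intrinsically low-rank" means for a 60×44 score matrix is sketched below. It is only an illustration under stated assumptions: the scores are synthetic (three hypothetical latent skills plus noise), and a plain eigenvalue decomposition of the task-by-task Gram matrix stands in for the Factor Analysis used in the paper. All function names are made up for this sketch.

```python
import random

def synthetic_scores(n_models=60, n_tasks=44, n_skills=3, noise=0.05, seed=0):
    """Synthetic scores = (model skill levels) x (task skill loadings) + small noise."""
    rng = random.Random(seed)
    skills = [[rng.gauss(0, 1) for _ in range(n_skills)] for _ in range(n_models)]
    loadings = [[rng.gauss(0, 1) for _ in range(n_tasks)] for _ in range(n_skills)]
    return [
        [sum(skills[m][s] * loadings[s][t] for s in range(n_skills)) + rng.gauss(0, noise)
         for t in range(n_tasks)]
        for m in range(n_models)
    ]

def gram(matrix):
    """Task-by-task Gram matrix of the column-centered score matrix."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in matrix]
    return [[sum(r[i] * r[j] for r in centered) for j in range(d)] for i in range(d)]

def top_eigenvalues(sym, k, iters=200, seed=1):
    """Leading k eigenvalues of a symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    d = len(sym)
    a = [row[:] for row in sym]
    values = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        # Rayleigh quotient of the converged direction.
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(d)) for i in range(d))
        values.append(lam)
        # Deflate: subtract the component that was just found.
        for i in range(d):
            for j in range(d):
                a[i][j] -= lam * v[i] * v[j]
    return values

def explained_share(matrix, k):
    """Fraction of total variance captured by the k leading directions."""
    g = gram(matrix)
    total = sum(g[i][i] for i in range(len(g)))  # trace = total variance
    return sum(top_eigenvalues(g, k)) / total
```

Calling `explained_share(synthetic_scores(), 3)` on this synthetic matrix shows how a handful of directions can account for most of the variance across 44 tasks; on real benchmark data the share would depend on the models and tasks included.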

2507.10599 2026-06-12 cs.CL cs.AI cs.LG 版本更新

Emergence of Hierarchical Emotion Organization in Large Language Models

大型语言模型中层级情感组织的涌现

Maya Okawa, Bo Zhao, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of Tokyo(东京大学)

AI总结 受情感轮理论启发,分析大型语言模型输出中情感状态间的概率依赖关系,发现模型自然形成与人类心理模型一致的层级情感树,且更大模型发展出更复杂的层级结构,同时揭示社会经济角色在情感识别中的系统性偏差。

详情
Comments
ICML 2026
AI中文摘要

随着大型语言模型(LLMs)越来越多地驱动对话代理,理解它们如何建模用户的情绪状态对于伦理部署至关重要。受情感轮(即一种认为情感层级组织的心理学框架)的启发,我们分析了模型输出中情感状态之间的概率依赖关系。我们发现LLMs自然形成与人类心理模型一致的层级情感树,且更大的模型发展出更复杂的层级结构。我们还揭示了跨社会经济角色的情感识别中存在系统性偏差,对于交叉、代表性不足的群体,错误分类会叠加。人类研究显示出惊人的相似性,表明LLMs内化了社会感知的某些方面。除了突出LLMs中的涌现情感推理能力,我们的结果还暗示了利用认知基础理论开发更好模型评估的潜力。

英文摘要

As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, i.e., a psychological framework that argues emotions organize hierarchically, we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

2507.05019 2026-06-12 cs.LG cs.AI 版本更新

Meta-Learning Transformers to Improve In-Context Generalization

元学习变换器以改进上下文泛化

Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci

发表机构 * University of Trento, Italy(特伦托大学,意大利) Eindhoven University, Netherlands(埃因霍温大学,荷兰) University of Doha for Science and Technology, Qatar(多哈科学与技术大学,卡塔尔)

AI总结 提出利用多个小规模领域特定数据集训练上下文学习器,通过元学习提升跨领域泛化能力,并在持续学习和无监督场景下验证其鲁棒性。

详情
AI中文摘要

上下文学习使变换器模型能够仅基于输入提示泛化到新任务,无需任何权重更新。然而,现有的训练范式通常依赖于大型非结构化数据集,这些数据集存储成本高,难以评估质量和平衡性,并且由于包含敏感信息而引发隐私和伦理问题。受这些局限性和风险的启发,我们提出了一种替代训练策略,利用多个小规模、领域特定的数据集集合。我们经验性地证明,此类数据质量的提高和多样性的增加提升了上下文学习器在其训练领域之外的泛化能力,同时与在单个大规模数据集上训练的模型相比,性能相当。我们通过利用元学习在Meta-Album集合上训练上下文学习器来研究这一范式,在多种设置下进行实验。首先,我们在受控环境中展示性能,其中测试领域完全排除在训练知识之外。其次,我们探索这些模型在信息可访问时间有限的持续场景中对遗忘的鲁棒性。最后,我们探索更具挑战性的无监督场景。我们的发现表明,当在精心策划的数据集集合上训练时,变换器仍然能够泛化用于上下文预测,同时在模块化和可替换性方面提供了优势。

英文摘要

In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.

2507.03660 2026-06-12 cs.LG 版本更新

Single vs. Multiple Branches in DeepONet and S-DeepONet: Network Architecture Follows Coupling in Multiphysics Systems

DeepONet和S-DeepONet中的单分支与多分支:网络架构遵循多物理系统中的耦合

Jaewan Park, Kazuma Kobayashi, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam

发表机构 * National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校国家超级计算应用中心) The Grainger College of Engineering, Mechanical Science and Engineering, University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校格兰杰工程学院机械科学与工程系) The Grainger College of Engineering, Nuclear, Plasma & Radiological Engineering, University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校格兰杰工程学院核、等离子体与放射工程系) Department of Industrial and Manufacturing Systems Engineering, Kansas State University(堪萨斯州立大学工业与制造系统工程系) Civil and Urban Engineering Department, New York University Abu Dhabi, UAE(阿联酋纽约大学阿布扎比分校土木与城市工程系)

AI总结 研究比较单分支与多分支神经算子架构在强耦合多物理系统中的表现,发现单分支网络在紧耦合场景下通过共享潜在表示优于多分支,而多分支适用于解耦或单物理任务,代理模型加速高达1.8×10^4倍。

详情
AI中文摘要

复杂物理系统的实时预测需要从数据中学习并代表强多物理耦合的代理模型。深度算子网络在单物理问题中已显示出成功,但其在捕捉耦合系统(如热-机械或电-热耦合)中非线性相互作用方面的有效性仍未充分探索。这里我们提出一个实际问题:神经算子的架构是否应反映其旨在建模的物理耦合强度?我们比较了单分支和多分支设计,包括前馈和顺序循环形式,跨越三个代表性系统:具有异质源的反应-扩散问题、具有温度依赖电导率和焦耳热的非线性热电问题,以及钢凝固的粘塑性热-机械模型。单分支网络在紧耦合场景中通过鼓励共享潜在表示持续优于多分支变体,而多分支设计对于解耦或单物理任务仍然有利。一旦训练完成,这些代理模型提供全场预测的速度比基于物理的求解器快高达1.8×10^4倍。

英文摘要

Real-time prediction of complex physical systems requires surrogate models that learn from data while representing strong multiphysics coupling. Deep Operator Networks have shown success in single-physics problems, yet their effectiveness in capturing nonlinear interactions in coupled systems (such as thermo-mechanical or electro-thermal coupling) remains underexplored. Here we pose a practical question: should the architecture of a neural operator reflect the strength of physical coupling it aims to model? We compare single-branch and multi-branch designs, in both feedforward and sequential recurrent forms, across three representative systems: a reaction-diffusion problem with heterogeneous sources, a nonlinear thermo-electrical problem with temperature-dependent conductivity and Joule heating, and a viscoplastic thermo-mechanical model of steel solidification. Single-branch networks consistently outperform multi-branch variants in tightly coupled regimes by encouraging shared latent representations, whereas multi-branch designs remain favorable for decoupled or single-physics tasks. Once trained, these surrogates deliver full-field predictions up to $1.8 \times 10^4$ times faster than physics-based solvers.
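The structural difference between the two designs can be sketched in a few lines. This is an untrained, shape-only illustration, not the authors' implementation: it assumes the single-branch variant feeds both input fields, concatenated, into one branch network, and that the multi-branch variant merges separate branch outputs by an elementwise product before the usual DeepONet dot product with the trunk output. The helper names `dense`, `forward`, `single_branch`, and `multi_branch` are hypothetical.

```python
import math
import random

def dense(n_in, n_out, rng):
    """Random weights for one tanh layer (untrained; for illustration only)."""
    return [[rng.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)] for _ in range(n_out)]

def forward(weights, x):
    """Apply one tanh layer to the input vector x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def single_branch(u1, u2, y, branch, trunk):
    """One branch sees both sampled input fields, so they share a latent code."""
    b = forward(branch, u1 + u2)      # list concatenation of the two fields
    t = forward(trunk, y)             # trunk encodes the query coordinate y
    return sum(bk * tk for bk, tk in zip(b, t))

def multi_branch(u1, u2, y, branch1, branch2, trunk):
    """Each field gets its own branch; codes are merged by elementwise product."""
    b1 = forward(branch1, u1)
    b2 = forward(branch2, u2)
    t = forward(trunk, y)
    return sum(p * q * tk for p, q, tk in zip(b1, b2, t))
```

With `m` sensor values per field and latent width `p`, the single-branch layer has shape `p × 2m` while each multi-branch layer has shape `p × m`, which is where the shared versus separate latent representations come from.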

2506.23033 2026-06-12 cs.LG stat.ML 版本更新

How Reliable are Fairness Audits with Unreliable Data?

不可靠数据下的公平性审计有多可靠?

Yash Vardhan Tomar

发表机构 * Purdue University(普渡大学)

AI总结 研究受保护标签缺失对公平性缓解审计的影响,提出种子校准压力测试区分缺失效应与随机波动,发现正可用性缺失通常不改变缓解方法效果,但无标签端点表现不同,且阈值优化可能将单轴公平性增益转化为交叉危害。

详情
AI中文摘要

公平性审计是负责任机器学习部署的关键组成部分。然而,在不完全受保护标签访问下审计建议的可靠性仍然知之甚少。在这项工作中,我们关注公平性缓解审计中的受保护标签缺失。我们引入了一种种子校准压力测试,以将缺失效应与完全标签下已经存在的种子间波动分离开来。在ACS/Folktables任务中,我们发现正可用性缺失通常不会将选定的缓解方法移出完全标签的种子基线。无标签端点表现不同,暴露了ERM等效候选和确定性断点,而不是广泛的缺失效应。我们还发现,阈值优化可以将单轴公平性增益转化为高于零点的交叉危害,这是一种更尖锐的失败模式,在随机森林验证下似乎仍然可见。总体而言,我们的结果强调,在将受保护标签缺失视为审计脆弱性的证据之前,应报告种子零校准、候选集背景和交叉后果。

英文摘要

Fairness audits are a key component of responsible machine-learning deployment. Yet, audit-recommendation reliability under incomplete protected-label access is still poorly understood. In this work, we focused on protected-label missingness in fairness mitigation audits. We introduced a seed-calibrated stress test to separate missingness effects from seed-to-seed movement already present under complete labels. Across ACS/Folktables tasks, missingness settings that retain some protected labels usually do not move selected mitigation methods beyond a complete-label seed-to-seed baseline. At $0\%$ protected-label access, candidates collapse to an empirical-risk-minimization baseline and deterministic tie-breaking rather than revealing a broad missingness effect. We also found that threshold optimization can turn fairness gains on a single protected axis into intersectional harm above a seed baseline, and this threshold-optimizer finding persists under random-forest validation. Overall, our results highlight that protected-label missingness should be reported with seed-null calibration, candidate-set context, and intersectional consequences before it is treated as evidence of audit fragility.
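The idea of calibrating against seed-to-seed movement can be illustrated with a small check. The abstract does not specify the statistic used, so this sketch makes its own assumption: the "seed null" is taken to be the largest gap in a fairness metric across random seeds under complete labels, and a missingness effect is flagged only when the mean shift exceeds that gap. Both function names are hypothetical.

```python
from statistics import mean

def seed_null_band(complete_runs):
    """Largest seed-to-seed gap in a metric when all protected labels are available."""
    return max(complete_runs) - min(complete_runs)

def exceeds_seed_null(complete_runs, missing_runs):
    """True only if the shift under missingness is larger than seed noise alone."""
    shift = abs(mean(missing_runs) - mean(complete_runs))
    return shift > seed_null_band(complete_runs)
```

For example, with complete-label metric values `[0.10, 0.12, 0.11, 0.13]` the band is 0.03, so a missingness run averaging 0.12 would not be flagged, while one averaging about 0.20 would.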

2506.21855 2026-06-12 cs.CV 版本更新

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Periodic-MAE:用于rPPG估计的周期性视频掩码自编码器

Jiho Choi, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University, Republic of Korea(韩国全北国立大学电子与信息工程学部)

AI总结 提出Periodic-MAE,一种自监督框架,通过周期性感知掩码和生理频带约束,从无标签面部视频学习可泛化的时空表示,提升远程光电容积描记法(rPPG)估计性能。

详情
AI中文摘要

在本文中,我们提出Periodic-MAE,一种自监督框架,用于从无标签面部视频中学习周期性生理信号的通用时空表示。该方法利用掩码自编码器(MAE),通过重建掩码视频令牌学习高维面部表示,而不依赖远程光电容积描记法(rPPG)特定监督。为了明确地将表示学习与rPPG特征对齐,我们引入了一种基于视频重采样的周期性感知帧掩码策略,使编码器能够学习捕获与脉搏信号估计相关的准周期性时间模式的表示。此外,生理频带约束被集成到MAE预训练框架中,利用脉搏信号在频域的稀疏性,引导学习到的表示朝向生理上有意义的模式。预训练后,学习到的表示被迁移到下游rPPG估计任务,其中编码器作为通用特征提取器,从面部视频中恢复脉搏相关信号。我们在四个基准数据集(包括PURE、UBFC-rPPG、MMPD和V4V)上进行了广泛实验。此外,我们在无约束光照条件和受试者运动下收集的真实世界rPPG数据集上评估了所提方法。实验结果表明,Periodic-MAE持续改善了rPPG估计性能,特别是在具有挑战性的跨数据集和真实世界评估场景中。我们的代码可在以下网址获取:此 https URL。

英文摘要

In this paper, we propose Periodic-MAE, a self-supervised framework for learning generalizable spatio-temporal representations of periodic physiological signals from unlabeled facial videos. The proposed method leverages a masked autoencoder (MAE), which learns high-dimensional facial representations by reconstructing masked video tokens without relying on remote photoplethysmography (rPPG) specific supervision. To explicitly align representation learning with the characteristics of rPPG, we introduce a periodicity-aware frame masking strategy based on video resampling, enabling the encoder to learn representations that capture quasi-periodic temporal patterns relevant to pulse signal estimation. In addition, physiological bandlimit constraints are integrated into the MAE pre-training framework, exploiting the sparsity of pulse signals in the frequency domain to guide the learned representations toward physiologically meaningful patterns. After pre-training, the learned representations are transferred to downstream rPPG estimation, where the encoder serves as a generic feature extractor for recovering pulse-related signals from facial videos. We conduct extensive experiments on four benchmark datasets, including PURE, UBFC-rPPG, MMPD, and V4V. Moreover, we evaluate the proposed approach on a real-world rPPG dataset collected under unconstrained lighting conditions and subject motion. Experimental results demonstrate that Periodic-MAE consistently improves rPPG estimation performance, particularly in challenging cross-dataset and real-world evaluation settings. Our code is available at this https URL.

2502.18959 2026-06-12 cs.LG stat.ML 版本更新

Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential

傅里叶多分量与多层神经网络:解锁高频潜力

Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou

发表机构 * Department of Applied Mathematics(应用数学系) Hong Kong Polytechnic University(香港理工大学) Department of Mathematics(数学系) Duke University(杜克大学) Department of Mathematics and Statistics(数学与统计学系) Auburn University(奥本大学) School of Mathematics(数学学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出傅里叶多分量与多层神经网络(FMMNN),结合正弦型激活函数与多分量多层结构,通过低秩架构实现指数级函数逼近能力,优化景观优于标准全连接网络,并设计缩放随机初始化方法加速训练,在高频函数逼近任务中取得高精度与良好收敛性。

详情
Comments
Our code and implementation details are available at this https URL
AI中文摘要

神经网络的结构及其激活函数的选择对其性能至关重要。同样重要的是确保这两个元素良好匹配,因为它们的对齐是有效表示和学习的关键。在本文中,我们引入了傅里叶多分量与多层神经网络(FMMNN),该模型将正弦型激活函数与MMNN的多分量多层结构相结合。在FMMNN中,每个分量表示为固定随机正弦型基函数的可训练线性组合,而多层组合则生成更复杂且自适应的频率特征。我们证明,即使在低秩架构下,FMMNN仍能保持函数逼近的指数级表达能力。我们还分析了FMMNN的优化景观,发现其比标准全连接神经网络更有利,尤其是对于高频目标。此外,我们提出了一种针对FMMNN第一层权重的缩放随机初始化方法,当样本充足时,该方法能加速训练并提高最终性能。大量数值实验支持我们的理论见解,表明FMMNN在振荡函数逼近基准上实现了高精度和良好的收敛行为。

英文摘要

The architecture of a neural network and the choice of its activation function are both fundamental to its performance. Equally important is ensuring that these two elements are well matched, as their alignment is key to effective representation and learning. In this paper, we introduce the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), a model that combines sine-type activations with the multi-component and multi-layer structure of MMNNs. In an FMMNN, each component is represented as a trainable linear combination of fixed random sine-type basis functions, while multi-layer composition generates more complex and adaptive high-frequency features. We establish that FMMNNs retain exponential expressive power for function approximation even under a low-rank architectural structure. We also analyze the optimization landscape of FMMNNs and find it to be substantially more favorable than that of standard fully connected neural networks, especially for high-frequency targets. In addition, we propose a scaled random initialization method for the first-layer weights in FMMNNs, which accelerates training and improves final performance when sufficient samples are available. Extensive numerical experiments support our theoretical insights, showing that FMMNNs achieve strong accuracy and favorable convergence behavior on oscillatory function-approximation benchmarks.
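The building block described above, a trainable linear combination of fixed random sine-type basis functions, can be sketched as follows. This is a single-component, single-layer toy under stated assumptions, not the FMMNN architecture itself: the frequency scale, width, plain gradient descent, and the function names are all choices made for this illustration, and no multi-layer composition is shown.

```python
import math
import random

def make_features(width, freq_scale, seed=0):
    """Fixed (non-trainable) random frequencies and phases for sin(w * x + b)."""
    rng = random.Random(seed)
    return [(rng.gauss(0, freq_scale), rng.uniform(0, 2 * math.pi)) for _ in range(width)]

def phi(x, feats):
    """Evaluate every sine basis function at the scalar input x."""
    return [math.sin(w * x + b) for w, b in feats]

def mse(coef, feats, xs, ys):
    """Mean squared error of the linear combination sum_j coef[j] * phi_j(x)."""
    return sum((sum(c * f for c, f in zip(coef, phi(x, feats))) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def fit(feats, xs, ys, steps=300):
    """Gradient descent on the outer coefficients only; the sine features stay fixed."""
    width, n = len(feats), len(xs)
    rows = [phi(x, feats) for x in xs]
    coef = [0.0] * width
    # Safe step size: since |sin| <= 1, the MSE Hessian's top eigenvalue is at most 2 * width.
    lr = 0.5 / width
    for _ in range(steps):
        resid = [sum(c * f for c, f in zip(coef, row)) - y for row, y in zip(rows, ys)]
        grad = [2.0 / n * sum(r * row[j] for r, row in zip(resid, rows)) for j in range(width)]
        coef = [c - lr * g for c, g in zip(coef, grad)]
    return coef
```

Because only the outer coefficients are trained, the fit is a convex least-squares problem; the choice of `freq_scale` plays a role loosely analogous to the scaled first-layer initialization mentioned in the abstract, since it decides which frequencies the fixed basis can represent.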

2402.01779 2026-06-12 eess.IV cs.CV cs.LG stat.ML 版本更新

Plug-and-Play image restoration with Stochastic deNOising REgularization

即插即用图像恢复:随机去噪正则化

Marien Renaud, Jean Prost, Arthur Leclaire, Nicolas Papadakis

AI总结 提出SNORE框架,仅在适当噪声水平图像上应用去噪器,结合随机正则化与梯度下降求解逆问题,在去模糊和修复任务上达到SOTA。

详情
AI中文摘要

即插即用(PnP)算法是一类迭代算法,通过结合物理模型和深度神经网络进行正则化来解决图像逆问题。尽管它们能产生令人印象深刻的图像恢复结果,但这些算法依赖于在迭代过程中噪声逐渐减小的图像上非标准地使用去噪器,这与最近基于扩散模型(DM)的算法形成对比,后者仅在重新加噪的图像上应用去噪器。我们提出了一种新的PnP框架,称为随机去噪正则化(SNORE),该框架仅在具有适当噪声水平的图像上应用去噪器。它基于显式的随机正则化,从而产生一种随机梯度下降算法来解决不适定逆问题。提供了该算法及其退火扩展的收敛性分析。实验上,我们证明SNORE在去模糊和修复任务上与最先进方法相比具有竞争力,无论是在定量还是定性方面。

英文摘要

Plug-and-Play (PnP) algorithms are a class of iterative algorithms that address image inverse problems by combining a physical model and a deep neural network for regularization. Although they produce impressive image restoration results, these algorithms rely on a non-standard use of a denoiser on images that become less and less noisy over the iterations, which contrasts with recent algorithms based on Diffusion Models (DM), where the denoiser is applied only on re-noised images. We propose a new PnP framework, called Stochastic deNOising REgularization (SNORE), which applies the denoiser only on images with an appropriate noise level. It is based on an explicit stochastic regularization, which leads to a stochastic gradient descent algorithm to solve ill-posed inverse problems. A convergence analysis of this algorithm and its annealing extension is provided. Experimentally, we show that SNORE is competitive with state-of-the-art methods on deblurring and inpainting tasks, both quantitatively and qualitatively.
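The key step, applying the denoiser only to a re-noised copy of the current iterate inside a stochastic gradient update, can be sketched on a toy problem. Everything specific here is an assumption for illustration: the forward operator is the identity, the prior is Gaussian so the MMSE denoiser has a closed form, the regularization gradient uses Tweedie's identity, and the name `snore_like` and its parameters are hypothetical. A real use would plug in a learned denoiser and the actual degradation operator.

```python
import random

def gaussian_denoiser(z, sigma, tau):
    """MMSE denoiser for a N(0, tau^2 I) prior at noise level sigma (toy stand-in)."""
    shrink = tau ** 2 / (tau ** 2 + sigma ** 2)
    return [shrink * zi for zi in z]

def snore_like(y, s=1.0, tau=1.0, sigma=0.5, alpha=1.0, step=0.05, iters=2000, seed=0):
    """Stochastic gradient descent on f(x) + alpha * R_sigma(x); returns a tail average."""
    rng = random.Random(seed)
    x = list(y)
    tail = [0.0] * len(y)
    burn = iters // 2
    for k in range(iters):
        # Re-noise the iterate so the denoiser only sees inputs at noise level sigma.
        x_noisy = [xi + sigma * rng.gauss(0, 1) for xi in x]
        d = gaussian_denoiser(x_noisy, sigma, tau)
        # Data-fidelity gradient for f(x) = ||x - y||^2 / (2 s^2).
        g_data = [(xi - yi) / s ** 2 for xi, yi in zip(x, y)]
        # Stochastic regularization gradient via Tweedie: (x_noisy - D(x_noisy)) / sigma^2.
        g_reg = [(zi - di) / sigma ** 2 for zi, di in zip(x_noisy, d)]
        x = [xi - step * (gd + alpha * gr) for xi, gd, gr in zip(x, g_data, g_reg)]
        if k >= burn:
            tail = [t + xi for t, xi in zip(tail, x)]
    return [t / (iters - burn) for t in tail]
```

In this Gaussian toy the expected objective is quadratic, so the averaged iterate should settle near `y / (1 + alpha * s**2 / (tau**2 + sigma**2))`, which gives a simple sanity check on the update.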

2506.01274 2026-06-12 cs.CV cs.AI 版本更新

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

ReFoCUS: 用于上下文理解的强化引导帧优化

Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro

发表机构 * Korea Advanced Institute of Science & Technology(韩国科学技术院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ReFoCUS框架,首次将在线策略梯度强化学习集成到视频大语言模型的帧级优化中,通过自回归和查询条件选择架构学习帧选择策略,无需显式帧级监督,提升视频问答推理准确性。

详情
Comments
Project page: this https URL
AI中文摘要

近期大型多模态模型(LMMs)的进展实现了有效的视觉-语言推理,然而视频理解能力仍受限于次优的帧选择策略,尽管视频专用LMMs发展迅速。先前的工作尝试通过静态启发式或外部检索模块来提供帧级信息,但这些方法往往无法捕捉与给定用户查询相关的视觉线索,混淆了原始视觉动态与真正的语义相关性。在本文中,我们介绍了ReFoCUS(用于上下文理解的强化引导帧优化),这是首个将在线策略梯度强化学习集成到视频-LLMs帧级优化的框架。ReFoCUS旨在学习帧选择策略,利用来自参考模型的奖励信号来捕捉其对最佳支持时间接地响应的帧组合的潜在评分行为。为了高效探索巨大的组合帧空间,我们采用了一种自回归且查询条件的选择架构,确保上下文一致性的同时降低复杂度。我们的策略学习无需显式帧级监督,因为它隐式地发现了最优且语义一致的帧组合。ReFoCUS在多个视频问答基准测试中持续提高了推理准确性,证明了将帧选择与模型内部效用对齐的优势。

英文摘要

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet video understanding remains constrained by suboptimal frame selection strategies, despite the rapid development of video-specialized LMMs. Prior works attempted to address this with static heuristics or external retrieval modules that supply frame-level information, but these approaches often fail to capture visual cues grounded in the given user queries, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS aims to learn a frame selection policy, leveraging reward signals derived from reference models to capture their underlying scoring behavior over frame combinations that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive and query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.

2505.23823 2026-06-12 cs.CL 版本更新

RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

RAGPPI:药物发现中蛋白质-蛋白质相互作用的RAG基准

Youngseung Jeon, Ziwen Li, Thomas Li, JiaSyuan Chang, Morteza Ziyadi, Xiang 'Anthony' Chen

发表机构 * University of California Los Angeles(加州大学洛杉矶分校) Palo Alto High School(帕洛阿尔托高中) Amazon AGI(亚马逊人工智能研究院)

AI总结 提出RAGPPI基准,包含4420个问答对,用于评估检索增强生成在药物发现中识别蛋白质-蛋白质相互作用生物学影响的能力。

详情
Comments
17 pages, 4 figures, 8 tables
AI中文摘要

检索蛋白质-蛋白质相互作用(PPI)的生物学影响对于药物开发中的靶点识别(Target ID)至关重要。由于涉及的蛋白质数量庞大,这一过程仍然耗时且具有挑战性。大型语言模型(LLMs)和检索增强生成(RAG)框架已支持靶点识别;然而,目前尚无用于识别PPI生物学影响的基准。为填补这一空白,我们引入了PPI的RAG基准(RAGPPI),这是一个包含4420个问答对的事实性问答基准,专注于PPI的潜在生物学影响。通过与专家访谈,我们确定了基准数据集的标准,例如问答类型和来源。我们通过专家驱动的数据标注构建了金标准数据集(500个问答对)。我们开发了一个集成自动评估LLM,该模型结合了专家标注特征、平均事实-摘要相似度(F1)和低相似度事实计数(F2),从而构建了银标准数据集(3720个问答对)。我们致力于维护RAGPPI作为支持研究社区推进药物发现问答解决方案的RAG系统的资源。

英文摘要

Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as the QA type and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact-abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

2505.22695 2026-06-12 cs.LG 版本更新

LLM-ODDR: A Large Language Model Framework for Joint Order Dispatching and Driver Repositioning

LLM-ODDR:一种用于联合订单调度和司机重新定位的大语言模型框架

Tengfei Lyu, Siyuan Feng, Hao Liu, Hai Yang

发表机构 * Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)人工智能学域) Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University(香港理工大学航空及民航工程学系) Research Center for Low Altitude Economy, The Hong Kong Polytechnic University(香港理工大学低空经济研究中心) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(香港科技大学计算机科学及工程学系) Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology(香港科技大学土木及环境工程学系)

AI总结 提出LLM-ODDR框架,利用大语言模型联合优化网约车订单调度与司机重新定位,通过多目标价值细化、公平感知调度和时空需求感知重定位提升效果、适应性和可解释性。

详情
Comments
Published in IEEE Transactions on Intelligent Transportation Systems (TITS)
AI中文摘要

网约车平台在动态城市环境中优化订单调度和司机重新定位操作面临重大挑战。基于组合优化、规则启发式和强化学习的传统方法往往忽视司机收入公平性、可解释性以及对现实动态的适应性。为弥补这些不足,我们提出LLM-ODDR,一种利用大语言模型(LLM)进行网约车服务中联合订单调度和司机重新定位(ODDR)的新型框架。LLM-ODDR框架包含三个关键组件:(1)多目标引导的订单价值细化,通过考虑多个目标评估订单以确定其整体价值;(2)公平感知的订单调度,平衡平台收入与司机收入公平性;(3)时空需求感知的司机重新定位,基于历史模式和预测供应优化空闲车辆放置。我们还开发了JointDR-GPT,一个针对ODDR任务进行领域知识微调的模型。在曼哈顿出租车运营的真实数据集上进行的大量实验表明,我们的框架在有效性、对异常条件的适应性以及决策可解释性方面显著优于传统方法。据我们所知,这是首次将LLM作为决策智能体应用于网约车ODDR任务,为将先进语言模型集成到智能交通系统中奠定了基础性见解。虽然当前框架的计算成本高于传统方法,但我们表明并行分解和模型蒸馏可以将延迟降低到可部署的生产水平。

英文摘要

Ride-hailing platforms face significant challenges in optimizing order dispatching and driver repositioning operations in dynamic urban environments. Traditional approaches based on combinatorial optimization, rule-based heuristics, and reinforcement learning often overlook driver income fairness, interpretability, and adaptability to real-world dynamics. To address these gaps, we propose LLM-ODDR, a novel framework leveraging Large Language Models (LLMs) for joint Order Dispatching and Driver Repositioning (ODDR) in ride-hailing services. The LLM-ODDR framework comprises three key components: (1) Multi-objective-guided Order Value Refinement, which evaluates orders by considering multiple objectives to determine their overall value; (2) Fairness-aware Order Dispatching, which balances platform revenue with driver income fairness; and (3) Spatiotemporal Demand-Aware Driver Repositioning, which optimizes idle vehicle placement based on historical patterns and projected supply. We also develop JointDR-GPT, a fine-tuned model optimized for ODDR tasks with domain knowledge. Extensive experiments on real-world datasets from Manhattan taxi operations demonstrate that our framework significantly outperforms traditional methods in terms of effectiveness, adaptability to anomalous conditions, and decision interpretability. To our knowledge, this is the first exploration of LLMs as decision-making agents in ride-hailing ODDR tasks, establishing foundational insights for integrating advanced language models within intelligent transportation systems. While the current framework incurs higher computational costs than traditional methods, we show that parallel decomposition and model distillation can reduce latency to production-viable levels.

2505.04021 2026-06-12 cs.DC cs.AI cs.LG cs.PF 版本更新

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

Prism: 通过GPU内存气球实现经济高效的多LLM服务

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, Ying Sheng

发表机构 * UCLA(加州大学洛杉矶分校) UC Berkeley(加州大学伯克利分校) Harvard University(哈佛大学) CMU(卡内基梅隆大学) University of Edinburgh(爱丁堡大学) Intel(英特尔) Stanford University(斯坦福大学) LMSYS ByteDance(字节跳动) Alibaba Cloud(阿里云) Tsinghua University(清华大学) Novita AI Rice University(莱斯大学)

AI总结 针对多LLM服务中资源效率低下的问题,提出基于内存气球的内存中心化LLM协同服务框架Prism,统一空间与时间共享,已在10K+ GPU生产环境部署。

详情
Comments
OSDI'26
AI中文摘要

推理提供商必须为许多LLM保持可用性,包括低流量但关键的模型,随着token价格下降,资源效率变得越来越重要。对生产轨迹的分析揭示了一种动态突发组模式,其中一组模型同时活跃并随时间变化;现有的空间和时间共享方法缺乏适应这种变化的原理性机制,迫使在SLO遵守和效率之间进行权衡。我们观察到弹性内存分配可以统一空间和时间共享。基于这一洞察,我们开发了Prism,一个以内存为中心的LLM协同服务框架,它应用内存气球来跨模型回收内存,并在单一方案下支持两种形式的共享。Prism的气球驱动程序,称为kvcached,已在https://github.com/... 开源,并在超过10K GPU的生产环境中部署。

英文摘要

Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increasingly important as token prices fall. Analysis of production traces reveals a dynamic bursty-group pattern in which sets of models become active together and shift over time; existing space- and time-sharing approaches lack principled mechanisms to adapt to this variability, forcing trade-offs between SLO adherence and efficiency. We observe that elastic memory allocation can unify spatial and temporal sharing. Based on this insight, we have developed Prism, a memory-centric LLM co-serving framework that applies memory ballooning to reclaim memory across models and support both forms of sharing under a single scheme. Prism's balloon driver, referred to as kvcached, has been open-sourced at this https URL, and deployed in production environments across 10K+ GPUs.

2505.01869 2026-06-12 cs.CV version update

Visual enhancement and 3D representation for underwater scenes: a review

Guoxi Huang, Haoran Wang, Brett Seymour, Evan Kovacs, John Ellerbroc, Dave Blackham, Nantheera Anantrasirichai

Affiliations * Visual Information Laboratory, University of Bristol; Submerged Resources Center, National Park Service; Marine Imaging Technologies, LLC; Gates Underwater Products, Inc; Esprit film and television Ltd

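AI summary: This paper reviews underwater visual enhancement and 3D reconstruction methods, from physical models to non-learning and data-driven techniques (such as NeRF and 3D Gaussian Splatting), evaluates a range of algorithms on benchmark datasets, and identifies future research directions.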

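Details
AI Chinese abstract (translated)

Underwater visual enhancement (UVE) and underwater 3D reconstruction face major challenges in computer vision and AI-based tasks because of the complex imaging conditions in aquatic environments. Although many enhancement algorithms have been developed, a comprehensive, systematic review covering both UVE and underwater 3D reconstruction is still missing. To advance research in these areas, we conduct an in-depth review from several perspectives. First, we introduce the basic physical models, emphasizing the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction designed specifically for underwater scenes. The paper assesses approaches ranging from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, and discusses how effectively they handle underwater distortions. We then evaluate state-of-the-art UVE and underwater 3D reconstruction algorithms quantitatively and qualitatively on multiple benchmark datasets. Finally, we identify key research directions for the future development of underwater vision.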

English abstract

Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. We then conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.

2408.17221 2026-06-12 cs.LG math.AG version update

Geometry of Lightning Self-Attention: Identifiability and Dimension

Nathan W. Henry, Giovanni Luca Marchetti, Kathlén Kohn

Affiliations * University of Toronto; Royal Institute of Technology (KTH)

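AI summary: Using tools from algebraic geometry, this paper analyzes the geometry of the function space of self-attention networks without normalization, describes the identifiability of deep attention, computes the dimension of the function space, characterizes the singular and boundary points of single-layer models, and conjectures an extension to the normalized case.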

Details
Comments
Accepted at ICLR 2025
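AI Chinese abstract (translated)

We consider the function spaces defined by self-attention networks without normalization and analyze their geometry theoretically. Because these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by describing the generic fibers of the parametrization for an arbitrary number of layers and, from this, compute the dimension of the function space. In addition, for single-layer models, we characterize the singular and boundary points. Finally, we propose a conjectural extension of our results to normalized self-attention networks, prove the conjecture in the single-layer case, and verify it numerically in the deep case.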

English abstract

We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
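The claim that un-normalized self-attention is polynomial, and that its parameters are not uniquely determined by the function, can be checked on a small example. The sketch below assumes the common single-layer form f(X) = V X (X^T K^T Q X) with tokens as the columns of X; this convention, the helper names, and the integer matrices are illustrative choices, not taken from the paper. It checks two easy facts: f is homogeneous of degree 3 in X, and replacing (Q, K) by (C Q, C^{-T} K) for an invertible C leaves the output unchanged because only K^T Q enters the formula. The paper's description of the generic fibers is more complete and is not reproduced here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scale(c, a):
    return [[c * x for x in row] for row in a]


def lightning_attention(q, k, v, x):
    """Un-normalized self-attention V X (X^T K^T Q X); tokens are the columns of x."""
    scores = matmul(matmul(transpose(x), transpose(k)), matmul(q, x))  # t x t
    return matmul(matmul(v, x), scores)


# Small integer example so that all comparisons are exact.
Q = [[1, 2], [0, 1]]
K = [[2, 0], [1, 1]]
V = [[1, -1], [3, 0]]
X = [[1, 0, 2], [-1, 1, 1]]  # 2 features, 3 tokens

out = lightning_attention(Q, K, V, X)

# Degree-3 homogeneity: doubling the input multiplies the output by 2**3.
cubic_ok = lightning_attention(Q, K, V, scale(2, X)) == scale(8, out)

# Reparametrization: (C Q, C^{-T} K) gives the same K^T Q, hence the same output.
C = [[1, 1], [0, 1]]
C_inv_T = [[1, 0], [-1, 1]]  # inverse transpose of C
same_output = lightning_attention(matmul(C, Q), matmul(C_inv_T, K), V, X) == out
```

With softmax normalization the map is no longer polynomial, which is why the abstract treats the normalized case separately as a conjectural extension.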