arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.03827 2026-05-26 cs.CV cs.RO

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun

Affiliations Huazhong University of Science and Technology; College of AI, Tsinghua University; Wuhan University of Technology; Lehigh University

AI summary To address memorization bias in evaluation on the LIBERO benchmark, the authors propose LIBERO-PRO, an extended benchmark that applies reasonable perturbations along four dimensions (manipulated objects, initial states, task instructions, and environments). It shows that existing VLA models drop from over 90% to 0.0%, and the authors call for robust evaluation methods.

Comments 10 pages, 7 figures, 0 tables

Abstract

LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.

2510.02837 2026-05-26 cs.AI cs.CL

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

Affiliations Graduate School of Data Science, KAIST, Daejeon, South Korea; Department of Industrial and Systems Engineering, KAIST, Daejeon, South Korea; Department of Artificial Intelligence, Yonsei University, Seoul, South Korea

AI summary For tool-augmented LLMs, the authors propose TRACE, a reference-free framework that uses an evidence bank to evaluate reasoning trajectories along multiple dimensions (efficiency, hallucination, and adaptivity), and validate it on a meta-evaluation dataset.

Comments International Conference on Machine Learning (ICML) 2026

Abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground-truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

2510.02361 2026-05-26 cs.CL cs.AI

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

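Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

Affiliations School of Artificial Intelligence, Beijing University of Posts and Telecommunications

AI summary To address the inference inefficiency caused by the quadratic complexity of Transformer self-attention, the authors propose ChunkLLM, a framework that uses a QK Adapter and a Chunk Adapter for chunk selection and compression, substantially accelerating inference while preserving performance.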

Abstract

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention's quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.

2510.01348 2026-05-26 cs.RO

Kilometer-Scale GNSS-Denied UAV Navigation via Heightmap Gradients: A Winning System from the SPRIN-D Challenge

Michal Werner, David Čapek, Tomáš Musil, Ondřej Franěk, Tomáš Báča, Martin Saska

Affiliations Faculty of Electrical Engineering, Czech Technical University in Prague

AI summary To address drift during long-range UAV flight in GNSS-denied environments, the authors propose a lightweight localization method that corrects drift through gradient-template matching of heightmaps, and demonstrate 9 km waypoint navigation in the SPRIN-D challenge.

Comments 8 pages

Abstract

Reliable long-range flight of unmanned aerial vehicles (UAVs) in GNSS-denied environments is challenging: integrating odometry leads to drift, loop closures are unavailable in previously unseen areas and embedded platforms provide limited computational power. We present a fully onboard UAV system developed for the SPRIN-D Funke Fully Autonomous Flight Challenge, which required 9 km long-range waypoint navigation below 25 m AGL (Above Ground Level) without GNSS or prior dense mapping. The system integrates perception, mapping, planning, and control with a lightweight drift-correction method that matches LiDAR-derived local heightmaps to a prior geo-data heightmap via gradient-template matching and fuses the evidence with odometry in a clustered particle filter. Deployed during the competition, the system executed kilometer-scale flights across urban, forest, and open-field terrain and reduced drift substantially relative to raw odometry, while running in real time on CPU-only hardware. We describe the system architecture, the localization pipeline, and the competition evaluation, and we report practical insights from field deployment that inform the design of GNSS-denied UAV autonomy.
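
The matching step described in the abstract can be illustrated with a toy example. The sketch below is not the authors' pipeline: it assumes tiny heightmaps stored as lists of rows, forward-difference gradients, and an exhaustive sum-of-squared-differences search, and the function names `gradients` and `match_offset` are hypothetical. It only shows one property of finite differences: a constant altitude offset between the local and prior maps cancels out when gradients are compared.

```python
def gradients(h):
    """Forward-difference gradients (gx, gy) of a 2D heightmap given as a list of rows."""
    rows, cols = len(h), len(h[0])
    gx = [[h[r][c + 1] - h[r][c] for c in range(cols - 1)] for r in range(rows - 1)]
    gy = [[h[r + 1][c] - h[r][c] for c in range(cols - 1)] for r in range(rows - 1)]
    return gx, gy


def match_offset(prior, local):
    """Find the (row, col) offset in `prior` whose gradients best fit those of `local`."""
    pgx, pgy = gradients(prior)
    lgx, lgy = gradients(local)
    lr, lc = len(lgx), len(lgx[0])
    best, best_cost = None, float("inf")
    for r0 in range(len(pgx) - lr + 1):
        for c0 in range(len(pgx[0]) - lc + 1):
            # Sum of squared gradient differences over the template window.
            cost = 0.0
            for r in range(lr):
                for c in range(lc):
                    cost += (pgx[r0 + r][c0 + c] - lgx[r][c]) ** 2
                    cost += (pgy[r0 + r][c0 + c] - lgy[r][c]) ** 2
            if cost < best_cost:
                best, best_cost = (r0, c0), cost
    return best, best_cost


# Synthetic prior map (hypothetical terrain): height = r^2 + 3 c^2 on a 6 x 8 grid.
prior = [[r * r + 3 * c * c for c in range(8)] for r in range(6)]
# Local map: the 3 x 3 patch at offset (2, 3), shifted by an unknown altitude bias of 100.
local = [[prior[2 + r][3 + c] + 100 for c in range(3)] for r in range(3)]

offset, cost = match_offset(prior, local)
print(offset, cost)
```

In a real system the match score would be fused with odometry, for example in the particle filter the abstract mentions, rather than taken as a position fix on its own.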

2509.24978 2026-05-26 cs.AI cond-mat.quant-gas quant-ph

Agentic Exploration of Physics Models

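Maximilian Nägele, Florian Marquardt

Affiliations Max Planck Institute for the Science of Light

AI summary The authors propose SciExplorer, an agent that uses the tool-use capabilities of large language models to explore unknown physical systems without domain-specific blueprints, recovering equations of motion and Hamiltonians through experiments and observations.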

Abstract

The process of scientific discovery relies on an interplay of observations, analysis, and hypothesis generation. Machine learning is increasingly being adopted to address individual aspects of this process. However, it remains an open challenge to fully automate the heuristic, iterative loop required to discover the laws of an unknown system by exploring it through experiments and analysis, without tailoring the approach to the specifics of a given task. Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the agent. We test SciExplorer on a broad set of models spanning mechanical dynamical systems, wave evolution, and quantum many-body physics. Despite using a minimal set of tools, primarily based on code execution, we observe impressive performance on tasks such as recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values. The demonstrated effectiveness of this setup opens the door towards similar scientific exploration in other domains, without the need for finetuning or task-specific instructions.

2509.24621 2026-05-26 cs.CV

FreeRet: MLLMs as Training-Free Retrievers

Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Chunxu Liu, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang

Affiliations Nanjing University; Shanghai AI Lab; Shanghai Jiaotong University; Institute of Science Tokyo; Zhejiang University

AI summary The authors propose FreeRet, a framework that turns off-the-shelf multimodal large language models into two-stage retrievers without additional training, improving retrieval performance through semantic embeddings and reranking.

Comments ICML 2026

Abstract

Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
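
The two-stage pattern named in the abstract (fast embedding search, then reranking of a shortlist) can be sketched generically. This is not FreeRet's implementation: the embeddings are toy vectors, `two_stage_retrieve` is a hypothetical name, and the reranker is a plain lookup table standing in for whatever model-based scoring the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_stage_retrieve(query_vec, corpus, rerank_score, k=2):
    """Stage 1: keep the k candidates closest to the query in embedding space.
    Stage 2: reorder only those k candidates with a (more expensive) scoring function.

    corpus: list of (doc_id, embedding) pairs.
    rerank_score: callable mapping doc_id to a float; a stand-in for a model-based reranker.
    """
    shortlist = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:k]
    return sorted((doc_id for doc_id, _ in shortlist), key=rerank_score, reverse=True)


# Toy data (hypothetical): three documents with 2D embeddings.
corpus = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
rerank_table = {"a": 0.2, "b": 0.9, "c": 1.0}

result = two_stage_retrieve([1.0, 0.0], corpus, rerank_table.get, k=2)
print(result)
```

Document "c" has the highest reranker score but never reaches stage 2, which is the usual trade-off of this design: the reranker can only fix the order of what the embedding search already surfaced.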

2509.23651 2026-05-26 cs.RO

HeLoM: Hierarchical Learning for Whole-Body Loco-Manipulation by a Hexapod Robot

Xinrong Yang, Peizhuo Li, Hongyi Li, Yifeng Peng, Arhaan Jain, Junkai Lu, Linnan Chang, Yuhong Cao, Yifeng Zhang, Ge Sun, Guillaume Sartoretti

Affiliations MARMoT Lab, Department of Mechanical Engineering, National University of Singapore; Center for X-mechanics, Zhejiang University

AI summary The authors propose HeLoM, a hierarchical framework that coordinates multi-limb control so that a hexapod robot can stably push heavy or irregular objects, and validate it in simulation and real-world experiments.

Abstract

In nature, animals often need to move/manipulate objects comparable in weight/size to their own bodies. Compared to grasping and carrying, pushing provides a more straightforward and efficient non-prehensile manipulation strategy, avoiding complex grasp design while leveraging direct contact to regulate an object's pose during interaction. Achieving effective pushing, however, requires both sufficient manipulation capability and stable whole-body coordination, which is particularly challenging when dealing with heavy or irregular objects. To address these challenges, we propose HeLoM, a learning-based hierarchical whole-body manipulation framework for hexapod robots that exploits coordinated multi-limb control and is applicable to multi-legged robotic systems. Inspired by the cooperative strategies of multi-legged insects, our framework leverages multiple contact points and high degrees of freedom to enable efficient and dynamic whole-body coordination during object interaction. HeLoM's high-level planner plans pushing behaviors, while its low-level controller maintains locomotion stability and generates dynamically consistent joint actions. This design enables the robot to maintain balance while executing continuous and controllable pushing behaviors through coordinated foreleg interaction and supportive hind-leg propulsion. We validate the effectiveness of HeLoM through both simulation and real-world experiments. Results show that our framework can stably push objects of varying sizes and unknown physical properties to designated goal poses in the real world.

2509.22299 2026-05-26 cs.LG cs.AI

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

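Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

Affiliations School of Software Technology, Zhejiang University; FABU Inc.; Hangzhou Kuaidi Science and Technology Co., Ltd.

AI summary To address the accuracy loss caused by coarse-grained expert pruning in MoE models, the authors propose HEAPr, an algorithm that decomposes experts into atomic experts, scores their importance with second-order information (following Optimal Brain Surgeon principles), and simplifies the computation in output space, achieving nearly lossless compression at pruning ratios of 20% to 25% in most models.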

Comments ICLR 2026

Journal ref Proceedings of the International Conference on Learning Representations (ICLR), 2026

Abstract

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to the Optimal Brain Surgeon theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including DeepSeek MoE and Qwen MoE family, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of pruning ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at pruning ratios of 20% ~ 25% in most models, while also reducing FLOPs nearly by 20%. The code can be found at https://github.com/LLIKKE/HEAPr.
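
For readers unfamiliar with the Optimal Brain Surgeon (OBS) theory the abstract refers to, the classical single-weight result is reproduced below as background. It is the textbook OBS criterion, not HEAPr's atomic-expert criterion, which the abstract does not spell out. Here $H$ is the Hessian of the loss with respect to the parameters, $w_q$ is the weight being removed, $e_q$ is the corresponding unit vector, $L_q$ is the estimated increase in loss, and $\delta w$ is the compensating update to the remaining weights.

```latex
\[
L_q = \frac{w_q^{2}}{2\,[H^{-1}]_{qq}},
\qquad
\delta w = -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q
\]
```

A Hessian over $n$ parameters has $n^2$ entries, so for a $d \times d$ weight matrix ($n = d^2$) storing it takes on the order of $d^4$ values. This is presumably the source of the $O(d^4)$ figure in the abstract, although the abstract does not say so explicitly.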

2509.21592 2026-05-26 cs.CV cs.AI cs.LG

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

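Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

Affiliations Visual Geometry Group, University of Oxford

AI summary The authors propose a method for forecasting future motion from a single image by generating dense trajectory grids, which captures scene dynamics and uncertainty, yields more accurate and diverse predictions than existing methods, and is shown to be effective on downstream tasks such as robotics.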

Journal ref ICLR 2026

Abstract

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

2509.17057 2026-05-26 cs.RO

RoboManipBaselines: A Unified Framework for Imitation Learning in Robotic Manipulation across Real and Simulation Environments

Masaki Murooka, Tomohiro Motoda, Ryoichi Nakajo, Hanbit Oh, Koshi Makihara, Keisuke Shirai, Tetsuya Ogata, Yukiyasu Domae

Affiliations Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST); CNRS-AIST JRL (Joint Robotics Laboratory), IRL; Institute for AI and Robotics, Future Robotics Organization, Waseda University; AI Robot Association (AIRoA)

AI summary The authors present RoboManipBaselines, an open-source framework that supports the full imitation learning pipeline for robotic manipulation (data collection, policy training, and deployment) in both simulation and real-world environments, and validate it through benchmark evaluations and research applications.

Comments Added a Limitations section in response to comments from reviewers

Journal ref IEEE Access 2026

Abstract

We present RoboManipBaselines, an open-source software framework for imitation learning research in robotic manipulation. The framework supports the entire imitation learning pipeline, including data collection, policy training, and rollout, across both simulation and real-world environments. Its design emphasizes integration through a consistent workflow, generality across diverse environments and robot platforms, extensibility for easily adding new robots, tasks, and policies, and reproducibility through evaluations using publicly available datasets. RoboManipBaselines systematically implements the core components of imitation learning: environment, dataset, and policy. Through a unified interface, the framework supports multiple simulators and real robot environments, as well as multimodal sensors and a wide variety of policy models. We further present benchmark evaluations in both simulation and real-world environments and introduce several research applications, including data augmentation, integration with tactile models, interactive robotic systems, 3D sensing evaluation, and hardware extensions. These results demonstrate that RoboManipBaselines provides a useful foundation for advancing research and experimental validation in robotic manipulation using imitation learning. https://isri-aist.github.io/RoboManipBaselines-ProjectPage

2509.16139 2026-05-26 cs.LG

Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media

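M. Giselle Fernández-Godino, Meir H. Shachar, Kevin Korner, Jonathan L. Belof, Mukul Kumar, Jonathan Lind, William J. Schill

Affiliations Lawrence Livermore National Laboratory

AI summary The authors propose a multi-field spatio-temporal model (MSTM) that is trained on multiscale multiphysics data and simultaneously evolves seven coupled thermodynamic and kinetic fields, predicting anomalous responses in shock propagation with high accuracy and a 1000x speedup.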

Comments 25 pages, 12 figures

Abstract

Predicting the extreme hydrodynamic response of porous and architected lattice materials is a fundamental challenge in high energy density physics, where shock-induced pore collapse, baroclinic vorticity, and anomalous kinetic and thermodynamic states must be resolved across multiple scales. Traditional high-fidelity hydrocodes are computationally prohibitive for large-scale design exploration in applications like planetary defense and inertial confinement fusion. We present a multi-field spatio-temporal model (MSTM) designed to overcome the limitations of standard machine learning surrogates, which often fail to capture the sharp gradients and non-linear field couplings characteristic of shock propagation. By training on high-fidelity, multiscale multiphysics data, MSTM simultaneously evolves seven coupled thermodynamic and kinetic fields - including pressure, temperature, density, and velocity - across complex material architectures. Our framework demonstrates the ability to accurately predict anomalous responses, such as counterintuitive post-shock density reductions and localized hotspot formation, with mean root mean squared errors as low as 1.4%. Crucially, the model's multi-field formulation maintains physical consistency and interface stability over long autoregressive rollouts, outperforming single-field models by 94% in structural fidelity. This framework enables a 1000x reduction in time to solution, providing a practical pathway for the real-time analysis and optimization of energy dissipation and momentum transfer in meso-structured media.

2509.09658 2026-05-26 cs.CV

Measuring Epistemic Humility in Multimodal Large Language Models

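Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou

Affiliations Mohamed bin Zayed University of Artificial Intelligence; Hong Kong Baptist University

AI summary The authors propose HumbleBench, a benchmark that adds a "None of the above" option to forced-choice multiple-choice questions in order to evaluate whether multimodal large language models reject false options, a humility-related behavior.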

Abstract

Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks another important capability for trustworthy AI: recognizing when none of the provided options is supported by the image and abstaining from committing to a false choice, a humility-related behavior. We present HumbleBench, a new hallucination benchmark designed to evaluate false-option rejection in MLLMs under a forced-choice multiple-choice setting with a "None of the above" option. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations for objects and relations, use candidate attribute cues, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including general-purpose, specialized reasoning, and proprietary models -- on HumbleBench and report empirical findings for the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites by assessing a narrower but important abstention-oriented behavior that is relevant to trustworthy multimodal reasoning. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
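
To make the evaluation setting concrete, the following minimal sketch scores multiple-choice predictions and reports accuracy separately on the questions whose correct answer is "None of the above". It is not the benchmark's official scorer: the option letter `E`, the function name `score_predictions`, and the toy answers are all assumptions made for illustration.

```python
NOTA = "E"  # hypothetical letter assigned to the "None of the above" option


def score_predictions(predictions, answers):
    """Return (overall accuracy, accuracy on questions whose answer key is NOTA).

    The second value is None when no question in the set has NOTA as its key.
    """
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equally long")
    correct = [p == a for p, a in zip(predictions, answers)]
    nota_hits = [ok for ok, a in zip(correct, answers) if a == NOTA]
    overall = sum(correct) / len(correct)
    nota_acc = sum(nota_hits) / len(nota_hits) if nota_hits else None
    return overall, nota_acc


# Toy run: the model answers "B" on a question where no listed option is valid.
overall, nota_acc = score_predictions(["A", "E", "B", "C"], ["A", "E", "E", "C"])
print(overall, nota_acc)
```

Reporting the two numbers side by side separates ordinary recognition accuracy from the false-option rejection behavior the abstract describes.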

2509.04445 2026-05-26 cs.LG

Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment

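Cyrus Cousins, Vijay Keswani, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong

Affiliations Duke University; IIT Delhi; CMU

AI summary Because standard preference elicitation methods fail to capture the cognitive processes behind human decisions, the authors propose an axiomatic approach to learning cognitively faithful decision processes from pairwise comparisons, and validate the resulting models on a kidney allocation task.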

Comments In ICLR 2026

Abstract

Recent AI trends seek to align AI models to learned human-centric objectives, such as personal preferences, utility, or societal values. Using standard preference elicitation methods, researchers and practitioners build models of human decisions and judgments, to which AI models are aligned. However, standard elicitation methods often fail to capture the cognitive processes behind human decision making, such as heuristics or simplifying structured thought patterns. To address this failure, we take an axiomatic approach to learning cognitively faithful decision processes from pairwise comparisons. Building on the literature analyzing cognitive processes that shape human decision-making, we derive a model class in which features are first processed with learned rules, then aggregated via a fixed rule, such as the Bradley-Terry rule, to produce a decision. This structured processing of information ensures that such models are realistic and feasible candidates to represent underlying human decision-making processes. We demonstrate the efficacy of this modeling approach by learning interpretable models of human decision making in a kidney allocation task, and show that our proposed models match or surpass the accuracy of prior models of human pairwise decision-making.
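
The Bradley-Terry rule named in the abstract turns two scalar scores into a choice probability. The sketch below pairs it with a deliberately simple per-feature rule to mirror the "process features, then aggregate" structure. The threshold rule, the feature names, the cutoffs, and the additive combination are hypothetical and are not taken from the paper; only the Bradley-Terry formula itself is standard.

```python
import math


def bradley_terry(score_a, score_b):
    """Probability that option A is chosen over option B under the Bradley-Terry rule."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def thresholded(value, cutoff):
    """Hypothetical per-feature rule: a feature only counts once it passes a cutoff."""
    return 1.0 if value >= cutoff else 0.0


def patient_score(patient):
    # Hypothetical features and cutoffs, for illustration only.
    return (thresholded(patient["years_waiting"], 3)
            + thresholded(patient["expected_benefit"], 0.5))


patient_a = {"years_waiting": 5, "expected_benefit": 0.7}
patient_b = {"years_waiting": 1, "expected_benefit": 0.8}

p_a = bradley_terry(patient_score(patient_a), patient_score(patient_b))
print(round(p_a, 3))
```

Because the aggregation rule is fixed, only the per-feature rules would need to be learned in such a setup, which is what keeps a model of this shape easy to inspect.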

2509.00056 2026-05-26 cs.CV

Apex-Centered Spatio-Temporal Rank Pooling and Gradient Attention for Micro-Expression Recognition

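Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

Affiliations Faculty of Information Technology, VNU University of Engineering and Technology

AI summary The authors propose the Micro-expression Spatio-Temporal Image (MESTI) and the Micro-expression Gradient Attention Network (MEGANet), which improve micro-expression recognition through a new input modality and a new attention mechanism.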

Abstract

Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a micro-expression-specific reformulation of dynamic rank pooling that transforms a video sequence into a single image while emphasizing the onset-apex-offset temporal pattern of micro-expressions. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a proposed Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across regular architectures. Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet is also evaluated, showing that our proposed network achieves state-of-the-art results on the SMIC-HS and SAMM datasets and competitive performance on the CASMEII dataset; it also achieves leading performance in the reported cross-dataset evaluation settings. The combination of MESTI and MEGANet consistently outperforms the compared methods. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, pointing toward more effective MER systems in a variety of applications.

2508.19988 2026-05-26 cs.CL

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

AgentCoMa:一个混合常识与数学推理的现实场景组合基准

Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei

Affiliations: Imperial College London; RIKEN; University of Sheffield; University College London

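AI summary: Proposes the AgentCoMa benchmark to test LLMs on tasks that compose commonsense and mathematical reasoning, finding that models are accurate on the individual steps but lose nearly 30% accuracy on average when the steps are combined.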

Comments ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)在涉及多个推理步骤组合的复杂常识和数学问题上取得了高准确率。然而,当前测试这些技能的组合基准往往侧重于常识或数学推理,而解决现实世界任务的LLM智能体需要两者的结合。在这项工作中,我们引入了一个智能体常识与数学基准(AgentCoMa),其中每个组合任务需要一个常识推理步骤和一个数学推理步骤。我们在61个不同规模、模型家族和训练策略的LLM上进行了测试。我们发现,LLM通常可以孤立地解决这两个步骤,但当两者结合时,它们的准确率平均下降近30%。这比我们在先前组合相同推理类型多个步骤的组合基准中观察到的性能差距要大得多。相比之下,非专家人类标注者可以以同样高的准确率解决AgentCoMa中的组合问题和各个步骤。此外,我们进行了一系列可解释性研究,以更好地理解性能差距,检查了神经元模式、注意力图和成员推断。我们的工作强调了在混合类型组合推理背景下模型脆弱性的显著程度,并为未来的改进提供了一个测试平台。

英文摘要

Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by nearly 30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.

2508.19113 2026-05-26 cs.AI

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

混合深度搜索器:可扩展的并行与顺序搜索推理

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

Affiliations: Seoul National University; LG AI Research; University of Illinois Chicago; University of Seoul

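AI summary: Proposes HybridDeepSearcher, a hybrid search strategy that combines parallel query expansion and explicit evidence aggregation with sequential reasoning, significantly improving performance on multiple benchmarks and achieving test-time search scaling.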

Comments Accepted to ICLR 2026

详情
AI中文摘要

大型推理模型(LRMs)结合检索增强生成(RAG)使得深度研究智能体能够通过外部知识检索进行多步推理。然而,我们发现现有方法很少展示测试时搜索扩展。通过单查询顺序搜索扩展推理的方法受限于证据覆盖范围,而每步生成多个独立查询的方法通常缺乏结构化聚合,阻碍了更深的顺序推理。我们提出一种混合搜索策略来解决这些限制。我们引入了HybridDeepSearcher,一种结构化的搜索智能体,它在进入更深的顺序推理之前集成了并行查询扩展与显式证据聚合。为了监督这种行为,我们引入了HDS-QA,一个新颖的数据集,通过包含并行子查询的监督推理-查询-检索轨迹,指导模型将广泛的并行搜索与结构化聚合相结合。在五个基准上,HybridDeepSearcher显著优于现有技术,在FanOutQA上F1分数提高+15.9,在BrowseComp子集上提高+9.2。进一步分析显示其一致的测试时搜索扩展:随着允许的额外搜索轮次或调用次数增加,性能持续提升,而竞争方法则趋于平稳。

英文摘要

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, we find that existing approaches rarely demonstrate test-time search scaling. Methods that extend reasoning through single-query sequential search suffer from limited evidence coverage, while approaches that generate multiple independent queries per step often lack structured aggregation, hindering deeper sequential reasoning. We propose a hybrid search strategy to address these limitations. We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning. To supervise this behavior, we introduce HDS-QA, a novel dataset that guides models to combine broad parallel search with structured aggregation through supervised reasoning-query-retrieval trajectories containing parallel sub-queries. Across five benchmarks, HybridDeepSearcher significantly outperforms the state-of-the-art, improving F1 scores by +15.9 on FanOutQA and +9.2 on a subset of BrowseComp. Further analysis shows its consistent test-time search scaling: performance improves as additional search turns or calls are allowed, while competing methods plateau.

2508.13309 2026-05-26 cs.CV cs.LG

DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples

DASH:一种用于合成有效且隐蔽的对抗样本的元攻击框架

Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

Affiliations: University of Maine; University of Florida; University of Tennessee, Knoxville

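AI summary: Proposes DASH, a meta-attack framework that adaptively composes Lp-constrained attack methods over multiple stages to generate effective, perceptually aligned adversarial examples, outperforming existing methods on multiple datasets.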

Comments Accepted to CVPR 2026

详情
AI中文摘要

在白盒设置下,已有大量技术被提出用于在严格的Lp范数约束下生成对抗样本。然而,这类范数受限的样本往往与人类感知不一致,只有少数方法专门探索感知对齐的对抗样本。此外,尚不清楚能否有效利用Lp约束攻击的见解来提升感知效能。本文介绍DASH,一个完全可微的元攻击框架,通过策略性地组合现有基于Lp的攻击方法,生成有效且感知对齐的对抗样本。DASH以多阶段方式运行:在每个阶段,它使用学习到的自适应权重聚合来自多个基础攻击的候选对抗样本,并将结果传播到下一阶段。一种新颖的元损失函数通过联合最小化误分类损失和感知失真来指导这一过程,使框架能够动态调整每个基础攻击在各阶段的贡献。我们在CIFAR-10、CIFAR-100和ImageNet上对对抗训练模型评估DASH。尽管仅依赖基于Lp约束的方法,DASH显著优于最先进的感知攻击如AdvAD,实现了更高的攻击成功率(例如提升20.63%)和更优的视觉质量(以SSIM、LPIPS和FID衡量,分别提升约11、0.015和5.7)。此外,DASH对未见过的防御具有良好的泛化能力,使其成为评估鲁棒性的实用且强大的基线,无需为每种新防御手工设计自适应攻击。

英文摘要

Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only a few methods specifically explore perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained methods, DASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of $\approx$11, 0.015, and 5.7, respectively). Furthermore, DASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.

2508.12628 2026-05-26 cs.CV

Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning

Creative4U: 基于MLLMs的广告创意图像选择器与比较推理

Yukang Lin, Xiang Zhang, Shichang Jia, Bowen Wan, Chenghan Fu, Xudong Ren, Yueran Liu, Wanxian Guan, Pengji Wang, Jian Xu, Bo Zheng, Baolin Liu

Affiliations: Tsinghua University; Alibaba Group

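AI summary: Proposes an MLLM-based paradigm for assessing and selecting creative images, building the comparative-reasoning dataset CreativePair and the reinforcement-learning-based selector Creative4U to enable explainable creative selection.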

详情
AI中文摘要

广告中的创意图像是电子商务平台的核心和灵魂。引人注目的创意图像可以提升用户的购物体验,增加广告主的收入以及平台的广告收入。随着AIGC技术的出现,广告主能够以极低的成本生产大量创意图像。然而,他们难以评估创意质量以进行选择。现有方法主要关注创意排序,无法满足可解释的创意选择需求。在这项工作中,我们提出了首个可解释的创意评估与选择范式。借助多模态大语言模型(MLLMs),我们的方法将创意图像的评估与选择整合到自然语言生成任务中。为了促进这项研究,我们构建了CreativePair,这是首个比较推理驱动的创意数据集,包含8k个带标注的图像对,每个样本包含一个标签,指示哪张图像更优。此外,我们引入了Creative4U(读作Creative for You),一种基于MLLMs的创意选择器,它考虑了用户的兴趣。通过Reason-to-Select RFT,其中包括基于思维链的监督微调(CoT-SFT)和基于组相对策略优化(GRPO)的强化学习,Creative4U能够准确评估和选择创意图像。离线和在线实验均证明了我们方法的有效性。我们的代码和数据集将公开,以推动研究和工业应用。

英文摘要

Creative images in advertising are the heart and soul of e-commerce platforms. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess creative quality when selecting among them. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection. In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), an MLLM-based creative selector that takes into account users' interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO) based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications.

2508.03104 2026-05-26 cs.LG cs.AI

HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

HiTeC: 基于语义感知增强的文本属性超图层次对比学习

Mengting Pan, Fan Li, Chen Chen, Xiaoyang Wang, Wenjie Zhang

Affiliations: The University of New South Wales; University of Wollongong

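AI summary: Proposes HiTeC, a two-stage hierarchical contrastive learning framework that combines structure-aware text-encoder pre-training with semantic-aware augmentation to address weak text-topology coupling, noise from random augmentation, and long-range dependency capture in text-attributed hypergraphs.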

Comments 16 pages, 8 figures

详情
AI中文摘要

对比学习已成为自监督超图学习的主流范式,能够在无需昂贵标签的情况下实现有效训练。然而,现实世界超图中的节点实体通常关联丰富的文本信息,这在先前工作中被大量忽略。直接将现有基于对比学习的方法应用于此类文本属性超图(TAHGs)会导致三个关键限制:(1)普遍使用的图无关文本编码器无法捕获文本语义与超图拓扑之间的相关性,导致表示表达能力不足。(2)它们对随机数据增强的依赖引入了噪声并削弱了对比信号。(3)主要关注节点和超边级别的对比信号限制了捕获长程依赖的能力,而这对于有效的表示学习至关重要。为解决这些挑战,我们引入了HiTeC,一个两阶段层次对比学习框架,用于在TAHGs上进行有效的自监督学习。在第一阶段,我们使用结构感知的对比目标预训练文本编码器,以克服传统方法的图无关特性。在第二阶段,我们首先引入语义感知增强,包括结构上下文化的文本增强和语义感知的超边丢弃,以促进信息丰富的视图生成。随后,我们提出一个多尺度对比损失,结合基于$s$步行走的子图级别目标,以捕获长程依赖。在六个真实世界数据集上的大量实验验证了我们提出方法的有效性。

英文摘要

Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which has been largely ignored in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders fails to capture the correlations between textual semantics and hypergraph topology, resulting in less expressive representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive signals. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for effective representation learning. To address these challenges, we introduce HiTeC, a two-stage hierarchical contrastive learning framework for effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we begin by introducing semantic-aware augmentations, including structure-contextualized text augmentation and semantic-aware hyperedge dropping, to facilitate informative view generation. Subsequently, we propose a multi-scale contrastive loss with an $s$-walk-based subgraph-level objective to capture long-range dependencies. Extensive experiments on six real-world datasets validate the effectiveness of our proposed method.

2507.07644 2026-05-26 cs.AI

FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

FloorplanQA:使用结构化表示进行大语言模型空间推理的基准测试

Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

Affiliations: King Abdullah University of Science and Technology; Miami University

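AI summary: Proposes the FloorplanQA benchmark, which uses structured representations of indoor scenes to evaluate LLMs on spatial reasoning tasks such as distance measurement, visibility, path finding, and object placement, revealing blind spots in respecting physical constraints and preserving spatial coherence.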

Comments ICML 2026, Project page: https://OldDeLorean.github.io/FloorplanQA/

详情
AI中文摘要

我们引入了FloorplanQA,一个用于评估大语言模型空间推理能力的诊断基准。FloorplanQA基于室内场景的结构化表示,例如(厨房、客厅、卧室、浴室等),这些场景以JSON或XML布局进行符号编码。该基准涵盖了核心空间任务,包括距离测量、可见性、路径查找以及在受限空间内的物体放置。我们在各种前沿开源和商业大语言模型上的实验结果表明,虽然模型可能在浅层查询上成功,但它们往往无法遵守物理约束、保持空间一致性,尽管它们对小的空间扰动大多保持鲁棒。FloorplanQA揭示了当前大语言模型的一个盲点:对室内布局的不一致推理。我们希望这个基准能激发新的工作,使语言模型能够在实际场景中准确推断和操作空间与几何属性。

英文摘要

We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, bathrooms, and others), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed in shallow queries, they often fail to respect physical constraints and preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.

2507.05890 2026-05-26 cs.CL cs.AI

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

使用具有特质-反应中介的虚拟受访者进行心理测量项目验证

Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo

Affiliations: Graduate School of Data Science, Seoul National University; Department of Communication, Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University

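AI summary: Proposes a framework that uses LLMs to simulate virtual respondents (via mediators) to efficiently assess the validity of psychometric items; experiments show that the method effectively identifies high-validity items.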

Comments This paper has been accepted for publication at TACL 2026

详情
AI中文摘要

随着心理测量调查越来越多地用于评估大型语言模型(LLM)的特质,对适用于LLM的可扩展调查项目生成的需求也随之增长。这里的一个关键挑战是确保生成项目的构念效度,即它们是否真正测量了预期的特质。传统上,这需要昂贵的大规模人类数据收集。为了提高效率,我们提出了一个使用LLM进行虚拟受访者模拟的框架。我们的核心思想是考虑中介因素:通过它们,相同的特质可能对调查项目产生不同的反应。通过模拟具有不同中介因素的受访者,我们识别出那些在这些中介因素中与预期特质稳健相关的调查项目。在三种心理特质理论(大五人格、施瓦茨价值观、VIA性格优势)上的实验表明,我们的中介生成方法和模拟框架有效地识别了高有效性项目。LLM展示了从特质定义生成合理中介因素以及模拟受访者行为以进行项目验证的能力。我们的问题表述、指标、方法和数据集为成本效益高的调查开发以及更深入地理解LLM如何模拟人类调查反应开辟了新方向。我们发布数据集和代码以支持未来工作。

英文摘要

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that yield responses robustly correlated with intended traits across these mediators. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-efficient survey development and a deeper understanding of how LLMs simulate human survey responses. We release our dataset and code to support future work.

2507.03159 2026-05-26 cs.LG math.OC

MathOptAI.jl: Embed trained machine learning predictors into JuMP models

MathOptAI.jl: 将训练好的机器学习预测器嵌入JuMP模型

Oscar Dowson, Robert B Parker, Russel Bent

Affiliations: Dowson Farms; Los Alamos National Laboratory

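AI summary: Presents MathOptAI.jl, an open-source Julia library that embeds a variety of trained machine learning models (neural networks, decision trees, Gaussian processes) into JuMP optimization models and supports GPU acceleration for PyTorch models.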

详情
AI中文摘要

我们提出了\texttt{MathOptAI.jl},一个用于将训练好的机器学习预测器嵌入JuMP模型的开源Julia库。\texttt{MathOptAI.jl}可以将多种神经网络、决策树和高斯过程嵌入到更大的数学优化模型中。除了与一系列基于Julia的机器学习库(如\texttt{Lux.jl}和\texttt{Flux.jl})交互外,\texttt{MathOptAI.jl}还利用Julia的Python接口提供对PyTorch模型的支持。当PyTorch支持与\texttt{MathOptAI.jl}的灰盒公式结合时,与PyTorch模型相关的函数、雅可比矩阵和海森矩阵评估被卸载到Python中的GPU上,而其余的非线性预言机则在Julia中的CPU上评估。\texttt{MathOptAI.jl}可在https://github.com/lanl-ansi/MathOptAI.jl上获取,采用BSD-3许可证。

英文摘要

We present \texttt{MathOptAI.jl}, an open-source Julia library for embedding trained machine learning predictors into a JuMP model. \texttt{MathOptAI.jl} can embed a wide variety of neural networks, decision trees, and Gaussian Processes into a larger mathematical optimization model. In addition to interfacing with a range of Julia-based machine learning libraries such as \texttt{Lux.jl} and \texttt{Flux.jl}, \texttt{MathOptAI.jl} uses Julia's Python interface to provide support for PyTorch models. When the PyTorch support is combined with \texttt{MathOptAI.jl}'s gray-box formulation, the function, Jacobian, and Hessian evaluations associated with the PyTorch model are offloaded to the GPU in Python, while the rest of the nonlinear oracles are evaluated on the CPU in Julia. \texttt{MathOptAI.jl} is available at https://github.com/lanl-ansi/MathOptAI.jl under a BSD-3 license.

2506.21137 2026-05-26 cs.LG

Norm$\times$Direction: Restoring the Missing Query Norm in Vision Linear Attention

Norm×Direction:恢复视觉线性注意力中缺失的查询范数

Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Xin Li, Zheng Zhang

Affiliations: Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, China; The University of Queensland, Australia

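AI summary: To address the information loss caused by the cancelled query norm and by non-negativity constraints in linear attention, proposes NaLaFormer, based on a norm-direction decomposition, which injects the query norm to restore the spikiness of the attention distribution and uses cosine similarity to guarantee non-negativity, setting a new state of the art for linear attention on multiple tasks.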

详情
AI中文摘要

线性注意力缓解了softmax注意力的二次复杂度,但遭受了关键的表达能力损失。我们识别出两个主要原因:(1)归一化操作取消了查询范数,这打破了查询范数与softmax注意力中注意力分布的尖峰性(熵)之间的相关性。(2)强制非负性的标准技术通过抵消有效的内积交互导致破坏性的信息损失。为了解决这些挑战,我们引入了NaLaFormer,一种基于查询和键向量的范数×方向(ND)分解的新型线性注意力机制。我们利用每个分量解决一个不同的问题:查询范数被注入到我们的核中,以创建一个查询范数感知的映射,恢复注意力分布的尖峰性。方向向量通过基于几何的余弦相似度度量进行处理,该度量在保证非负性的同时保留了内积的丰富细粒度信息。我们通过全面的多模态评估验证了NaLaFormer,它在线性注意力上设立了新的最先进基准。我们的模型在ImageNet-1K上实现了高达7.5%的准确率提升,在ADE20K上实现了4.7%的mIoU改进,相比可比的基线。它展示了深刻的效率,在令牌密集的超分辨率任务(7万+令牌)中,将峰值内存减少了变革性的92.3%。NaLaFormer的通用性进一步得到证实,它在常识推理上超越了像Mamba这样的强基线,并在Long Range Arena(LRA)基准上设立了新的最先进水平。代码可在https://github.com/ZacharyMeng/NaLaFormer获取。

英文摘要

Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks the correlation between a query's norm and the spikiness (entropy) of the attention distribution as in softmax attention. (2) Standard techniques for enforcing non-negativity cause destructive information loss by nullifying valid inner-product interactions. To address these challenges, we introduce NaLaFormer, a novel linear attention mechanism built upon a norm$\times$direction (ND) decomposition of the query and key vectors. We leverage each component to solve a distinct problem: The query norm is injected into our kernel to create a query-norm-aware map that restores the attention distribution's spikiness. The direction vectors are processed by a geometric, cosine-based similarity metric that guarantees non-negativity while preserving the rich, fine-grained information of the inner product. We validate NaLaFormer through a comprehensive multi-modal evaluation, where it sets new state-of-the-art benchmarks for linear attention. Our model achieves up to a 7.5% accuracy gain on ImageNet-1K and a 4.7% mIoU improvement on ADE20K over comparable baselines. It demonstrates profound efficiency, reducing peak memory by a transformative 92.3% in token-intensive super-resolution tasks (70K+ tokens). NaLaFormer's versatility is further confirmed as it surpasses strong baselines like Mamba on common-sense reasoning and sets a new state-of-the-art on the Long Range Arena (LRA) benchmark. Code is available at https://github.com/ZacharyMeng/NaLaFormer .
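The first cause named in this abstract, that normalization cancels the query norm, can be checked numerically. The sketch below is not NaLaFormer; it assumes a plain ReLU feature map (a common choice for linear attention, and positively homogeneous) and compares it with softmax attention on toy vectors chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu_map(v):
    """Assumed feature map phi; ReLU satisfies phi(c * v) = c * phi(v) for c > 0."""
    return [max(x, 0.0) for x in v]

def linear_attention_weights(q, keys):
    """Normalized linear attention weights phi(q).phi(k_j) / sum_j phi(q).phi(k_j)."""
    phi_q = relu_map(q)
    sims = [dot(phi_q, relu_map(k)) for k in keys]
    total = sum(sims)
    return [s / total for s in sims]

def softmax_attention_weights(q, keys):
    logits = [dot(q, k) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q_scaled = [4.0 * x for x in q]  # same direction, four times the norm

lin_base = linear_attention_weights(q, keys)
lin_scaled = linear_attention_weights(q_scaled, keys)
soft_base = softmax_attention_weights(q, keys)
soft_scaled = softmax_attention_weights(q_scaled, keys)
```

Scaling the query leaves the linear attention weights unchanged because the scale factor appears in both numerator and denominator, while the softmax weights become more peaked (lower entropy). This is the norm-to-spikiness link that the abstract says is lost.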

2506.19037 2026-05-26 cs.CL cs.AI cs.IT cs.LG cs.NE math.IT

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

速度规划:用于掩码扩散语言模型的膨胀调度

Omer Luxembourg, Haim Permuter, Eliya Nachmani

Affiliations: School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beersheba, Israel

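AI summary: Proposes the Dilated Unmasking Scheduler (DUS), which partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel to minimize an upper bound on joint entropy gain, achieving up to 5.8× speedup without modifying the denoiser.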

Comments Accepted at ICML 2026

详情
AI中文摘要

掩码扩散语言模型(MDLM)承诺快速、非自回归的文本生成,然而现有的采样器根据模型置信度选择要解掩码的标记,忽略了并行解掩码多个位置时的交互,实际上退化为缓慢的自回归行为。我们提出了膨胀解掩码调度器(DUS),这是一种仅推理、无需规划模型的方法,它将序列位置划分为非相邻的膨胀组,并并行解掩码,以在每个去噪步骤中最小化联合熵增益的上界。通过明确权衡网络调用次数与生成质量,DUS恢复了传统并行解掩码策略下丢失的大部分性能。在数学(GSM8K, MATH500)、代码(HumanEval, MBPP)、通用知识(BBH, MMLU-Pro)和指令遵循(IFEval)基准测试中,DUS优于基于置信度的规划器,并将扩散特有的质量-速度权衡转化为由块大小$B$确定的确定性、可预测的加速,与逐标记MDLM解码相比,实现了高达5.8倍的墙钟加速,而无需修改底层去噪器。作为即插即用的后滤波器,膨胀间隔也改进了自适应采样器。代码可在https://github.com/omerlux/DUS获取。

英文摘要

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence-based planners and turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$, yielding up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding without modifying the underlying denoiser. Applied as a drop-in post-filter, dilated spacing also improves adaptive samplers. Code is available at https://github.com/omerlux/DUS.
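To make the phrase "non-adjacent dilated groups" concrete, here is a minimal sketch of one way to partition the positions of a block by stride. The function name, the fixed stride, and the one-group-per-step order are assumptions for illustration; the actual DUS schedule is defined in the paper and its repository and may differ.

```python
def dilated_groups(block_size: int, num_groups: int) -> list[list[int]]:
    """Split positions 0..block_size-1 into strided groups.

    Group g holds positions g, g + num_groups, g + 2 * num_groups, ...
    so with num_groups >= 2 no group contains two neighbouring positions.
    """
    if block_size < 1 or num_groups < 1:
        raise ValueError("block_size and num_groups must be positive")
    return [list(range(g, block_size, num_groups)) for g in range(num_groups)]

# Hypothetical usage: one group is unmasked in parallel per denoising step,
# so a block of 8 tokens needs 4 network calls instead of 8.
schedule = dilated_groups(block_size=8, num_groups=4)
```

The intuition in this sketch is that positions far apart in the sequence tend to depend less on each other, so revealing them in the same step should cost less quality than revealing neighbours together.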

2506.17629 2026-05-26 cs.CV cs.AI cs.CL

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

CLiViS: 通过语言-视觉协同释放认知地图用于具身视觉推理

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

Affiliations: School of Computer Science and Technology, East China Normal University; King Abdullah University of Science and Technology; Fudan University

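AI summary: Proposes the CLiViS framework, in which an LLM performs high-level task planning and orchestrates VLM-driven open-world visual perception, building a dynamic cognitive map that iteratively updates the scene context for training-free embodied visual reasoning.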

详情
AI中文摘要

具身视觉推理(EVR)旨在基于自我中心视频遵循复杂、自由形式的指令,从而在动态环境中实现语义理解和时空推理。尽管具有潜力,EVR面临复杂指令多样性和长期自我中心视频中复杂时空动态的挑战。现有解决方案要么在静态视频描述上使用大型语言模型(LLM),这通常会遗漏关键视觉细节,要么依赖端到端视觉语言模型(VLM),后者在逐步组合推理上存在困难。考虑到LLM在推理和VLM在感知方面的互补优势,我们提出了CLiViS。这是一个新颖的无训练框架,利用LLM进行高层任务规划,并协调VLM驱动的开放世界视觉感知,以迭代更新场景上下文。基于这种协同,CLiViS的核心是一个动态认知地图,它在推理过程中不断演化。该地图构建了具身场景的结构化表示,连接了低层感知和高层推理。跨多个基准的大量实验证明了CLiViS的有效性和通用性,特别是在处理长期视觉依赖方面。代码可在 https://github.com/Teacher-Tom/CLiViS 获取。

英文摘要

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2506.17326 2026-05-26 cs.LG stat.AP stat.ML

CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

CopulaSMOTE:基于Copula的过采样方法用于糖尿病预测中的不平衡分类

Agnideep Aich, Md Monzur Murshed, Bruce Wade, Sameera Hewage

Affiliations: Stanford University School of Medicine; Minnesota State University; University of Louisiana at Lafayette; Southern Utah University

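AI summary: Proposes CopulaSMOTE, which uses truncated vine copulas to model the joint dependence structure of the minority class when generating synthetic samples; evaluated with multiple classifiers on three diabetes datasets, it can improve minority-class recovery on large tabular datasets.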

详情
AI中文摘要

类别不平衡仍然是糖尿病等疾病临床预测模型开发中的一个实际障碍,其中确诊病例的数量通常远少于对照组。合成少数类过采样技术(SMOTE)及其变体被广泛用于解决这种不平衡,但它们通过特征空间中的局部插值生成合成观测值,并未显式建模少数类的联合依赖结构。为了解决这一挑战,我们的研究引入了一种基于copula的数据增强方法,该方法在生成合成样本时估计少数类的依赖结构,并与标准机器学习技术集成。具体来说,我们采用截断藤copula通过一系列双变量构建块来表示多元依赖。我们在三个公共糖尿病数据集上评估了所提出的方法,即Pima Indians糖尿病数据集、Iraqi糖尿病数据集和CDC BRFSS 2015糖尿病健康指标数据集,这些数据集涵盖了不同的样本量、维度和不平衡程度。对于每个数据集,使用5×2交叉验证协议和Dietterich配对t检验,在五个分类器上比较了五种重采样策略。我们的研究结果表明,CopulaSMOTE可以改善较大表格糖尿病数据集(尤其是CDC BRFSS数据集)中的少数类恢复,但其优势取决于分类器和评估指标。

英文摘要

Class imbalance remains a practical obstacle in the development of clinical prediction models for conditions such as diabetes mellitus, where the number of confirmed cases is often much smaller than the number of controls. The Synthetic Minority Over-sampling Technique (SMOTE) and its variants are widely used to address this imbalance, but they generate synthetic observations through local interpolation in feature space and do not explicitly model the joint dependence structure of the minority class. To address this challenge, our study introduces a copula-based data augmentation approach that estimates the minority-class dependence structure when generating synthetic samples and integrates with standard machine learning techniques. Specifically, we employ truncated vine copulas to represent multivariate dependence through a sequence of bivariate building blocks. We evaluate the proposed approach on three public diabetes datasets, namely the Pima Indians Diabetes dataset, the Iraqi Diabetes dataset, and the CDC BRFSS 2015 Diabetes Health Indicators dataset, which together cover a range of sample sizes, dimensionalities, and imbalance regimes. For each dataset, five resampling strategies are compared across five classifiers using a 5 by 2 cross validation protocol with Dietterich's paired t test. Our findings suggest that CopulaSMOTE can improve minority-class recovery in larger tabular diabetes datasets, particularly the CDC BRFSS dataset, but its advantages depend on the classifier and evaluation metric.
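The evaluation protocol mentioned here, 5×2 cross-validation with Dietterich's paired t-test, reduces to a short formula once the ten per-fold score differences between two methods are available. The sketch below computes only the test statistic; the function name and the example differences are illustrative assumptions, not results from the paper.

```python
import math

def five_by_two_cv_t(diffs: list[tuple[float, float]]) -> float:
    """Dietterich's 5x2cv paired t statistic.

    diffs[i] = (p_i1, p_i2) holds the score difference between two methods
    on the two folds of replication i. The statistic is compared against a
    t distribution with 5 degrees of freedom.
    """
    if len(diffs) != 5:
        raise ValueError("expected exactly 5 replications of 2-fold CV")
    variance_sum = 0.0
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2.0
        variance_sum += (p1 - mean) ** 2 + (p2 - mean) ** 2
    # Undefined (division by zero) if every replication has identical fold differences.
    return diffs[0][0] / math.sqrt(variance_sum / 5.0)

# Made-up differences, for illustration only.
example = [(0.02, 0.04)] * 5
t_stat = five_by_two_cv_t(example)
```

Only the first fold difference appears in the numerator, a property of Dietterich's original construction; the five per-replication variances are pooled in the denominator.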

2506.11027 2026-05-26 cs.LG cs.AI cs.PL

From Reasoning to Code: GRPO Optimization for Underrepresented Languages

从推理到代码:针对代表性不足语言的GRPO优化

Federico Pennino, Bianca Raimondi, Massimo Rondelli, Andrea Gurioli, Maurizio Gabbrielli

Affiliations: Qwen2.5-Coder

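AI summary: Proposes a reinforcement learning approach that combines small Qwen2.5-Coder models with GRPO, using execution feedback and a reward mechanism to improve code-generation accuracy and reasoning quality for low-resource languages such as Prolog and Lisp.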

Comments Accepted ICLP 2026

详情
AI中文摘要

使用大型语言模型(LLM)生成准确且可执行的代码对于代表性不足的编程语言(如Prolog和Lisp)仍然是一个重大挑战,因为与Python等高资源语言相比,公共训练数据稀缺。本文介绍了一种可泛化的强化学习(RL)方法,将Qwen2.5-Coder模型的小规模版本与组相对策略优化(GRPO)相结合,通过推理实现有效的代码生成。为了解决稀疏数据集的局限性,我们将执行驱动的反馈直接集成到RL循环中,利用一个奖励系统,该系统同时利用逻辑正确性和结构格式。在GSM8K数据集上的实验结果表明,在代表性不足的语言中,推理质量和代码准确性有显著提升。这些发现强调了我们的方法通过利用符号推理和基于解释器的反馈,使缺乏广泛训练资源的多种编程语言受益的潜力。

英文摘要

Generating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming languages, such as Prolog and Lisp, due to the scarcity of public training data compared to high-resource languages like Python. This paper introduces a generalizable Reinforcement Learning (RL) approach that combines small-scale versions of the Qwen2.5-Coder model with Group Relative Policy Optimization (GRPO) to enable effective code generation through reasoning. To address the limitations of sparse datasets, we integrate execution-driven feedback directly into the RL loop, utilizing a reward system that exploits both logical correctness and structural formatting. Experimental results on the GSM8K dataset demonstrate significant improvements in reasoning quality and code accuracy across underrepresented languages. These findings underscore the potential of our approach to benefit a wide range of programming languages lacking extensive training resources by leveraging symbolic reasoning and interpreter-based feedback.
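The core of GRPO that this abstract relies on is the group-relative advantage: several completions are sampled for one prompt, each is scored, and rewards are standardized within the group. The sketch below shows only that step. The reward weights and the two boolean signals are hypothetical stand-ins for the paper's correctness and formatting rewards, whose exact definition is not given in the abstract.

```python
import statistics

def combined_reward(is_correct: bool, is_well_formatted: bool,
                    w_correct: float = 1.0, w_format: float = 0.5) -> float:
    """Hypothetical reward mixing execution correctness and output format."""
    return w_correct * float(is_correct) + w_format * float(is_well_formatted)

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled programs for the same prompt: (ran correctly, followed format).
outcomes = [(True, True), (False, True), (False, False), (True, True)]
rewards = [combined_reward(c, f) for c, f in outcomes]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed. In the setting described above, the correctness signal would presumably come from running the generated program in an interpreter.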

2506.10689 2026-05-26 cs.CV

Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

通过多任务和多年龄方法在无约束图像中筛查未成年人的未成年人检测

Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral

Affiliations: Department of Electrical, Systems and Automation Engineering

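AI summary: Proposes a multi-task architecture built on a frozen FaRL vision-language backbone and a compact two-layer MLP, combined with an α-reweighted focal loss and age-balanced sampling, to accurately detect minors in unconstrained images, with significant gains on a new benchmark.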

English abstract

Accurate automatic screening of minors in unconstrained images requires models that are robust to distribution shift and resilient to the under-representation of children in public datasets. To address these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks, based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). This design focuses on the legally critical age range while keeping the backbone frozen. Class imbalance is mitigated through an $α$-reweighted focal loss and age-balanced mini-batch sampling, while an age gap removes ambiguous samples near thresholds. Evaluation is conducted on our new Overall Underage Benchmark (303k cleaned training images, 110k test images), which defines both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild-shifts test "ASWIFT-20k" of 20k images, which stresses extreme poses (>45°), expressions, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" reduces the mean absolute error on ASORES-39k from 4.175 y (age-only baseline) to 4.068 y and improves the under-18 detection F2 score from 0.801 to 0.857 at a 1% false-adult rate. On ASWIFT-20k, the same configuration nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, demonstrating robustness to domain shift.
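
To make the loss and the age gap concrete, here is a minimal sketch of a per-sample α-weighted binary focal loss (in the standard form of Lin et al.) and of label assignment for the four underage heads. The values `alpha=0.25`, `gamma=2.0`, and `gap=1.0`, and the reading of "age gap" as ignoring samples within `gap` years of a threshold, are assumptions for illustration rather than the paper's settings.

```python
import math
from typing import List, Optional, Sequence


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Alpha-weighted focal loss for one binary prediction.

    p is the predicted probability of the positive (underage) class,
    y is the 0/1 label. Easy examples are down-weighted by (1 - p_t)^gamma.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)        # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def underage_labels(age: float,
                    thresholds: Sequence[float] = (12, 15, 18, 21),
                    gap: float = 1.0) -> List[Optional[int]]:
    """One label per head: 1 if under the threshold, 0 if over,
    None if the age lies inside the gap and the sample is skipped."""
    labels: List[Optional[int]] = []
    for t in thresholds:
        if abs(age - t) < gap:
            labels.append(None)              # ambiguous near the threshold
        else:
            labels.append(1 if age < t else 0)
    return labels
```

With these assumed settings a 17.5-year-old contributes to the 12, 15, and 21 heads but is left out of the 18 head, where the label would be least reliable.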

2506.06454 2026-05-26 cs.LG cs.AI stat.ML

LETS Forecast: Learning Embedology for Time Series Forecasting

Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath Namburi GNVV, Nada Magdi Elkordi, Yin Li

Affiliations * Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison; Department of Computer Sciences, University of Wisconsin-Madison

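AI summary Proposes DeepEDM, a framework that combines nonlinear dynamical systems modeling with deep learning, learning latent dynamics through time-delay embeddings and kernel regression for accurate time series forecasting.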

Comments Accepted at International Conference on Machine Learning (ICML) 2025

English abstract

Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings and employs kernel regression to approximate the underlying dynamics, while leveraging an efficient implementation of softmax attention, allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: https://abrarmajeedi.github.io/deep_edm.
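
The link between kernel regression and softmax attention can be seen in a small sketch of classical EDM-style forecasting: delay-embed the series, treat the latest embedding as the query and past embeddings as keys, and average their successor values with softmax weights over negative squared distances. This is not DeepEDM itself (there is no learned latent space here), and the function names, `dim`, `tau`, and `bandwidth` are illustrative assumptions.

```python
import math
from typing import List, Sequence


def delay_embed(series: Sequence[float], dim: int, tau: int = 1) -> List[List[float]]:
    """Time-delay embedding: row for time t is [x_t, x_{t-tau}, ..., x_{t-(dim-1)tau}]."""
    start = (dim - 1) * tau
    return [[series[t - j * tau] for j in range(dim)]
            for t in range(start, len(series))]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kernel_forecast(series: Sequence[float], dim: int = 3, tau: int = 1,
                    bandwidth: float = 0.01) -> float:
    """One-step forecast by Gaussian-kernel regression in delay space."""
    emb = delay_embed(series, dim, tau)
    query, keys = emb[-1], emb[:-1]
    start = (dim - 1) * tau
    # The value paired with the key at time t is the next observation x_{t+1}.
    values = [series[start + i + 1] for i in range(len(keys))]
    scores = [-sum((q - k) ** 2 for q, k in zip(query, key)) / bandwidth
              for key in keys]
    weights = softmax(scores)               # attention weights over past states
    return sum(w * v for w, v in zip(weights, values))
```

Past states that resemble the current one dominate the weighted average, which is the same computation a single attention head performs when the scores are negative squared distances.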

2506.04805 2026-05-26 cs.LG

Adaptive Preconditioners Trigger Loss Spikes in Adam

Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu

Affiliations * Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University; School of Mathematical Sciences, Shanghai Jiao Tong University; MemTensor (Shanghai) Technology Co., Ltd.; Institute for Advanced Algorithms Research, Shanghai; Shanghai Seres Information Technology Co., Ltd, Shanghai 200040, China

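AI summary By analyzing the internal dynamics of Adam's second-moment estimator, identifies a decoupling between the adaptive preconditioner and the instantaneous squared gradients as the cause of loss spikes, and proposes a spike predictor derived from a quadratic approximation analysis.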

Comments Accepted to ICML 2026

English abstract

Loss spikes commonly emerge during neural network training with the Adam optimizer across diverse architectures and scales, yet their underlying mechanism remains elusive. While previous explanations attribute these phenomena to sharper loss landscapes at lower loss, we show that landscape geometry alone is insufficient to explain the phenomenon. In this work, we pinpoint the root cause in the internal dynamics of Adam's second moment estimator. We identify a critical "decoupling" mechanism where the adaptive preconditioner $v_t$ fails to track the instantaneous squared gradients $g_t^2$, causing the adaptive mechanism to effectively fail. This decoupling allows the preconditioner to decay autonomously despite rising gradients, which pushes the maximum eigenvalue of the preconditioned Hessian beyond the stability threshold $2/η$ for sustained periods, manifesting as dramatic loss spikes. Through a quadratic approximation analysis, we theoretically and experimentally characterize five distinct stages of spike evolution and propose a predictor for anticipating spikes based on gradient-directional curvature. We empirically find that the proposed loss spike mechanism, although derived from simplified models, generalizes well to practical scenarios ranging from small neural networks to large-scale Transformers.
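
A one-dimensional toy calculation may help picture the decay described above. The sketch below applies the Adam second-moment update $v_t = β_2 v_{t-1} + (1-β_2) g_t^2$ with a gradient held at a fixed value, and counts how many steps pass before the preconditioned curvature $λ/(\sqrt{v_t}+ε)$ exceeds $2/η$. Momentum and bias correction are ignored, the gradient is held constant, and all numeric values are assumptions for illustration; this is not the paper's analysis.

```python
import math
from typing import Optional


def second_moment_update(v: float, g: float, beta2: float = 0.999) -> float:
    """Adam's exponential moving average of squared gradients."""
    return beta2 * v + (1.0 - beta2) * g * g


def preconditioned_curvature(lam: float, v: float, eps: float = 1e-8) -> float:
    """Curvature lam of a 1-D quadratic after dividing by sqrt(v) + eps."""
    return lam / (math.sqrt(v) + eps)


def steps_until_threshold(lam: float, v0: float, lr: float, g: float = 0.0,
                          beta2: float = 0.999,
                          max_steps: int = 100_000) -> Optional[int]:
    """Number of updates before lam / (sqrt(v) + eps) exceeds 2 / lr,
    with the gradient held at g. Returns None if it never does."""
    v = v0
    threshold = 2.0 / lr
    for t in range(max_steps + 1):
        if preconditioned_curvature(lam, v) > threshold:
            return t
        v = second_moment_update(v, g, beta2)
    return None
```

When `g * g` is far below `v`, the update reduces to `v ≈ beta2 * v`, so `v` shrinks geometrically no matter what the curvature is. When `g * g` matches `v`, the estimate stays put and the threshold is never crossed in this toy setting.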