2506.06683 2026-03-10 cs.RO cs.AI

RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks

Shiying Duan, Pei Ren, Nanxiang Jiang, Zhengping Che, Jian Tang, Zhaoxin Fan, Yifan Sun, Wenjun Wu

Comments Accepted to ICLR 2026

Abstract

Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration. To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning. RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments demonstrate that RoboPARA significantly outperforms existing planning methods, achieving higher efficiency and reliability, particularly in complex task combinations. Our code is publicly available at https://github.com/AiDuanshiying/RoboPARA.
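As a rough illustration of the second stage described above, the sketch below greedily schedules a task dependency graph onto two arms so that independent tasks run in parallel. It is not RoboPARA's algorithm: the function `schedule_two_arms`, the task names, and the durations are all hypothetical, and the greedy rule (start the earliest-ready task on whichever arm can begin it soonest) is an assumption chosen for brevity.

```python
def earliest_start(task, deps, plan):
    """A task may start once all of its prerequisites have finished."""
    return max((plan[p][2] for p in deps.get(task, [])), default=0.0)


def schedule_two_arms(durations, deps):
    """Greedy list scheduling of a task DAG onto two arms.

    durations: {task: time}; deps: {task: [prerequisite tasks]}.
    Returns {task: (arm, start, end)}.
    """
    remaining = set(durations)
    arm_free = [0.0, 0.0]  # time at which each arm becomes idle
    plan = {}
    while remaining:
        # Tasks whose prerequisites have all been scheduled already.
        ready = [t for t in remaining if all(p in plan for p in deps.get(t, []))]
        if not ready:
            raise ValueError("dependency cycle detected")
        # Pick the ready task that can start earliest (ties broken by name).
        task = min(ready, key=lambda t: (earliest_start(t, deps, plan), t))
        ready_at = earliest_start(task, deps, plan)
        # Assign it to the arm on which it can begin soonest.
        arm = min((0, 1), key=lambda a: max(arm_free[a], ready_at))
        start = max(arm_free[arm], ready_at)
        end = start + durations[task]
        plan[task] = (arm, start, end)
        arm_free[arm] = end
        remaining.remove(task)
    return plan


# Hypothetical tasks: pouring needs both the cup and the kettle in hand.
durations = {"grasp_cup": 2, "grasp_kettle": 3, "pour": 4, "wipe": 2}
deps = {"pour": ["grasp_cup", "grasp_kettle"]}
plan = schedule_two_arms(durations, deps)
makespan = max(end for _, _, end in plan.values())
print(plan, makespan)
```

In this toy instance the two grasps run on different arms at the same time, so the plan finishes sooner than executing the four tasks one after another.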

2506.06570 2026-03-10 cs.RO

Unsupervised Discovery of Failure Taxonomies from Deployment Logs

Aryaman Gupta, Yusuf Umut Ciftci, Somil Bansal

Abstract

As robotic systems become increasingly integrated into real-world environments, ranging from autonomous vehicles to household assistants, they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving system robustness. However, manually analyzing large-scale failure datasets is impractical and does not scale. In this work, we introduce the problem of unsupervised discovery of failure taxonomies from large volumes of raw failure logs, aiming to obtain semantically coherent and actionable failure modes directly from perceptual trajectories. Our approach first infers structured failure explanations from multimodal inputs using vision-language reasoning, and then performs clustering in the resulting semantic reasoning space, enabling the discovery of recurring failure modes rather than isolated episode-level descriptions. We evaluate our method across robotic manipulation, indoor navigation, and autonomous driving domains, and demonstrate that the discovered taxonomies are consistent, interpretable, and practically useful. In particular, we show that structured failure taxonomies guide targeted data collection for offline policy refinement and enhance runtime failure monitoring systems. Website: https://mllm-failure-clustering.github.io/

2506.05587 2026-03-10 cs.AI cs.CL cs.DB cs.LG

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish

Comments Full version of a paper accepted at NeurIPS 2025; Code and data available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU

Abstract

Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69% and 57% respectively, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.

2505.23932 2026-03-10 cs.CL

SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving

Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Z. Morley Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong

Comments The paper has been accepted as an oral presentation at ICLR 2026

Abstract

We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. More details are available on our project page: swing-bench.github.io

2505.21289 2026-03-10 cs.LG math.OC

LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning

Nurbek Tastan, Stefanos Laskaridis, Martin Takac, Karthik Nandakumar, Samuel Horvath

Comments Accepted to ICLR 2026

Abstract

Large pre-trained models are commonly adapted to downstream tasks using parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices instead of updating all weights. While LoRA dramatically reduces trainable parameters with little overhead, it can still underperform full fine-tuning in accuracy and often converges more slowly. We introduce LoFT, a novel low-rank adaptation method that behaves like full fine-tuning by aligning the optimizer's internal dynamics with those of updating all model weights. LoFT not only learns weight updates in a low-rank subspace (like LoRA) but also properly projects the optimizer's first and second moments (Adam's momentum and variance) into the same subspace, mirroring full-model updates. By aligning the low-rank update itself with the full update, LoFT eliminates the need for tuning extra hyperparameters, e.g., the LoRA scaling factor $α$. Empirically, this approach substantially narrows the performance gap between adapter-based tuning and full fine-tuning and consistently outperforms standard LoRA-style methods, all without increasing inference cost.
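For reference, the standard LoRA parameterization that the abstract contrasts against can be written as follows. This is the usual textbook form, not LoFT's update rule; the symbols $W_0$, $B$, $A$, $d$, $k$, and $r$ are notation assumed here, and only the scaling factor $\alpha$ is named in the abstract.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $B$ and $A$ are trained, so the trainable parameter count drops from $dk$ to $r(d + k)$. The factor $\alpha / r$ is the extra hyperparameter that LoFT is said to make unnecessary.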

2505.18907 2026-03-10 cs.AI cs.LG

Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations

Sanjay Kariyappa, G. Edward Suh

Abstract

Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that encode the privilege information. Our evaluations across multiple models and training methods reveal that our proposal yields between $1.6\times$ and $9.2\times$ reduction in attack success rate on gradient-based prompt injection attacks compared to state-of-the-art methods, without significantly degrading the model's utility.
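The idea of re-injecting the privilege signal at every layer, rather than only at the input, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `make_layer_embeddings`, `inject_privilege`, and `forward` are hypothetical names, the embeddings are randomly initialized stand-ins for trained parameters, and `layer_fn` is a placeholder for a real Transformer block.

```python
import random


def make_layer_embeddings(num_layers, num_levels, dim, seed=0):
    """One embedding table per layer: tables[layer][privilege_level] -> vector.

    In a real model these would be trainable parameters; here they are
    just small random values.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_levels)]
        for _ in range(num_layers)
    ]


def inject_privilege(hidden, privilege, layer_table):
    """Add the layer-specific privilege embedding to each token vector."""
    return [
        [h + e for h, e in zip(vec, layer_table[level])]
        for vec, level in zip(hidden, privilege)
    ]


def forward(hidden, privilege, tables, layer_fn):
    """Inject the privilege signal before every layer, not only the first."""
    for table in tables:
        hidden = inject_privilege(hidden, privilege, table)
        hidden = layer_fn(hidden)
    return hidden


# Two tokens: the first from a privileged source (level 0), the second from
# untrusted input (level 1). The level numbering is an assumption.
hidden = [[1.0, 2.0], [3.0, 4.0]]
privilege = [0, 1]
tables = make_layer_embeddings(num_layers=2, num_levels=2, dim=2)
out = forward(hidden, privilege, tables, layer_fn=lambda h: h)  # identity layers
print(out)
```

With identity layers, each output vector is simply the input plus the sum of that token's per-layer privilege embeddings, which makes the repeated injection easy to see.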

2505.14996 2026-03-10 cs.CL cs.AI cs.LG

MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin, Caiming Xiong, Shafiq Joty

Comments SEA@NeurIPS (Oral) 2025

Abstract

Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference, while also removing the flexibility to reduce to simpler systems. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic problem decomposition and agent composition through meta-feedback on solvability and completeness, and reduction to simpler systems when appropriate. Experiments across reasoning (math and graduate-level QA), coding, and agentic (search-based) benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-ZERO outperforms strong manual and automatic MAS baselines. It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.

2505.14357 2026-03-10 cs.CV cs.LG

Vid2World: Crafting Video Diffusion Models to Interactive World Models

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long

Comments Project page: http://knightnemo.github.io/vid2world/

Abstract

World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores video diffusion causalization, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a causal action guidance mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.

2505.13109 2026-03-10 cs.LG cs.AI cs.CL

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao

Abstract

Large language models (LLMs) are widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, a training-free algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency, enabling effective overlap with computation, full latency hiding, and practical speedups from speculative recall. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to a 13$\times$ speedup compared to SOTA KV retrieval methods. Code is available at https://github.com/sjtu-zhao-lab/FreeKV.

2505.11709 2026-03-10 cs.CV cs.LG cs.RO

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, Jian Zhang

Comments ICLR 2026

Abstract

Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download at https://github.com/apple/ml-egodex.

2505.10845 2026-03-10 cs.LG cs.AI

Ready2Unlearn: A Learning-Time Approach for Preparing Models with Future Unlearning Readiness

Hanyu Duan, Yi Yang, Ahmed Abbasi, Kar Yan Tam

Abstract

Machine unlearning is the process of removing the imprint left by specific data samples during the training of a machine learning model. AI developers, including those building personalized technologies, employ machine unlearning for various purposes such as privacy protection, security, and addressing ethical concerns. This paper introduces Ready2Unlearn, a learning-time optimization approach designed to facilitate future unlearning processes. Unlike the majority of existing unlearning efforts that focus on designing unlearning algorithms, which are typically implemented reactively when an unlearning request is made during the model deployment phase, Ready2Unlearn shifts the focus to the training phase, adopting a "forward-looking" perspective. Building upon well-established meta-learning principles, Ready2Unlearn proactively trains machine learning models with unlearning readiness, such that they are well prepared and can handle future unlearning requests in a more efficient and principled manner. Ready2Unlearn is model-agnostic and compatible with any gradient ascent-based machine unlearning algorithm. We evaluate the method on both language and vision tasks under various unlearning settings, including class-wise unlearning and random data unlearning. Experimental results show that by incorporating such preparedness at training time, Ready2Unlearn produces an unlearning-ready model state, which offers several key advantages when future unlearning is requested. We hope this study inspires future research on proactive strategies for equipping machine learning models with built-in unlearning readiness, particularly in modern information systems that rely heavily on user data for recommendation, search, and personalized services, where privacy risks and data deletion demands are increasingly prevalent.

2505.10238 2026-03-10 cs.CV

MTVCraft: Tokenizing 4D Motion for Arbitrary Character Animation

Yanbo Ding, Xirui Hu, Zhizhi Guo, Yan Zhang, Xinrui Wang, Zhixiang He, Chi Zhang, Yali Wang, Xuelong Li

Abstract

Character image animation has rapidly advanced with the rise of digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 4D information for open-world animation. To address this, we propose MTVCraft (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for character image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatial-temporal cues and avoid strict pixel-level alignment between pose images and the character, enabling more flexible and disentangled control. Next, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for character image animation in the complex 4D world. We implement MTVCraft on both CogVideoX-5B (small scale) and Wan-2.1-14B (large scale), demonstrating that our framework is easily scalable and can be applied to models of varying sizes. Experiments on the TikTok and Fashion benchmarks demonstrate our state-of-the-art performance. Moreover, powered by robust motion tokens, MTVCraft showcases unparalleled zero-shot generalization. It can animate arbitrary characters in full-body and half-body forms, and even non-human objects across diverse styles and scenarios. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided video generation. Our project page is available at https://github.com/DINGYANB/MTVCrafter. A scaled version has been commercially deployed and is available at https://telestudio.teleagi.cn/generatevideo/creativeWorkshop.

2505.07365 2026-03-10 cs.SD cs.AI cs.CL cs.MM eess.AS

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro

Comments Dataset: https://huggingface.co/datasets/PeacefulData/2025_DCASE_AudioQA_Official DCASE Task-5 challenge: dcase.community/challenge2025/task-audio-question-answering. Accepted to ICASSP 2026

Abstract

We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which is crucial for enabling AI agents to perceive and interact with the world effectively.

2505.06746 2026-03-10 cs.RO cs.CV

M3CAD: Towards Generic Cooperative Autonomous Driving Benchmark

Morui Zhu, Yongqi Zhu, Yihao Zhu, Qi Chen, Deyuan Qu, Song Fu, Qing Yang

Comments Accepted to ICRA 2026

Abstract

We introduce M$^3$CAD, a comprehensive benchmark designed to advance research in generic cooperative autonomous driving. M$^3$CAD comprises 204 sequences with 30,000 frames. Each sequence includes data from multiple vehicles and different types of sensors, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M$^3$CAD to support both single-vehicle and multi-vehicle cooperative autonomous driving research. To the best of our knowledge, M$^3$CAD is the most complete benchmark specifically designed for cooperative, multi-task autonomous driving research. To test its effectiveness, we use M$^3$CAD to evaluate both state-of-the-art single-vehicle and cooperative driving solutions, setting baseline performance results. Since most existing cooperative perception methods focus on merging features but often ignore network bandwidth requirements, we propose a new multi-level fusion approach which adaptively balances communication efficiency and perception accuracy based on the current network conditions. We release M$^3$CAD, along with the baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available on https://github.com/zhumorui/M3CAD

2505.00400 2026-03-10 cs.RO

Holistic Optimization of Modular Robots

Matthias Mayer, Matthias Althoff

Comments 14 Pages, 6 figures, 8 tables. Please find and reference the open-access published version at https://ieeexplore.ieee.org/document/11227125

Journal ref in IEEE Transactions on Automation Science and Engineering, vol. 23, pp. 2703-2716, 2026

Abstract

Modular robots have the potential to revolutionize automation, as one can optimize their composition for any given task. However, finding optimal compositions is non-trivial. In addition, different compositions require different base positions and trajectories to fully use the potential of modular robots. We address this problem holistically for the first time by jointly optimizing the composition, base placement, and trajectory to minimize the cycle time of a given task. Our approach is evaluated on over 300 industrial benchmarks requiring point-to-point movements. Overall, we reduce cycle time by up to 25% and find feasible solutions in twice as many benchmarks compared to optimizing the module composition alone. In the first real-world validation of modular robots optimized for point-to-point movement, we find that the optimized robot is successfully deployed in nine out of ten cases in less than an hour.

2504.19577 2026-03-10 cs.RO

Smart placement, faster robots - a comparison of algorithms for robot base-pose optimization

Matthias Mayer, Matthias Althoff

Comments 10 pages, 3 Figures, 1 Table. Find visualizations and source code at https://cobra.cps.cit.tum.de/tools/rbo. Supplementary Tables can be found at https://www.frontiersin.org/journals/manufacturing-technology/articles/10.3389/fmtec.2025.1642524/full

Journal ref Front. Manuf. Technol. 5:1642524

Abstract

Robotic automation is a key technology that increases the efficiency and flexibility of manufacturing processes. However, one of the challenges in deploying robots in novel environments is finding the optimal base pose for the robot, which affects its reachability and deployment cost. Yet, existing research on automatically optimizing the base pose of robots has not been compared. We address this problem by optimizing the base pose of industrial robots with Bayesian optimization (BO), exhaustive search (ES), genetic algorithms (GAs), and stochastic gradient descent (SGD), and we find that all algorithms can reduce the cycle time for various evaluated tasks in synthetic and real-world environments. Stochastic gradient descent shows superior performance with regard to the success rate, solving more than 90% of our real-world tasks, while genetic algorithms show the lowest final costs. All benchmarks and implemented methods are available as baselines against which novel approaches can be compared.

2504.19199 2026-03-10 cs.LG

Learning to Rank Critical Road Segments via Heterogeneous Graphs with Origin-Destination Flow Integration

Ming Xu, Jinrong Xiang, Zilong Xie, Xiangfu Meng

Abstract

Existing learning-to-rank methods for road networks often fail to incorporate origin-destination (OD) flows and route information, limiting their ability to model long-range spatial dependencies. To address this gap, we propose HetGL2R, a heterogeneous graph learning framework for ranking road-segment importance. HetGL2R builds a tripartite graph that unifies OD flows, routes, and network topology, and further introduces attribute-guided graphs that elevate node attributes into explicit nodes to model functional similarity. A heterogeneous joint random walk algorithm (HetGWalk) jointly samples both graph types to generate context-rich node sequences. These sequences are encoded using a Transformer to learn embeddings that capture long-range structural dependencies induced by OD flows and route configurations, as well as functional associations derived from attribute similarity. Finally, a listwise ranking strategy with a KL-divergence loss evaluates and ranks segment importance. Experiments on three SUMO-generated simulated networks of different scales show that, against state-of-the-art methods, HetGL2R achieves average improvements of approximately 7.52%, 4.40% and 3.57% in ranking performance.
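A listwise ranking loss based on KL divergence can be sketched as follows. This is a generic ListNet-style formulation and not necessarily the exact loss used in HetGL2R: the softmax normalization, the direction of the divergence, and the names `softmax` and `listwise_kl_loss` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_kl_loss(pred_scores, true_scores):
    """KL(P_true || P_pred) between softmax-normalized score lists.

    Both lists refer to the same road segments in the same order. The loss
    is zero when the predicted distribution matches the target exactly and
    grows as the predicted ordering diverges from it.
    """
    p = softmax(true_scores)
    q = softmax(pred_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical importance scores for three road segments.
true_importance = [3.0, 2.0, 1.0]
print(listwise_kl_loss([2.5, 2.0, 0.5], true_importance))  # roughly correct order
print(listwise_kl_loss([0.5, 2.0, 2.5], true_importance))  # reversed order
```

Because the loss compares whole score distributions rather than individual pairs, a prediction that reverses the ranking is penalized much more heavily than one that only perturbs the scores.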

2504.09587 2026-03-10 cs.RO

GeoNav: Empowering MLLMs with dual-scale geospatial reasoning for language-goal aerial navigation

Haotian Xu, Yue Hu, Chen Gao, Zhengqiu Zhu, Yong Zhao, Yong Li, Quanjun Yin

Comments Published in Pattern Recognition (2026)

Journal ref Pattern Recognition, Volume 177, 113365, 2026

Abstract

Language-goal aerial navigation requires UAVs to localize targets in complex outdoor environments, such as urban blocks, based on textual instructions. Indoor methods are often hard to scale to urban scenes due to ambiguous objects, a limited visual field, and challenging spatial reasoning. In this work, we propose GeoNav, a multi-modal agent for long-range aerial navigation with geospatial awareness. GeoNav operates in three phases (landmark navigation, target search, and precise localization), mimicking human coarse-to-fine spatial reasoning patterns. To support such reasoning, it dynamically builds dual-scale spatial representations. The first is a global but schematic cognitive map, which fuses prior geographic knowledge and embodied visual cues into a top-down, explicitly annotated form. It enables fast navigation to the landmark region via intuitive map-based reasoning. The second is a local but detailed scene graph representing hierarchical spatial relationships between landmarks and objects, used for accurate target localization. On top of this structured memory, GeoNav employs a spatial chain-of-thought mechanism to equip MLLMs with efficient and interpretable decision-making across stages. On the CityNav benchmark, GeoNav surpasses the current SOTA by up to 18.4% in success rate and significantly reduces navigation error. The ablation studies highlight the importance of each module, positioning structured spatial perception as the key to advanced UAV navigation. Published in Pattern Recognition, 2026.

2504.09156 2026-03-10 cs.CV

LEL: Lipschitz Continuity Constrained Ensemble Learning for Efficient EEG-Based Intra-subject Emotion Recognition

Shengyu Gong, Yueyang Li, Zijian Kang, Bo Chai, Weiming Zeng, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang

Journal ref IEEE Sensors Journal, 2026

Abstract

Accurate and efficient recognition of emotional states is critical for human social functioning, and impairments in this ability are associated with significant psychosocial difficulties. While electroencephalography (EEG) offers a powerful tool for objective emotion detection, existing EEG-based Emotion Recognition (EER) methods suffer from three key limitations: (1) insufficient model stability, (2) limited accuracy in processing high-dimensional nonlinear EEG signals, and (3) poor robustness against intra-subject variability and signal noise. To address these challenges, we introduce Lipschitz continuity-constrained Ensemble Learning (LEL), a novel framework that enhances EEG-based emotion recognition by enforcing Lipschitz continuity constraints on Transformer-based attention mechanisms, spectral extraction, and normalization modules. This constraint ensures model stability, reduces sensitivity to signal variability and noise, and improves generalization capability. Additionally, LEL employs a learnable ensemble fusion strategy that optimally combines decisions from multiple heterogeneous classifiers to mitigate single-model bias and variance. Extensive experiments on three public benchmark datasets (EAV, FACED, and SEED) demonstrate superior performance, achieving average recognition accuracies of 74.25%, 81.19%, and 86.79%, respectively. The official implementation code is available at https://github.com/NZWANG/LEL.

2504.09021 2026-03-10 cs.LG

A Champion-level Vision-based Reinforcement Learning Agent for Competitive Racing in Gran Turismo 7

Hojoon Lee, Takuma Seno, Jun Jet Tai, Kaushik Subramanian, Kenta Kawamoto, Peter Stone, Peter R. Wurman

Comments Accepted for Publication at the IEEE Robotics and Automation Letters (RA-L) 2025

Abstract

Deep reinforcement learning has achieved superhuman racing performance in high-fidelity simulators like Gran Turismo 7 (GT7). It typically utilizes global features that require instrumentation external to a car, such as precise localization of agents and opponents, limiting real-world applicability. To address this limitation, we introduce a vision-based autonomous racing agent that relies solely on ego-centric camera views and onboard sensor data, eliminating the need for precise localization during inference. This agent employs an asymmetric actor-critic framework: the actor uses a recurrent neural network with the sensor data local to the car to retain track layouts and opponent positions, while the critic accesses the global features during training. Evaluated in GT7, our agent consistently outperforms GT7's built-in drivers. To our knowledge, this work presents the first vision-based autonomous racing agent to demonstrate champion-level performance in competitive racing scenarios.

2504.06987 2026-03-10 cs.LG cs.AI

Enhancing Metabolic Syndrome Prediction with Hybrid Data Balancing and Counterfactuals

Sanyam Paresh Shah, Abdullah Mamun, Shovito Barua Soumma, Hassan Ghasemzadeh

Comments Accepted at the IEEE EMBC 2025 Conference. 7 pages, 3 figures

English abstract

Metabolic Syndrome (MetS) is a cluster of interrelated risk factors that significantly increases the risk of cardiovascular diseases and type 2 diabetes. Despite its global prevalence, accurate prediction of MetS remains challenging due to issues such as class imbalance, data scarcity, and methodological inconsistencies in existing studies. In this paper, we address these challenges by systematically evaluating and optimizing machine learning (ML) models for MetS prediction, leveraging advanced data balancing techniques and counterfactual analysis. Multiple ML models, including XGBoost, Random Forest, TabNet, etc., were trained and compared under various data balancing techniques such as random oversampling (ROS), SMOTE, ADASYN, and CTGAN. Additionally, we introduce MetaBoost, a novel hybrid framework that integrates SMOTE, ADASYN, and CTGAN, optimizing synthetic data generation through weighted averaging and iterative weight tuning to enhance the model's performance (achieving up to a 1.87% accuracy improvement over individual balancing techniques). A comprehensive counterfactual analysis is conducted to quantify the feature-level changes required to shift individuals from high-risk to low-risk categories. The results indicate that blood glucose (50.3%) and triglycerides (46.7%) were the most frequently modified features, highlighting their clinical significance in MetS risk reduction. Additionally, probabilistic analysis shows elevated blood glucose (85.5% likelihood) and triglycerides (74.9% posterior probability) as the strongest predictors. This study not only advances the methodological rigor of MetS prediction but also provides actionable insights for clinicians and researchers, highlighting the potential of ML in mitigating the public health burden of metabolic syndrome.

2504.05698 2026-03-10 cs.CV

Point-based Instance Completion with Scene Constraints

Wesley Khademi, Li Fuxin

Comments Published in ICLR 2025. Project Page: https://wkhademi.github.io/point_based_instance_completion/

English abstract

Recent point-based object completion methods have demonstrated the ability to accurately recover the missing geometry of partially observed objects. However, these approaches are not well-suited for completing objects within a scene, as they do not consider known scene constraints (e.g., other observed surfaces) in their completions and further expect the partial input to be in a canonical coordinate system, which does not hold for objects within scenes. While instance scene completion methods have been proposed for completing objects within a scene, they lag behind point-based object completion methods in terms of object completion quality and still do not consider known scene constraints during completion. To overcome these limitations, we propose a point cloud-based instance completion model that can robustly complete objects at arbitrary scales and poses in the scene. To enable reasoning at the scene level, we introduce a sparse set of scene constraints represented as point clouds and integrate them into our completion model via a cross-attention mechanism. To evaluate the instance scene completion task on indoor scenes, we further build a new dataset called ScanWCF, which contains labeled partial scans as well as aligned ground truth scene completions that are watertight and collision-free. Through several experiments, we demonstrate that our method achieves improved fidelity to partial scans, higher completion quality, and greater plausibility over existing state-of-the-art methods.

2504.04700 2026-03-10 cs.CL

Causal Retrieval with Semantic Consideration

Hyunseo Shin, Wonseok Hwang

English abstract

Recent advancements in large language models (LLMs) have significantly enhanced the performance of conversational AI systems. To extend their capabilities to knowledge-intensive domains such as biomedical and legal fields, where accuracy is critical, LLMs are often combined with information retrieval (IR) systems to generate responses based on retrieved documents. However, for IR systems to effectively support such applications, they must go beyond simple semantic matching and accurately capture diverse query intents, including causal relationships. Existing IR models primarily focus on retrieving documents based on surface-level semantic similarity, overlooking deeper relational structures such as causality. To address this, we propose CAWAI, a retrieval model that is trained with dual objectives: semantic and causal relations. Our extensive experiments demonstrate that CAWAI outperforms various models on diverse causal retrieval tasks, especially under large-scale retrieval settings. We also show that CAWAI exhibits strong zero-shot generalization across scientific domain QA tasks.

2503.22233 2026-03-10 cs.LG cs.AI cs.CL

More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty

Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, Peishuo Su, Yitong Li

English abstract

We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, by leveraging our proposed EDU sampling strategy, we observe accuracy boosts from 64.7% to 67.3% for generative reasoning tasks, accompanied by a reduction of 32% in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
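The idea of anchoring step boundaries at high-entropy tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `threshold`, and the choice that a high-entropy token opens a new segment are all assumptions made for illustration. The input is assumed to be one next-token probability distribution per generated token.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def split_at_high_entropy(tokens, distributions, threshold=1.0):
    """Split a token sequence into segments.

    A new segment starts at every token whose predictive entropy exceeds
    `threshold` (hypothetical value; a real system would tune or learn it).
    """
    segments, current = [], []
    for tok, probs in zip(tokens, distributions):
        # Close the running segment when the model was uncertain here.
        if current and token_entropy(probs) > threshold:
            segments.append(current)
            current = []
        current.append(tok)
    if current:
        segments.append(current)
    return segments
```

In this sketch a uniform distribution over four tokens has entropy ln 4 ≈ 1.39 nats, so it would trigger a boundary at the default threshold, while a confident distribution such as (0.9, 0.1) would not.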

2503.15904 2026-03-10 cs.CL cs.AI

More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models

Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen

English abstract

Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases. This study introduces a novel evaluation framework to uncover gender biases in LLMs: using free-form storytelling to surface biases embedded within the models. A systematic analysis of ten prominent LLMs shows a consistent pattern of overrepresenting female characters across occupations, likely due to supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Paradoxically, despite this overrepresentation, the occupational gender distributions produced by these LLMs align more closely with human stereotypes than with real-world labor data. This highlights the challenge and importance of implementing balanced mitigation measures to promote fairness and prevent the establishment of potentially new biases. We release the prompts and LLM-generated stories on GitHub.

2503.14488 2026-03-10 cs.AI cs.SE

Engineering Systems for Data Analysis Using Interactive Structured Inductive Programming

Shraddha Surana, Ashwin Srinivasan, Michael Bain

Comments Accepted for publication in the 38th International Conference on Advanced Information Systems Engineering (CAiSE 2026)

English abstract

Engineering information systems for scientific data analysis presents significant challenges: complex workflows requiring exploration of large solution spaces, close collaboration with domain specialists, and the need for maintainable, interpretable implementations. Traditional manual development is time-consuming, while "No Code" approaches using large language models (LLMs) often produce unreliable systems. We present iProg, a tool implementing Interactive Structured Inductive Programming. iProg employs a variant of a '2-way Intelligibility' communication protocol to constrain collaborative system construction by a human and an LLM. Specifically, given a natural-language description of the overall data analysis task, iProg uses an LLM to first identify an appropriate decomposition of the problem into a declarative representation, expressed as a Data Flow Diagram (DFD). In a second phase, iProg then uses an LLM to generate code for each DFD process. In both stages, human feedback, mediated through the constructs provided by the communication protocol, is used to verify LLMs' outputs. We evaluate iProg extensively on two published scientific collaborations (astrophysics and biochemistry), demonstrating that it is possible to identify appropriate system decompositions and construct end-to-end information systems with better performance, higher code quality, and order-of-magnitude faster development compared to Low Code/No Code alternatives. The tool is available at: https://shraddhasurana.github.io/dhaani/

2503.11126 2026-03-10 cs.LG

MUSS: Multilevel Subset Selection for Relevance and Diversity

Vu Nguyen, Andrey Kan

Comments model in production at Amazon

English abstract

The problem of relevant and diverse subset selection has a wide range of applications, including recommender systems and retrieval-augmented generation (RAG). For example, in recommender systems, one is interested in selecting relevant items, while providing a diversified recommendation. The constrained subset selection problem is NP-hard, and popular approaches such as Maximum Marginal Relevance (MMR) are based on greedy selection. Many real-world applications involve large data, but the original MMR work did not consider distributed selection. This limitation was later addressed by a method called DGDS, which allows for a distributed setting using random data partitioning. Here, we exploit structure in the data to further improve both scalability and performance on the target application. We propose MUSS, a novel method that uses a multilevel approach to relevant and diverse selection. In a recommender system application, our method can not only improve the performance by up to $4$ percentage points in precision, but is also $20$ to $80$ times faster. Our method is also capable of outperforming baselines on RAG-based question answering accuracy. We present a novel theoretical approach for analyzing this type of problem, and show that our method achieves a constant factor approximation of the optimal objective. Moreover, our analysis also yields a $2\times$ tighter bound for DGDS compared to the previously known bound.
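For readers unfamiliar with the greedy baseline named in the abstract, the following is a minimal sketch of standard Maximum Marginal Relevance selection. It illustrates MMR only, not MUSS or DGDS. The function name `mmr_select`, the trade-off parameter `lam`, and the use of cosine similarity over nonzero embedding vectors are assumptions made for illustration.

```python
import numpy as np

def mmr_select(relevance, embeddings, k, lam=0.7):
    """Greedy MMR: choose k items, trading relevance against redundancy.

    relevance  : (n,) relevance score per candidate (assumed precomputed)
    embeddings : (n, d) nonzero item vectors; cosine similarity is assumed
    lam        : weight on relevance; (1 - lam) weights the redundancy penalty
    """
    relevance = np.asarray(relevance, dtype=float)
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = X @ X.T                                     # pairwise cosine similarity
    selected, remaining = [], list(range(len(relevance)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy is the similarity to the closest already-chosen item.
            redundancy = max(sim[i, j] for j in selected) if selected else 0.0
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each greedy step scans all remaining candidates against the selected set, which is one reason a single-machine MMR pass becomes expensive on large candidate pools.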

2503.10336 2026-03-10 cs.LG nlin.CD

Characterizing Nonlinear Dynamics via Smooth Prototype Equivalences

Roy Friedman, Noa Moriel, Matthew Ricci, Guy Pelc, Yair Weiss, Mor Nitzan

English abstract

Characterizing the long-term behavior of dynamical systems given limited measurements is a common challenge throughout the physical and biological sciences. The task is difficult because of the sparsity and noise inherent to empirical observations, as well as the variability of possible long-term dynamics. We address this by introducing smooth prototype equivalences (SPE), a framework for matching sparse observations to prototypical behaviors using invertible neural networks which model smooth phase space deformations. SPE can localize the invariant sets describing long-term behavior of the observed dynamics through the learned mapping from prototype space to data space. Furthermore, SPE can classify dynamical regimes by comparing the data residual of the deformed measurements to prototype dynamics. Our method outperforms existing techniques in the classification of oscillatory systems and can efficiently identify invariant structures like limit cycles and fixed points in an equation-free manner, even when only a small, noisy subset of the phase space is observed. SPE further reveals driving genes in synthetic oscillators such as the repressilator regulatory circuit, and traces cyclic biological processes like the cell cycle trajectory directly from experimental high-dimensional single-cell gene expression data.

2503.10110 2026-03-10 cs.RO cs.AI cs.LG

IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem Bıyık, Daniel Seita

English abstract

Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g. brushing a soft pillow) to more dangerous (e.g. toppling a glass vase), making it difficult to characterize which may be acceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach generates an anisotropic cost map that encodes directional push safety. We pair this map with a contact-aware A* planner to find stable contact-rich paths. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3200 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Our project website is available at https://impact-planning.github.io/.
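Pairing a cost map with an A* planner can be illustrated with a simplified sketch. It is not the IMPACT planner: the abstract describes an anisotropic (direction-dependent) cost map, whereas this sketch assumes a scalar, non-negative contact penalty per grid cell, a 4-connected grid, and a hypothetical `contact_weight` parameter. Cells marked `None` are treated as forbidden.

```python
import heapq

def astar_with_contact_cost(cost, start, goal, contact_weight=1.0):
    """A* on a 4-connected grid where entering a cell costs 1 + weight * penalty.

    cost[r][c] is None for cells that must not be touched, otherwise a
    non-negative penalty for making contact there (assumed scalar here).
    Returns (path, total_cost), or (None, inf) if the goal is unreachable.
    """
    rows, cols = len(cost), len(cost[0])

    def heuristic(p):
        # Manhattan distance is admissible because every step costs at least 1.
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if g > best[node]:
            continue  # stale queue entry
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or cost[nr][nc] is None:
                continue
            ng = g + 1.0 + contact_weight * cost[nr][nc]
            if ng < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = ng
                parent[(nr, nc)] = node
                heapq.heappush(frontier, (ng + heuristic((nr, nc)), ng, (nr, nc)))
    return None, float("inf")
```

Lowering `contact_weight` makes the planner more willing to cut through cells with a contact penalty, and raising it pushes the path toward contact-free detours.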

2503.03935 2026-03-10 cs.LG cs.AI

LLM-Powered Prediction of Hyperglycemia and Discovery of Behavioral Treatment Pathways from Wearables and Diet

Abdullah Mamun, Asiful Arefeen, Susan B. Racette, Dorothy D. Sears, Corrie M. Whisner, Matthew P. Buman, Hassan Ghasemzadeh

Comments 16 pages, 10 figures

English abstract

Postprandial hyperglycemia, marked by the blood glucose level exceeding the normal range after consuming a meal, is a critical indicator of progression toward type 2 diabetes in people with prediabetes and in healthy individuals. A key metric for understanding blood glucose dynamics after eating is the postprandial area under the curve (AUC). Predicting postprandial AUC in advance based on a person's lifestyle factors, such as diet and physical activity level, and explaining the factors that affect postprandial blood glucose could allow an individual to adjust their lifestyle accordingly to maintain normal glucose levels. In this study, we developed an explainable machine learning solution, GlucoLens, that takes sensor-driven inputs and uses advanced data processing, large language models, and trainable machine learning models to predict postprandial AUC and hyperglycemia from diet, physical activity, and recent glucose patterns. We used data obtained from wearables in a five-week clinical trial of 10 adults who worked full-time to develop and evaluate the proposed computational model that integrates wearable sensing, multimodal data, and machine learning. Our machine learning model takes multimodal data from wearable activity and glucose monitoring sensors, along with food and work logs, and provides an interpretable prediction of the postprandial glucose pattern. Our GlucoLens system achieves a normalized root mean squared error (NRMSE) of 0.123 in its best configuration. On average, the proposed technology provides a 16% better performance level compared to the comparison models. Additionally, our technique predicts hyperglycemia with an accuracy of 73.3% and an F1 score of 0.716 and recommends different treatment options to help avoid hyperglycemia through diverse counterfactual explanations. Code available: https://github.com/ab9mamun/GlucoLens.
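The two quantities named in the abstract, postprandial AUC and NRMSE, can be sketched as follows. This is an illustration under stated assumptions rather than the GlucoLens code: the trapezoidal rule, the units, and normalizing RMSE by the range of the true values are all assumptions (NRMSE is also commonly normalized by the mean or the standard deviation), and the function names are hypothetical.

```python
import math

def trapezoid_auc(times_min, glucose_mg_dl):
    """Area under a glucose curve by the trapezoidal rule, in mg/dL * min."""
    area = 0.0
    for i in range(1, len(times_min)):
        dt = times_min[i] - times_min[i - 1]
        area += 0.5 * (glucose_mg_dl[i] + glucose_mg_dl[i - 1]) * dt
    return area

def nrmse(y_true, y_pred):
    """RMSE divided by the range of y_true (assumed normalization).

    Requires y_true to contain at least two distinct values.
    """
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return rmse / (max(y_true) - min(y_true))
```

For example, readings of 100, 140, and 120 mg/dL taken at 0, 30, and 60 minutes give an area of 3600 + 3900 = 7500 mg/dL·min under this rule.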