arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1800
2603.06613 2026-03-11 cs.LG cs.AI cs.CV cs.NE

OptiRoulette Optimizer: A New Stochastic Meta-Optimizer for up to 5.3x Faster Convergence

Stamatis Mastromichalakis

Comments 23 pages, 10 figures, 7 tables

详情
英文摘要

This paper presents OptiRoulette, a stochastic meta-optimizer that selects update rules during training instead of fixing a single optimizer. The method combines warmup optimizer locking, random sampling from an active optimizer pool, compatibility-aware learning-rate scaling during optimizer transitions, and failure-aware pool replacement. OptiRoulette is implemented as a drop-in, "torch.optim.Optimizer-compatible" component and packaged for pip installation. We report completed 10-seed results on five image-classification suites: CIFAR-100, CIFAR-100-C, SVHN, Tiny ImageNet, and Caltech-256. Against a single-optimizer AdamW baseline, OptiRoulette improves mean test accuracy from 0.6734 to 0.7656 on CIFAR-100 (+9.22 percentage points), 0.2904 to 0.3355 on CIFAR-100-C (+4.52), 0.9667 to 0.9756 on SVHN (+0.89), 0.5669 to 0.6642 on Tiny ImageNet (+9.73), and 0.5946 to 0.6920 on Caltech-256 (+9.74). Its main advantage is convergence reliability at higher targets: it reaches CIFAR-100/CIFAR-100-C 0.75, SVHN 0.96, Tiny ImageNet 0.65, and Caltech-256 0.62 validation accuracy in 10/10 runs, while the AdamW baseline reaches none of these targets within budget. On shared targets, OptiRoulette also reduces time-to-target (e.g., Caltech-256 at 0.59: 25.7 vs 77.0 epochs). Paired-seed deltas are positive on all datasets; CIFAR-100-C test ROC-AUC is the only metric not statistically significant in the current 10-seed study.

2603.05698 2026-03-11 cs.CL

Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis

Hazem Amamou, Stéphane Gagnon, Alan Davoust, Anderson R. Avila

Comments Accepted at IEEE SMC 2025 (International Conference on Systems, Man, and Cybernetics)

详情
英文摘要

Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge graph based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real world scenarios.

2603.01549 2026-03-11 cs.CV cs.AI cs.RO

Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong Kim

详情
英文摘要

Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations. Project page: https://jiiiisoo.github.io/Pri4R/

2603.01479 2026-03-11 cs.RO

Multimodal Adversarial Quality Policy for Safe Grasping

Kunlin Xie, Chenghao Li, Haolan Zhang, Nak Young Chong

Comments submitted

详情
英文摘要

Vision-guided robot grasping based on Deep Neural Networks (DNNs) generalizes well but poses safety risks in the Human-Robot Interaction (HRI). Recent works solved it by designing benign adversarial attacks and patches with RGB modality, yet depth-independent characteristics limit their effectiveness on RGBD modality. In this work, we propose the Multimodal Adversarial Quality Policy (MAQP) to realize multimodal safe grasping. Our framework introduces two key components. First, the Heterogeneous Dual-Patch Optimization Scheme (HDPOS) mitigates the distribution discrepancy between RGB and depth modalities in patch generation by adopting modality-specific initialization strategies, employing a Gaussian distribution for depth patches and a uniform distribution for RGB patches, while jointly optimizing both modalities under a unified objective function. Second, the Gradient-Level Modality Balancing Strategy (GLMBS) is designed to resolve the optimization imbalance from RGB and Depth patches in patch shape adaptation by reweighting gradient contributions based on per-channel sensitivity analysis and applying distance-adaptive perturbation bounds. We conduct extensive experiments on the benchmark datasets and a cobot, showing the effectiveness of MAQP.

2602.22755 2026-03-11 cs.CL

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan Wang

详情
英文摘要

We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors--such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties--which it does not confess to when directly asked. AuditBench models are highly diverse--some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.

2602.16020 2026-03-11 cs.LG cond-mat.mtrl-sci

MolCrystalFlow: Molecular Crystal Structure Prediction via Flow Matching

Cheng Zeng, Harry W. Sullivan, Thomas Egg, Maya M. Martirossyan, Philipp Höllmer, Jirui Jin, Richard G. Hennig, Adrian Roitberg, Stefano Martiniani, Ellad B. Tadmor, Mingjie Liu

Comments 20 pages, 4 figures. 14 pages in SI. Code: https://github.com/Liu-Group-UF/MolCrystalFlow

详情
英文摘要

Molecular crystal structure prediction represents a grand challenge in computational chemistry due to large sizes of constituent molecules and complex intra- and intermolecular interactions. While generative modeling has revolutionized structure discovery for molecules, inorganic solids, and metal-organic frameworks, extending such approaches to fully periodic molecular crystals is still elusive. Here, we present MolCrystalFlow, a flow-based generative model for molecular crystal structure prediction. The framework disentangles intramolecular complexity from intermolecular packing by embedding molecules as rigid bodies and jointly learning the lattice matrix, molecular orientations, and centroid positions. Centroids and orientations are represented on their native Riemannian manifolds, allowing geodesic flow construction and graph neural network operations that respects geometric symmetries. We benchmark our model against a state-of-the-art generative model (MOFFlow) for large-size periodic crystals and a rule-based structure generation method (Genarris) on two open-source molecular crystal datasets. MolCrystalFlow outperforms MOFFlow while achieving competitive performance against Genarris. We also demonstrate an integration of MolCrystalFlow model with universal machine learning potential to accelerate molecular crystal structure prediction, paving the way for data-driven generative discovery of molecular crystals.

2602.03733 2026-03-11 cs.CV

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Wenfang Sun, Hao Chen, Yingjun Du, Yefeng Zheng, Cees G. M. Snoek

Comments Accepted by ICLR 2026

详情
英文摘要

Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.

2601.22444 2026-03-11 cs.LG cs.AI

Automating Forecasting Question Generation and Resolution for AI Evaluation

Nikos I. Bosse, Peter Mühlbacher, Jack Wildman, Lawrence Phillips, Dan Schwarz

Comments 41 pages, 4 figures

详情
英文摘要

Forecasting future events is highly valuable in decision-making and is a robust measure of general intelligence. As forecasting is probabilistic, developing and evaluating AI forecasters requires generating large numbers of diverse and difficult questions, and accurately resolving them. Previous efforts to automate this laborious work relied on recurring data sources (e.g., weather, stocks), limiting diversity and utility. In this work, we present a system for generating and resolving high-quality forecasting questions automatically and at scale using LLM-powered web research agents. We use this system to generate 1499 diverse, real-world forecasting questions, and to resolve them several months later. We estimate that our system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus, a leading human-curated forecasting platform. We also find that our system resolves questions at approximately 95% accuracy. We verify that forecasting agents powered by more intelligent LLMs perform better on these questions (Brier score of 0.134 for Gemini 3 Pro, 0.149 for GPT-5, and 0.179 for Gemini 2.5 Flash). Finally, we demonstrate how our system can be leveraged to directly improve forecasting, by evaluating a question decomposition strategy on a generated question set, yielding a significant improvement in Brier scores (0.132 vs. 0.141).

2601.20461 2026-03-11 cs.CV

Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection

Yanzhu Liu, Xiao Liu, Yuexuan Wang, Mondal Soumik

详情
英文摘要

With the rapid proliferation of powerful image generators, accurate detection of AI-generated images has become essential for maintaining a trustworthy online environment. However, existing deepfake detectors often generalize poorly to images produced by unseen generators. Notably, despite being trained under vastly different paradigms, such as diffusion or autoregressive modeling, many modern image generators share common final architectural components that serve as the last stage for converting intermediate representations into images. Motivated by this insight, we propose to "contaminate" real images using the generator's final component and train a detector to distinguish them from the original real images. We further introduce a taxonomy based on generators' final components and categorize 21 widely used generators accordingly, enabling a comprehensive investigation of our method's generalization capability. Using only 100 samples from each of three representative categories, our detector-fine-tuned on the DINOv3 backbone-achieves an average accuracy of 98.83% across 22 testing sets from unseen generators.

2601.11915 2026-03-11 cs.CV

Low-rank Orthogonal Subspace Intervention for Generalizable Face Forgery Detection

Chi Wang, Xinjue Hu, Boyu Wang, Ziwen He, Zhangjie Fu

详情
英文摘要

The generalization problem remains a key challenge in face forgery detection. This paper explores the reasons for the generalization failure of Vanilla CLIP: in ``real vs. fake" detection, the few dominant principal components in the feature space primarily encode forgery-irrelevant information, rather than authentic forgery traces. However, this irrelevant information inevitably leads to spurious correlations, severely limiting detector performance. We define this phenomenon as ``low-rank spurious bias". To address this, we propose a low-rank representation space intervention paradigm, named the SeLop, from the perspective of causal representation learning. SeLop unifies the spurious correlation factors irrelevant to forgery into a low-rank subspace and cuts off the statistical shortcut between it and the label, thus aligning representation learning with authentic forgery traces. Specifically, we decompose spurious correlation features into a low-rank subspace through orthogonal low-rank projection, then remove this subspace from the original representation and train its orthogonal complement to capture forgery-related features. This low-rank projection removal effectively eliminates spurious correlation factors, ensuring that classification decisions are based on authentic forgery cues. With only 0.39M trainable parameters, our method achieves state-of-the-art performance across several benchmarks, demonstrating excellent robustness and generalization.

2512.15943 2026-03-11 cs.AI

Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning

Polaris Jhandi, Owais Kazi, Shreyas Subramanian, Neel Sendas

Comments Accepted at AAAI 2026 Workshop on Agentic AI Benchmarks and Applications for Enterprise Tasks

详情
英文摘要

As organizations scale adoption of generative AI, model cost optimization and operational efficiency have emerged as critical factors determining sustainability and accessibility. While Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, their extensive computational requirements make them cost-prohibitive for routine enterprise use. This limitation motivates the exploration of Small Language Models (SLMs), which can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead (Irugalbandara et al., 2023). In this work, we investigate the feasibility of replacing LLM-driven workflows with optimized SLMs. We trained a domain-adapted SLM to execute representative tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation. As part of the experiment, we investigated the fine-tuning of facebook/opt-350m model (single epoch only) using the Hugging Face TRL (Transformer Reinforcement Learning), specifically the Supervised Fine-Tuning (SFT) trainer. The OPT-350M model was released by Meta AI in 2022 as part of the OPT (Open Pretrained Transformer) family of models. Similar studies demonstrate that even models at the 350M parameter scale can meaningfully contribute to instruction-tuning pipelines (Mekala et al., 2024). Experimental results demonstrated that our fine-tuned SLM achieves exceptional performance with a 77.55\% pass rate on ToolBench evaluation, significantly outperforming all baseline models including ChatGPT-CoT (26.00\%), ToolLLaMA-DFS (30.18\%), and ToolLLaMA-CoT (16.27\%). These findings emphasize that thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems.

2511.00153 2026-03-11 cs.RO

EgoMI: Learning Active Vision and Whole-Body Manipulation from Egocentric Human Demonstrations

Justin Yu, Yide Shentu, Di Wu, Pieter Abbeel, Ken Goldberg, Philipp Wu

详情
英文摘要

Imitation learning from human demonstrations offers a promising approach for robot skill acquisition, but egocentric human data introduces fundamental challenges due to the embodiment gap. During manipulation, humans actively coordinate head and hand movements, continuously reposition their viewpoint and use pre-action visual fixation search strategies to locate relevant objects. These behaviors create dynamic, task-driven head motions that static robot sensing systems cannot replicate, leading to a significant distribution shift that degrades policy performance. We present EgoMI (Egocentric Manipulation Interface), a framework that captures synchronized end-effector and active head trajectories during manipulation tasks, resulting in data that can be retargeted to compatible semi-humanoid robot embodiments. To handle rapid and wide-spanning head viewpoint changes, we introduce a memory-augmented policy that selectively incorporates historical observations. We evaluate our approach on a bimanual robot equipped with an actuated camera head and find that policies with explicit head-motion modeling consistently outperform baseline methods. Results suggest that coordinated hand-eye learning with EgoMI effectively bridges the human-robot embodiment gap for robust imitation learning on semi-humanoid embodiments. Project page: https://egocentric-manipulation-interface.github.io

2510.24427 2026-03-11 cs.CL

SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff

Comments ICLR 2026

详情
英文摘要

Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.

2510.15242 2026-03-11 cs.LG

Bradley-Terry Policy Optimization for Generative Preference Modeling

Shengyu Feng, Yun He, Shuang Ma, Beibin Li, Yuanhao Xiong, Songlin Li, Karishma Mandyam, Julian Katz-Samuels, Shengjie Bi, Licheng Yu, Hejia Zhang, Karthik Abinav Sankararaman, Han Fang, Yiming Yang, Manaal Faruqui

详情
英文摘要

Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models for tasks with verifiable answers. However, extending RL-based thought training to more general non-verifiable tasks-where supervision is provided only through pairwise human preferences-remains challenging. Existing approaches typically apply RL objectives designed for verifiable rewards to preference-based settings in a heuristic manner. In this work, we show that introducing CoT reasoning into preference modeling fundamentally changes the structure of the Bradley-Terry (BT) likelihood, as the reasoning process must be treated as a latent variable. This results in a preference likelihood expressed as a ratio of expectations over stochastic generation trajectories, which cannot be optimized using Jensen-style bounds or standard RL objectives. To address this challenge, we derive a consistent Monte Carlo estimator for the gradient of the resulting likelihood, leading to Bradley-Terry Policy Optimization (BTPO). Empirically, BTPO enables stable and effective training of generative preference models with CoT reasoning, consistently outperforming prior heuristic approaches across multiple benchmarks and model scales.

2510.08173 2026-03-11 cs.RO cs.AI cs.CL cs.CV

NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions

Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang, Jiapeng Xu, Yihan Wang, Ziyan Yu, Wenzhe Cai, Lei Kang, Hao Dong

Comments ICRA 2026

详情
英文摘要

Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.

2510.06261 2026-03-11 cs.AI cs.CL cs.LG

AlphaApollo: A System for Deep Agentic Reasoning

Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, Bo Han

Comments Ongoing project

详情
英文摘要

We present AlphaApollo, an agentic reasoning system that targets two bottlenecks in foundation-model reasoning: (1) limited reasoning capacity for complex, long-horizon problem solving and (2) unreliable test-time evolution without trustworthy verification. AlphaApollo orchestrates models and tools via three components: (i) multi-turn agentic reasoning, which formalizes model-environment interaction with structured tool calls and responses; (ii) multi-turn agentic learning, which applies turn-level reinforcement learning to optimize tool-use reasoning while decoupling actions from tool responses for stable training; and (iii) multi-round agentic evolution, which refines solutions through a propose-judge-update loop with tool-assisted verifications and long-horizon memory. Across seven math reasoning benchmarks and multiple model scales, AlphaApollo improves performance through reliable tool use (> 85% tool-call success), substantial gains from multi-turn RL (Avg@32: Qwen2.5-1.5B-Instruct 1.07% -> 9.64%, Qwen2.5-7B-Instruct 8.77% -> 20.35%), and improvements from evolution (e.g., Qwen2.5-3B-Instruct 5.27% -> 7.70%, Qwen2.5-14B-Instruct 16.53% -> 21.08%). This project is still ongoing. We welcome feedback from the community and will frequently update the source code and technical report.

2510.06195 2026-03-11 cs.CL cs.AI cs.LG eess.AS

Latent Speech-Text Transformer

Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le

Comments Accepted to ICLR 2026 (Oral)

详情
英文摘要

Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to the much longer sequences of speech tokens relative to text. This modality imbalance disproportionately allocates pre-training and inference compute to speech, potentially hindering effective cross-modal alignment and slowing performance scaling by orders of magnitude. We introduce the Latent Speech-Text Transformer (LST), which aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units. This design aligns the sequence-modeling granularity between speech and text while improving computational efficiency. The resulting patches can align with textual units to facilitate cross-modal knowledge transfer and compactly capture recurring acoustic patterns such as silence. Across story-completion benchmarks under both compute-controlled and data-controlled settings, LST consistently improves speech accuracy while also improving text performance, achieving up to +6.5% absolute gain on speech HellaSwag in compute-controlled training (+5.3% in data-controlled training). Under compute-controlled scaling from 420M to 1.8B parameters in a near compute-optimal regime, gains grow with scale, and improvements persist up to 7B parameters under fixed-token budgets. These benefits extend to downstream tasks: LST stabilizes ASR adaptation and reduces the effective autoregressive sequence length during ASR and TTS inference, lowering computational cost without degrading reconstruction quality. The code is available at https://github.com/facebookresearch/lst.

2510.03504 2026-03-11 cs.RO

Connectivity Maintenance and Recovery for Multi-Robot Motion Planning

Yutong Wang, Lishuo Pan, Yichun Qu, Tengxiang Wang, Nora Ayanian

详情
英文摘要

Connectivity is crucial in many multi-robot applications, yet balancing between maintaining it and the fleet's traversability in obstacle-rich environments remains a challenge. Reactive controllers, such as control barrier functions, while providing connectivity guarantees, often struggle to traverse obstacle-rich environments due to deadlocks. We propose a real-time Bézier-based constrained motion planning algorithm, namely, MPC--CLF--CBF, that produces trajectory and control concurrently, under high-order control barrier functions and control Lyapunov functions conditions. Our motion planner significantly improves the navigation success rate of connected fleets in a cluttered workspace and recovers after inevitable connection loss by bypassing obstacles or from an initially disconnected fleet configuration. In addition, our predictive motion planner, owing to its Bézier curve solution, can easily obtain continuous-time arbitrary orders of derivatives, making it suitable for agile differentially flat systems, such as quadrotors. We validate the proposed algorithm through simulations and a physical experiment with $8$ Crazyflie nano-quadrotors.

2510.02097 2026-03-11 cs.CV

Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques

Walid Rabehi, Marion Le Texier, Rémi Lemoy

详情
英文摘要

Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.

2510.00172 2026-03-11 cs.CL

DRBench: A Realistic Benchmark for Enterprise Deep Research

Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh, Étienne Marcotte, Xing Han Lù, Nicolas Chapados, Spandana Gella, Peter West, Giuseppe Carenini, Christopher Pal, Alexandre Drouin, Issam H. Laradji

详情
英文摘要

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 100 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code and data are available at https://github.com/ServiceNow/drbench.

2509.21609 2026-03-11 cs.CV cs.LG

VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment

Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George

Comments 28 pages, 30 figures, 1 algorithms

详情
英文摘要

General-purpose vision-language models (VLMs) such as LLaVA and QwenVL produce descriptions of disaster imagery that lack domain-specific vocabulary and actionable detail. We propose the Vision-Language Caption Enhancer (VLCE), a framework that integrates external semantic knowledge from ConceptNet and WordNet into the caption generation process for post-disaster satellite and UAV imagery. VLCE operates in two stages: first, a baseline VLM generates an initial caption conditioned on YOLOv8 object detections; second, a knowledge-enriched sequential model, a CNN-LSTM or a hierarchical cross-modal Transformer, refines the caption using a vocabulary augmented with 1,566 domain-relevant terms extracted from knowledge graphs. We evaluate VLCE on two disaster benchmarks: xBD (satellite, 6,369 images, 3 damage classes) and RescueNet (UAV, 4,494 images, 12 damage classes), using CLIPScore for semantic alignment and InfoMetIC for informativeness. On RescueNet with the Transformer decoder, VLCE with knowledge graph enrichment produces captions preferred over QwenVL baselines in 95.33% of image pairs on InfoMetIC and 73.64% on CLIPScore. Qualitative analysis shows that without knowledge graph integration, generated captions exhibit hallucinations, word repetition, and semantic incoherence, whereas knowledge-enriched captions maintain factual consistency and domain-appropriate vocabulary.

2509.18057 2026-03-11 cs.LG cs.AI cs.CC math.CO

Reinforced Generation of Combinatorial Structures: Hardness of Approximation

Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta

详情
英文摘要

Can AI based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as $163$ vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of $0.987$ and $0.9649$ respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of $0.9853$, but falls short of the SOTA of $16/17$ that relies on a custom PCP (rather than a reduction from ``standard'' Håstad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of $111/110$ using AlphaEvolve to discover a new gadget, thus improving the SOTA of $117/116$. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by $10,000\times$ for our gadgets). Our results suggest that gadget based proofs would benefit from a pass through AI-based tools to obtain stronger results.

2509.15339 2026-03-11 cs.CL

Quantifying Genuine Awareness in Hallucination Prediction Beyond Question-Side Shortcuts

Yeongbin Seo, Dongha Lee, Jinyoung Yeo

详情
英文摘要

Many works have proposed methodologies for language model (LM) hallucination detection and reported seemingly strong performance. However, we argue that the reported performance to date reflects not only a model's genuine awareness of its internal information, but also awareness derived purely from question-side information (e.g., benchmark hacking). While benchmark hacking can be effective for boosting hallucination detection score on existing benchmarks, it does not generalize to out-of-domain settings and practical usage. Nevertheless, disentangling how much of a model's hallucination detection performance arises from question-side awareness is non-trivial. To address this, we propose a methodology for measuring this effect without requiring human labor, Approximate Question-side Effect (AQE). Our analysis using AQE reveals that existing hallucination detection methods rely heavily on benchmark hacking.

2509.14342 2026-03-11 cs.RO

Multi-Quadruped Cooperative Object Transport: Learning Decentralized Pinch-Lift-Move

Bikram Pandit, Aayam Kumar Shrestha, Alan Fern

Comments Accepted to ICRA 2026. Project page: https://decplm.github.io

详情
英文摘要

We study decentralized cooperative transport using teams of N-quadruped robots with arm that must pinch, lift, and move ungraspable objects through physical contact alone. Unlike prior work that relies on rigid mechanical coupling between robots and objects, we address the more challenging setting where mechanically independent robots must coordinate through contact forces alone without any communication or centralized control. To this end, we employ a hierarchical policy architecture that separates base locomotion from arm control, and propose a constellation reward formulation that unifies position and orientation tracking to enforce rigid contact behavior. The key insight is encouraging robots to behave as if rigidly connected to the object through careful reward design and training curriculum rather than explicit mechanical constraints. Our approach enables coordination through shared policy parameters and implicit synchronization cues - scaling to arbitrary team sizes without retraining. We show extensive simulation experiments to demonstrate robust transport across 2-10 robots on diverse object geometries and masses, along with sim2real transfer results on lightweight objects.

2509.00544 2026-03-11 cs.CL

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He

Comments ICLR 2026

详情
英文摘要

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities strengthened-particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.

2508.17366 2026-03-11 cs.AI cs.CY cs.MA

Computational Multi-Agents Society Experiments: Social Modeling Framework Based on Generative Agents

Hanzhong Zhang, Muhua Huang, Jindong Wang

Comments 20 pages, 3 figures

详情
英文摘要

This paper introduces CMASE, a framework for Computational Multi-Agent Society Experiments that integrates generative agent-based modeling with virtual ethnographic methods to support researcher embedding, interactive participation, and mechanism-oriented intervention in virtual social environments. By transforming the simulation into a simulated ethnographic field, CMASE shifts the researcher from an external operator to an embedded participant. Specifically, the framework is designed to achieve three core capabilities: (1) enabling real-time human-computer interaction that allows researchers to dynamically embed themselves into the system to characterize complex social intervention processes; (2) reconstructing the generative logic of social phenomena by combining the rigor of computational experiments with the interpretative depth of traditional ethnography; and (3) providing a predictive foundation with causal explanatory power to make forward-looking judgments without sacrificing empirical accuracy. Experimental results show that CMASE can not only simulate complex phenomena, but also generate behavior trajectories consistent with both statistical patterns and mechanistic explanations. These findings demonstrate CMASE's methodological value for intervention modeling, highlighting its potential to advance interdisciplinary integration in the social sciences. The official code is available at: https://github.com/armihia/CMASE .

2508.12296 2026-03-11 cs.RO

A robust and compliant robotic assembly control strategy for batch precision assembly task with uncertain fit types and fit amounts

Bin Wang, Jiwen Zhang, Song Wang, Dan Wu

详情
Journal ref
IEEE Transactions on Automation Science and Engineering, vol. 23, pp. 5502-5515, 2026
英文摘要

In some high-precision industrial applications, robots are deployed to perform precision assembly tasks on mass batches of manufactured pegs and holes. If the peg and hole are designed with transition fit, machining errors may lead to either a clearance or an interference fit for a specific pair of components, with uncertain fit amounts. This paper focuses on the robotic batch precision assembly task involving components with uncertain fit types and fit amounts, and proposes an efficient methodology to construct the robust and compliant assembly control strategy. Specifically, the batch precision assembly task is decomposed into multiple deterministic subtasks, and a force-vision fusion controller-driven reinforcement learning method and a multi-task reinforcement learning training method (FVFC-MTRL) are proposed to jointly learn multiple compliance control strategies for these subtasks. Subsequently, the multi-teacher policy distillation approach is designed to integrate multiple trained strategies into a unified student network, thereby establishing a robust control strategy. Real-world experiments demonstrate that the proposed method successfully constructs the robust control strategy for high-precision assembly task with different fit types and fit amounts. Moreover, the MTRL framework significantly improves training efficiency, and the final developed control strategy achieves superior force compliance and higher success rate compared with many existing methods.

2508.11144 2026-03-11 cs.LG

CTRL Your Shift: Clustered Transfer Residual Learning for Many Small Datasets

Gauri Jain, Dominik Rothenhäusler, Kirk Bansak, Elisabeth Paulson

详情
英文摘要

Machine learning (ML) tasks often utilize large-scale data that is drawn from several distinct sources, such as different locations, treatment arms, or groups. In such settings, practitioners often desire predictions that not only exhibit good overall accuracy, but also remain reliable within each source and preserve the differences that matter across sources. For instance, several asylum and refugee resettlement programs now use ML-based employment predictions to guide where newly arriving families are placed within a host country, which requires generating informative and differentiated predictions for many and often small source locations. However, this task is made challenging by several common characteristics of the data in these settings: the presence of numerous distinct data sources, distributional shifts between them, and substantial variation in sample sizes across sources. This paper introduces Clustered Transfer Residual Learning (CTRL), a meta-learning method that combines the strengths of cross-domain residual learning and adaptive pooling/clustering in order to simultaneously improve overall accuracy and preserve source-level heterogeneity. We establish new theory showing that high-quality clusters can be learned efficiently, bypassing the need for repeated model refitting over candidate subsets. We evaluate CTRL alongside other state-of-the-art benchmarks on 5 large-scale datasets. This includes a dataset from the national asylum program in Switzerland, where the algorithmic geographic assignment of asylum seekers is currently being piloted. CTRL consistently outperforms the benchmarks across several key metrics and when using a range of different base learners.

2507.18311 2026-03-11 cs.CV

Improving Large Vision-Language Models' Understanding for Flow Field Data

Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang

Comments Accepted by Machine Intelligence Research

详情
英文摘要

Large Vision-Language Models (LVLMs) have shown impressive capabilities across a range of tasks that integrate visual and textual understanding, such as image captioning and visual question answering. These models are trained on large-scale image and video datasets paired with text, enabling them to bridge visual perception and natural language processing. However, their application to scientific domains, especially in interpreting complex field data commonly used in the natural sciences, remains underexplored. In this work, we introduce FieldLVLM, a novel framework designed to improve large vision-language models' understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning. The field-aware language generation strategy leverages a special-purpose machine learning pipeline to extract key physical features from field data, such as flow classification, Reynolds number, and vortex patterns. This information is then converted into structured textual descriptions that serve as a dataset. The data-compressed multimodal model tuning focuses on LVLMs with these generated datasets, using a data compression strategy to reduce the complexity of field inputs and retain only the most informative values. This ensures compatibility with the models language decoder and guides its learning more effectively. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data. Our findings suggest that this approach opens up new possibilities for applying large vision-language models to scientific research, helping bridge the gap between large models and domain-specific discovery.

2506.15304 2026-03-11 cs.CL cs.AI cs.LG

ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut

Comments EACL 2026 - Main Conference

详情
英文摘要

Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining its performance for the high-resource languages.