arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2393
专题追踪
2602.01848 2026-02-17 cs.AI cs.MA

ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems

Salaheddin Alzu'bi, Baran Nama, Arda Kaz, Anushri Eswaran, Weiyuan Chen, Sarvesh Khetan, Rishab Bala, Tu Vu, Sewoong Oh

详情
英文摘要

Current agentic frameworks underperform on long-horizon tasks. As reasoning depth increases, sequential orchestration becomes brittle, context windows impose hard limits that degrade performance, and opaque execution traces make failures difficult to localize or debug. We introduce ROMA (Recursive Open Meta-Agents), a domain-agnostic framework that addresses these limitations through recursive task decomposition and structured aggregation. ROMA decomposes goals into dependency-aware subtask trees that can be executed in parallel, while aggregation compresses and validates intermediate results to control context growth. Our framework standardizes agent construction around four modular roles --Atomizer (which decides whether a task should be decomposed), Planner, Executor, and Aggregator -- which cleanly separate orchestration from model selection and enable transparent, hierarchical execution traces. This design supports heterogeneous multi-agent systems that mix models and tools according to cost, latency, and capability. To adapt ROMA to specific tasks without fine-tuning, we further introduce GEPA$+$, an improved Genetic-Pareto prompt proposer that searches over prompts within ROMA's component hierarchy while preserving interface contracts. We show that ROMA, combined with GEPA+, delivers leading system-level performance on reasoning and long-form generation benchmarks. On SEAL-0, which evaluates reasoning over conflicting web evidence, ROMA instantiated with GLM-4.6 improves accuracy by 9.9\% over Kimi-Researcher. On EQ-Bench, a long-form writing benchmark, ROMA enables DeepSeek-V3 to match the performance of leading closed-source models such as Claude Sonnet 4.5. Our results demonstrate that recursive, modular agent architectures can scale reasoning depth while remaining interpretable, flexible, and model-agnostic.

2601.23232 2026-02-17 cs.CV cs.AI

ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

Tao Yu, Haopeng Jin, Hao Wang, Shenghua Chai, Yujia Yang, Junhao Gong, Jiaming Guo, Minghui Zhang, Xinlong Chen, Zhenghao Zhang, Yuxuan Zhou, Yufei Xiong, Shanbin Zhang, Jiabing Yang, Hongzhu Yi, Xinming Wang, Cheng Zhong, Xiao Ma, Zhang Zhang, Yan Huang, Liang Wang

Comments 28 pages, 7 figures, Project website: https://github.com/yutao1024/ShotFinder

详情
英文摘要

In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to overcome.

2601.22323 2026-02-17 cs.LG

Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning

Qi Cao, Shuhao Zhang, Ruizhe Zhou, Ruiyi Zhang, Peijia Qin, Pengtao Xie

Comments We propose SCOPE, a model routing framework that predicts how accurate and how expensive each model will be before running it, allowing users to control cost-accuracy trade-offs and naturally handle new models

详情
英文摘要

Model routing chooses which language model to use for each query. By sending easy queries to cheaper models and hard queries to stronger ones, it can significantly reduce inference cost while maintaining high accuracy. However, most existing routers treat this as a fixed choice among a small set of models, which makes them hard to adapt to new models or changing budget constraints. In this paper, we propose SCOPE (Scalable and Controllable Outcome Performance Estimator), a routing framework that goes beyond model selection by predicting their cost and performance. Trained with reinforcement learning, SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names, enabling it to work with new, unseen models. Moreover, by explicitly predicting how accurate and how expensive a model will be, it turns routing into a dynamic decision problem, allowing users to easily control the trade-off between accuracy and cost. Experiments show that SCOPE is more than just a cost-saving tool. It flexibly adapts to user needs: it can boost accuracy by up to 25.7% when performance is the priority, or cut costs by up to 95.1% when efficiency matters most.

2601.16443 2026-02-17 cs.LG cs.CL

Endless Terminals: Scaling RL Environments for Terminal Agents

Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos

详情
英文摘要

Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to human-curated benchmarks: models trained on Endless Terminals show substantial gains on held out human curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.

2601.08005 2026-02-17 cs.AI

Internal Deployment Gaps in AI Regulation

Joe Kwon, Stephen Casper

详情
英文摘要

Frontier AI regulations primarily focus on systems deployed to external users, where deployment is more visible and subject to outside scrutiny. However, high-stakes applications can occur internally when companies deploy highly capable systems within their own organizations, such as for automating R&D, accelerating critical business processes, and handling sensitive proprietary data. This paper examines how frontier AI regulations in the United States and European Union in 2025 handle internal deployment. We identify three gaps that could cause internally-deployed systems to evade intended oversight: (1) scope ambiguity that allows internal systems to evade regulatory obligations, (2) point-in-time compliance assessments that fail to capture the continuous evolution of internal systems, and (3) information asymmetries that subvert regulatory awareness and oversight. We then analyze why these gaps persist, examining tensions around measurability, incentives, and information access. Finally, we map potential approaches to address them and their associated tradeoffs. By understanding these patterns, we hope that policy choices around internally deployed AI systems can be made deliberately rather than incidentally.

2601.05525 2026-02-17 cs.AI cs.LG physics.comp-ph physics.soc-ph

Explainable AI: Learning from the Learners

Ricardo Vinuesa, Steven L. Brunton, Gianmarco Mengaldo

详情
英文摘要

Artificial intelligence now outperforms humans in several scientific and engineering tasks, yet its internal representations often remain opaque. In this Perspective, we argue that explainable artificial intelligence (XAI), combined with causal reasoning, enables {\it learning from the learners}. Focusing on discovery, optimization and certification, we show how the combination of foundation models and explainability methods allows the extraction of causal mechanisms, guides robust design and control, and supports trust and accountability in high-stakes applications. We discuss challenges in faithfulness, generalization and usability of explanations, and propose XAI as a unifying framework for human-AI collaboration in science and engineering.

2601.02456 2026-02-17 cs.RO

InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

Junhao Cai, Zetao Cai, Jiafei Cao, Yilun Chen, Zeyu He, Lei Jiang, Hang Li, Hengjie Li, Yang Li, Yufei Liu, Yanan Lu, Qi Lv, Haoxiang Ma, Jiangmiao Pang, Yu Qiao, Zherui Qiu, Yanqing Shen, Xu Shi, Yang Tian, Bolun Wang, Hanqing Wang, Jiaheng Wang, Tai Wang, Xueyuan Wei, Chao Wu, Yiman Xie, Boyang Xing, Yuqiang Yang, Yuyin Yang, Qiaojun Yu, Feng Yuan, Jia Zeng, Jingjing Zhang, Shenghan Zhang, Shi Zhang, Zhuoma Zhaxi, Bowen Zhou, Yuanzhen Zhou, Yunsong Zhou, Hongrui Zhu, Yangkun Zhu, Yuchen Zhu

Comments Homepage: https://internrobotics.github.io/internvla-a1.github.io/

详情
英文摘要

Prevalent Vision-Language-Action (VLA) models are typically built upon Multimodal Large Language Models (MLLMs) and demonstrate exceptional proficiency in semantic understanding, but they inherently lack the capability to deduce physical world dynamics. Consequently, recent approaches have shifted toward World Models, typically formulated via video prediction; however, these methods often suffer from a lack of semantic grounding and exhibit brittleness in the presence of video prediction errors. To synergize semantic understanding with dynamic predictive capabilities, we present InternVLA-A1. This model employs a unified Mixture-of-Transformers architecture, coordinating three experts for scene understanding, visual foresight generation, and action execution. These components interact seamlessly through a unified masked self attention mechanism. Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. We pre-train these models on heterogeneous data sources over real-world robot data, synthetic simulation data, and human videos, covering over 692M frames. This hybrid training strategy effectively harnesses the diversity of synthetic simulation data while minimizing the sim-to-real gap. We evaluated InternVLA-A1 on 12 real-world robotic tasks and a simulation benchmark. The results show that InternVLA-A1 consistently outperforms prior leading models: compared with pi0.5, it achieves +4.4\% on static manipulation tasks and +2.6\% on the RoboTwin 2.0 simulation benchmark, and delivers a +26.7\% boost on dynamic manipulation tasks.

2512.20352 2026-02-17 cs.CL cs.AI

Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation

Nilesh Jain, Hyungil Suh, Seyi Adeyinka, Leor Roseman, Aza Allsop

Comments 11 pages, 1 figure, 3 tables

详情
英文摘要

Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa ($κ$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate Gemini achieves highest reliability ($κ= 0.907$, cosine=95.3%), followed by GPT-4o ($κ= 0.853$, cosine=92.6%) and Claude ($κ= 0.842$, cosine=92.1%). All three models achieve a high agreement ($κ> 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.

2512.19027 2026-02-17 cs.AI cs.LG

Recontextualization Mitigates Specification Gaming without Modifying the Specification

Ariana Azarbal, Victor Gillioz, Vladimir Ivanov, Bryce Woodworth, Jacob Drori, Nevan Wichers, Aram Ebtekar, Alex Cloud, Alexander Matt Turner

Comments v2 adds a new experimental setting (in place of the lie detector setting)

详情
英文摘要

Developers often struggle to specify correct training labels and rewards. Perhaps they don't need to. We propose recontextualization, which reduces how often language models "game" training signals, performing misbehaviors those signals mistakenly reinforce. We show recontextualization prevents models from learning to 1) prioritize evaluation metrics over chat response quality; 2) special-case code to pass incorrect tests; 3) overwrite evaluation functions rather than write correct code; and 4) become sycophantic. Our method works by generating completions from prompts discouraging misbehavior and then recontextualizing them as though they were in response to prompts permitting misbehavior. Recontextualization trains language models to resist misbehavior even when instructions permit it. This mitigates the reinforcement of misbehavior from misspecified training signals, reducing specification gaming without improving the supervision signal.

2512.18956 2026-02-17 cs.AI cs.LG

Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection

Yizhi Wang, Linan Yue, Min-Ling Zhang

详情
英文摘要

Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Extending these successes to multimodal reasoning remains challenging due to the increased complexity of integrating diverse input modalities and the scarcity of high-quality long CoT training data. Existing multimodal datasets and CoT synthesis methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines, hindering model performance and stability. To this end, in this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks. Specifically, SynSelect first leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, and then applies both instance and batch level selection to filter high-quality CoTs that can effectively enhance the model's reasoning capabilities. Extensive experiments on multiple multimodal benchmarks demonstrate that models supervised fine-tuned on SynSelect-generated data significantly outperform baselines and achieve further improvements after reinforcement learning post-training. Our results validate SynSelect as an effective approach for advancing multimodal LRMs reasoning capabilities.

2512.12206 2026-02-17 cs.CV cs.AI cs.LG

ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB

Jeongjun Park, Sunwook Hwang, Hyeonho Noh, Jin Mo Yang, Hyun Jong Yang, Saewoong Bahk

Comments Published in IEEE Access. DOI: 10.1109/ACCESS.2026.3663636 This version reflects the peer-reviewed and published manuscript

详情
英文摘要

Distracted driving contributes to fatal crashes worldwide. To address this, researchers are using driver activity recognition (DAR) with impulse radio ultra-wideband (IR-UWB) radar, which offers advantages such as interference resistance, low power consumption, and privacy preservation. However, two challenges limit its adoption: the lack of large-scale real-world UWB datasets covering diverse distracted driving behaviors, and the difficulty of adapting fixed-input Vision Transformers (ViTs) to UWB radar data with non-standard dimensions. This work addresses both challenges. We present the ALERT dataset, which contains 10,220 radar samples of seven distracted driving activities collected in real driving conditions. We also propose the input-size-agnostic Vision Transformer (ISA-ViT), a framework designed for radar-based DAR. The proposed method resizes UWB data to meet ViT input requirements while preserving radar-specific information such as Doppler shifts and phase characteristics. By adjusting patch configurations and leveraging pre-trained positional embedding vectors (PEVs), ISA-ViT overcomes the limitations of naive resizing approaches. In addition, a domain fusion strategy combines range- and frequency-domain features to further improve classification performance. Comprehensive experiments demonstrate that ISA-ViT achieves a 22.68% accuracy improvement over an existing ViT-based approach for UWB-based DAR. By publicly releasing the ALERT dataset and detailing our input-size-agnostic strategy, this work facilitates the development of more robust and scalable distracted driving detection systems for real-world deployment.

2512.09407 2026-02-17 cs.CV

Geometry-to-Image Synthesis-Driven Generative Point Cloud Registration

Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng

Comments Journal extension of the ICML 2025 paper "Generative Point Cloud Registration". This version adopts a new title, and includes substantial methodological improvements, additional experiments, and extended analysis. Under review at IEEE TPAMI

详情
英文摘要

In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometry-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To this end, we introduce DepthMatch-ControlNet and LiDARMatch-ControlNet, two matching-specific, controllable 2D generative models. Specifically, for depth camera-based 3D registration with point clouds derived from the depth maps, DepthMatch-ControlNet leverages the depth-conditioned generation capabilities of ControlNet to synthesize perspective-view RGB images that are geometrically consistent with depth maps, ensuring accurate 2D-3D alignment. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, it further promotes cross-view feature interaction, guiding texture consistency generation. To address LiDAR-based 3D registration with point clouds captured by LiDAR sensors, LiDARMatch-ControlNet extends this framework by conditioning on paired equirectangular range maps projected from 360-degree LiDAR point clouds, generating corresponding panoramic RGB images. Our generative 3D registration paradigm is general and can be seamlessly integrated into a wide range of existing registration methods to improve their performance. Extensive experiments on the 3DMatch and ScanNet datasets (for depth-camera settings), as well as the Dur360BEV dataset (for LiDAR settings), demonstrate the effectiveness of our approach.

2512.08161 2026-02-17 cs.CV

Fourier-RWKV: A Multi-State Perception Network for Efficient Image Dehazing

Lirong Zheng, Yanshan Li, Rui Yu, Kaihao Zhang

详情
英文摘要

Image dehazing is crucial for reliable visual perception, yet it remains highly challenging under real-world non-uniform haze conditions. Although Transformer-based methods excel at capturing global context, their quadratic computational complexity hinders real-time deployment. To address this, we propose Fourier Receptance Weighted Key Value (Fourier-RWKV), a novel dehazing framework based on a Multi-State Perception paradigm. The model achieves comprehensive haze degradation modeling with linear complexity by synergistically integrating three distinct perceptual states: (1) Spatial-form Perception, realized through the Deformable Quad-directional Token Shift (DQ-Shift) operation, which dynamically adjusts receptive fields to accommodate local haze variations; (2) Frequency-domain Perception, implemented within the Fourier Mix block, which extends the core WKV attention mechanism of RWKV from the spatial domain to the Fourier domain, preserving the long-range dependencies essential for global haze estimation while mitigating spatial attenuation; (3) Semantic-relation Perception, facilitated by the Semantic Bridge Module (SBM), which utilizes Dynamic Semantic Kernel Fusion (DSK-Fusion) to precisely align encoder-decoder features and suppress artifacts. Extensive experiments on multiple benchmarks demonstrate that Fourier-RWKV delivers state-of-the-art performance across diverse haze scenarios while significantly reducing computational overhead, establishing a favorable trade-off between restoration quality and practical efficiency. Code is available at: https://github.com/Dilizlr/Fourier-RWKV.

2512.03474 2026-02-17 cs.CV

Procedural Mistake Detection via Action Effect Modeling

Wenliang Guo, Yujiang Pu, Yu Kong

Comments Accepted by ICLR 2026

详情
英文摘要

Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the \textbf{action effect}. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.

2511.10645 2026-02-17 cs.CL

ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Yesheng Liang, Haisheng Chen, Zihan Zhang, Song Han, Zhijian Liu

Comments ICLR 2026 | Project page: https://paroquant.z-lab.ai | GitHub: https://github.com/z-lab/paroquant

详情
英文摘要

Post-training quantization (PTQ) compresses the weights and activations of large language models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitudes across channels and narrow the dynamic range within each quantization group, effectively addressing the outlier issue. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. Under weight-only quantization, ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks, with less than 10% overhead. ParoQuant also matches the accuracy of state-of-the-art weight-activation quantization methods. This paves the way for more efficient and accurate deployment of reasoning LLMs.

2511.06781 2026-02-17 cs.LG cs.AI

On the Mechanisms of Collaborative Learning in VAE Recommenders

Tung-Long Vuong, Julien Monteil, Hien Dang, Volodymyr Vaskovych, Trung Le, Vu Nguyen

详情
英文摘要

Variational Autoencoders (VAEs) are a powerful alternative to matrix factorization for recommendation. A common technique in VAE-based collaborative filtering (CF) consists in applying binary input masking to user interaction vectors, which improves performance but remains underexplored theoretically. In this work, we analyze how collaboration arises in VAE-based CF and show it is governed by \emph{latent proximity}: we derive a latent sharing radius that informs when an SGD update on one user strictly reduces the loss on another user, with influence decaying as the latent Wasserstein distance increases. We further study the induced geometry: with clean inputs, VAE-based CF primarily exploits \emph{local} collaboration between input-similar users and under-utilizes \emph{global} collaboration between far-but-related users. We compare two mechanisms that encourage \emph{global} mixing and characterize their trade-offs: \ding{172} $β$-KL regularization directly tightens the information bottleneck, promoting posterior overlap but risking representational collapse if too large; \ding{173} input masking induces stochastic \emph{geometric} contractions and expansions, which can bring distant users onto the same latent neighborhood but also introduce neighborhood drift. To preserve user identity while enabling global consistency, we propose an anchor regularizer that aligns user posteriors with item embeddings, stabilizing users under masking and facilitating signal sharing across related items. Our analyses are validated on the Netflix, MovieLens-20M, and Million Song datasets. We also successfully deployed our proposed algorithm on an Amazon streaming platform following a successful online experiment.

2510.21842 2026-02-17 cs.CV cs.CR

Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?

Michael Aerni, Joshua Swanson, Kristina Nikolić, Florian Tramèr

Comments Accepted to ICLR 2026

详情
英文摘要

We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model aligned solely on text remains capable of generating unsafe images.

2510.16701 2026-02-17 cs.AI

An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems

Ni Zhang, Zhiguang Cao, Jianan Zhou, Cong Zhang, Yew-Soon Ong

Comments Accepted by iclr2026

详情
英文摘要

Complex vehicle routing problems (VRPs) remain a fundamental challenge, demanding substantial expert effort for intent interpretation and algorithm design. While large language models (LLMs) offer a promising path toward automation, current approaches still rely on external intervention, which restrict autonomy and often lead to execution errors and low solution feasibility. To address these challenges, we propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems, achieving full automation from problem instance to solution. AFL directly extracts knowledge from raw inputs and enables self-contained code generation without handcrafted modules or external solvers. To improve trustworthiness, AFL decomposes the overall pipeline into three manageable subtasks and employs four specialized agents whose coordinated interactions enforce cross-functional consistency and logical soundness. Extensive experiments on 60 complex VRPs, ranging from standard benchmarks to practical variants, validate the effectiveness and generality of our framework, showing comparable performance against meticulously designed algorithms. Notably, it substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility, achieving rates close to 100% on the evaluated benchmarks.

2510.14581 2026-02-17 cs.LG cs.AI

Model-agnostic Selective Labeling with Provable Statistical Guarantees

Huipeng Huang, Wenbo Liao, Huajun Xi, Hao Zeng, Mengchen Zhao, Hongxin Wei

详情
英文摘要

Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and human labels the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce \textbf{Conformal Labeling}, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal $p$-value for each test instance by comparing AI models' predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose $p$-values are below a data-dependent threshold, certifying AI models' predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.

2510.10854 2026-02-17 cs.LG cs.AI stat.ML

Discrete State Diffusion Models: A Sample Complexity Perspective

Aadithya Srikanth, Mudit Gaur, Vaneet Aggarwal

详情
英文摘要

Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume score estimation error bounds without studying sample complexity results. In this work, we present a principled theoretical framework for discrete-state diffusion, providing the first sample complexity bound of $\widetilde{\mathcal{O}}(ε^{-2})$. Our structured decomposition of the score estimation error into statistical, approximation, optimization, and clipping components offers critical insights into how discrete-state models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of discrete-state diffusion models.

2510.07976 2026-02-17 cs.CV

The impact of abstract and object tags on image privacy classification

Darya Baranouskaya, Andrea Cavallaro

Comments This work has been accepted to the ICASSP 2026

详情
英文摘要

Object tags denote concrete entities and are central to many computer vision tasks, whereas abstract tags capture higher-level information, which is relevant for tasks that require a contextual, potentially subjective scene understanding. Object and abstract tags extracted from images also facilitate interpretability. In this paper, we explore which type of tags is more suitable for the context-dependent and inherently subjective task of image privacy. While object tags are generally used for privacy classification, we show that abstract tags are more effective when the tag budget is limited. Conversely, when a larger number of tags per image is available, object-related information is as useful. We believe that these findings will guide future research in developing more accurate image privacy classifiers, informed by the role of tag types and quantity.

2510.07182 2026-02-17 cs.LG

Bridged Clustering: Semi-Supervised Sparse Bridging

Patrick Peixuan Ye, Chen Shani, Ellen Vitercik

详情
英文摘要

We introduce Bridged Clustering, a semi-supervised framework to learn predictors from any unpaired input $X$ and output $Y$ dataset. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.

2510.03777 2026-02-17 cs.AI cs.LG

GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time

Divij Handa, Mihir Parmar, Aswin RRV, Md Nayem Uddin, Hamid Palangi, Chitta Baral

Comments Accepted in ICLR 2026

详情
英文摘要

Repeated Sampling (RS) is a simple inference-time algorithm that has been shown to improve model performance on complex tasks. Although it is an effective way of scaling inference time, it often struggles to generate diverse solution candidates, frequently relying on the same underlying approach to solve the problem and thus producing redundant samples. To address this limitation, we propose a new inference algorithm, GuidedSampling, which decouples the exploration and generation phases during inference, increasing diversity of generated candidate solutions. The exploration phase identifies multiple concepts that can be utilized to solve the problem, while the generation phase applies a specific concept to provide final solution candidates. We first define the theoretical bounds of GuidedSampling and then empirically demonstrate that it improves the performance of base model at pass@50 by on an average ~21.6% across various benchmarks compared to RS. Furthermore, models trained on trajectories of GuidedSampling exhibit substantial performance improvements at pass@5 by on an average ~9.7%, compared to models trained on traditional RS. Additionally, models trained with GuidedSampling increases the average number of concepts per instance (1.67 -> 3.03), yielding a diverse set of candidates than traditional RS.

2509.20993 2026-02-17 cs.LG cs.DS stat.ML

Learning the Inverse Temperature of Ising Models under Hard Constraints using One Sample

Rohan Chauhan, Ioannis Panageas

Comments Accepted to Appear in ICLR '26

详情
英文摘要

We consider the problem of estimating inverse temperature parameter $β$ of an $n$-dimensional truncated Ising model using a single sample. Given a graph $G = (V,E)$ with $n$ vertices, a truncated Ising model is a probability distribution over the $n$-dimensional hypercube $\{-1,1\}^n$ where each configuration $\mathbfσ$ is constrained to lie in a truncation set $S \subseteq \{-1,1\}^n$ and has probability $\Pr(\mathbfσ) \propto \exp(β\mathbfσ^\top A\mathbfσ)$ with $A$ being the adjacency matrix of $G$. We adopt the recent setting of [Galanis et al. SODA'24], where the truncation set $S$ can be expressed as the set of satisfying assignments of a $k$-SAT formula. Given a single sample $\mathbfσ$ from a truncated Ising model, with inverse parameter $β^*$, underlying graph $G$ of bounded degree $Δ$ and $S$ being expressed as the set of satisfying assignments of a $k$-SAT formula, we design in nearly $O(n)$ time an estimator $\hatβ$ that is $O(Δ^3/\sqrt{n})$-consistent with the true parameter $β^*$ for $k \gtrsim \log(d^2k)Δ^3.$ Our estimator is based on the maximization of the pseudolikelihood, a notion that has received extensive analysis for various probabilistic models without [Chatterjee, Annals of Statistics '07] or with truncation [Galanis et al. SODA '24]. Our approach generalizes recent techniques from [Daskalakis et al. STOC '19, Galanis et al. SODA '24], to confront the more challenging setting of the truncated Ising model.

2509.18776 2026-02-17 cs.CL cs.AI cs.LG

AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Chen Liang, Zhaoqi Huang, Haofen Wang, Fu Chai, Chunying Yu, Huanhuan Wei, Zhengjie Liu, Yanpeng Li, Hongjun Wang, Ruifeng Luo, Xianzhong Zhao

Comments Accepted by Advanced Engineering Informatics. Code and data available at: https://github.com/ArchiAI-LAB/AECBench

Journal ref Advanced Engineering Informatics, Vol. 71, Article 104314 (2026)

详情
英文摘要

Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark features a five-level, cognition-oriented evaluation framework (i.e., Knowledge Memorization, Understanding, Reasoning, Calculation, and Application). Based on the framework, 23 representative evaluation tasks were defined. These tasks were derived from authentic AEC practice, with scope ranging from codes retrieval to specialized documents generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an "LLM-as-a-Judge" approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.

2509.17196 2026-02-17 cs.CL cs.AI

Evolution of Concepts in Language Model Pre-Training

Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu

Comments 35 pages, 35 figures. Accepted to ICLR 2026

详情
英文摘要

Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.

2509.15704 2026-02-17 cs.CV

Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

详情
英文摘要

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.

2509.10080 2026-02-17 cs.CV

BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals

Minsang Kong, Myeongjun Kim, Sang Gu Kang, Hejiu Lu, Yupeng Zhong, Sang Hun Lee

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems (under review)

详情
英文摘要

In autonomous driving, trajectory prediction is essential for safe and efficient navigation. While recent methods often rely on high-definition (HD) maps to provide structured environmental priors, such maps are costly to maintain, geographically limited, and unreliable in dynamic or unmapped scenarios. Directly leveraging raw sensor data in Bird's-Eye View (BEV) space offers greater flexibility, but BEV features are dense and unstructured, making agent-centric spatial reasoning challenging and computationally inefficient. To address this, we propose Bird's-Eye View Trajectory Prediction (BEVTraj), a map-free framework that employs deformable attention to adaptively aggregate task-relevant context from sparse locations in dense BEV features. We further introduce a Sparse Goal Candidate Proposal (SGCP) module that predicts a small set of realistic goals, enabling fully end-to-end multimodal forecasting without heuristic post-processing. Extensive experiments show that BEVTraj achieves performance comparable to state-of-the-art HD map-based methods while providing greater robustness and flexibility without relying on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.

2509.00787 2026-02-17 cs.CV

Image-to-Brain Signal Generation for Visual Prosthesis with CLIP Guided Multimodal Diffusion Models

Ganxi Xu, Zhao-Rong Lai, Yuting Tang, Yonghao Song, Guoxu Zhou, Boyu wang, Jian Zhu, Jinyi Long

详情
英文摘要

Visual prostheses hold great promise for restoring vision in blind individuals. While researchers have successfully utilized M/EEG signals to evoke visual perceptions during the brain decoding stage of visual prostheses, the complementary process of converting images into M/EEG signals in the brain encoding stage remains largely unexplored, hindering the formation of a complete functional pipeline. In this work, we present a novel image-to-brain signal framework that generates M/EEG from images by leveraging the diffusion transformer architecture enhanced with cross-attention mechanisms. Specifically, we employ a diffusion transformer (DiT) architecture based on denoising diffusion implicit models (DDIM) to achieve brain signal generation. To realize the goal of image-to-brain signal conversion, we use cross-attention mechanisms to align brain signal embeddings with CLIP image embeddings. Moreover, we leverage large language models (LLMs) to generate image captions, and concatenate the resulting CLIP text embeddings with CLIP image embeddings to form unified embeddings for cross-attention alignment, enabling our model to capture core semantic information. Moreover, to capture core semantic information, we use large language models (LLMs) to generate descriptive and semantically accurate captions for images. Furthermore, we introduce a learnable spatio-temporal position encoding that combines brain region embeddings with temporal embeddings to capture both spatial and temporal characteristics of brain signals. We evaluate the framework on two multimodal benchmark datasets (THINGS-EEG2 and THINGS-MEG) and demonstrate that it generates biologically plausible brain signals.

2508.13415 2026-02-17 cs.LG

MAVIS: Multi-Objective Alignment via Inference-Time Value-Guided Selection

Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil

Comments 32 pages, 7 figures

详情
英文摘要

Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives -- such as helpfulness, harmlessness, or humor. Many traditional methods for aligning outputs to user-specific preferences require fine-tuning models for each objective or for specific preference configurations, which is computationally expensive and inflexible. We introduce \textbf{MAVIS} -- \textit{Multi-Objective Alignment via Inference-Time Value-Guided Selection} -- a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model's weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model's output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that enables monotonic improvement of the KL-regularized policy. We show empirically that MAVIS achieves a superior pareto front compared to baselines which fine-tune per-objective models and combine them post hoc or train a single preference-conditioned value model for guidance. Our code is available at https://github.com/5-Jeremy/MAVIS/tree/main.