arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2797
2509.21464 2026-02-10 cs.CV cs.RO

Residual Vector Quantization For Communication-Efficient Multi-Agent Perception

Dereje Shenkut, B. V. K Vijaya Kumar

Comments Accepted at ICASSP 2026. 5 pages

详情
英文摘要

Multi-agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end-to-end method that compresses feature dimensions via a simple bottleneck network followed by multi-stage residual vector quantization (RVQ). This allows only per-pixel code indices to be transmitted, reducing payloads from 8192 bits per pixel (bpp) of uncompressed 32-bit float features to 6-30 bpp per agent with minimal accuracy loss. On DAIR-V2X real-world CP dataset, ReVQom achieves 273x compression at 30 bpp to 1365x compression at 6 bpp. At 18 bpp (455x), ReVQom matches or outperforms raw-feature CP, and at 6-12 bpp it enables ultra-low-bandwidth operation with graceful degradation. ReVQom allows efficient and accurate multi-agent collaborative perception with a step toward practical V2X deployment.

2509.21266 2026-02-10 cs.AI

Rethinking Explainable Disease Prediction: Synergizing Accuracy and Reliability via Reflective Cognitive Architecture

Zijian Shao, Haiyang Shen, Mugeng Liu, Gecheng Fu, Yaoqi Guo, Yanfeng Wang, Yun Ma

Comments under review

详情
英文摘要

In clinical decision-making, predictive models face a persistent trade-off: accurate models are often opaque "black boxes," while interpretable methods frequently lack predictive precision or statistical grounding. In this paper, we challenge this dichotomy, positing that high predictive accuracy and high-quality descriptive explanations are not competing goals but synergistic outcomes of a deep, first-hand understanding of data. We propose the Reflective Cognitive Architecture (RCA), a novel framework designed to enable Large Language Models (LLMs) to learn directly from tabular data through experience and reflection. RCA integrates two core mechanisms: an iterative rules optimization process that refines logical argumentation by learning from prediction errors, and a distribution-aware rules check that grounds this logic in global statistical evidence to ensure robustness. We evaluated RCA against over 20 baselines - ranging from traditional machine learning to advanced reasoning LLMs and agents - across diverse medical datasets, including a proprietary real-world Catheter-Related Thrombosis (CRT) cohort. Crucially, to demonstrate real-world scalability, we extended our evaluation to two large-scale datasets. The results confirm that RCA achieves state-of-the-art predictive performance and superior robustness to data noise while simultaneously generating clear, logical, and evidence-based explanatory statements, maintaining its efficacy even at scale. The code is available at https://github.com/ssssszj/RCA.

2509.21265 2026-02-10 cs.CV cs.AI

MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation

Xinyu Liu, Guolei Sun, Cheng Wang, Yixuan Yuan, Ender Konukoglu

Comments ICCV 2025

详情
英文摘要

High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.

2509.20102 2026-02-10 cs.AI

Steerable Adversarial Scenario Generation through Test-Time Preference Alignment

Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Sun, Haotian Shi, Wei Ma, Jian Sun

Comments ICLR 2026

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
英文摘要

Adversarial scenario generation is a cost-effective approach for safety assessment of autonomous driving systems. However, existing methods are often constrained to a single, fixed trade-off between competing objectives such as adversariality and realism. This yields behavior-specific models that cannot be steered at inference time, lacking the efficiency and flexibility to generate tailored scenarios for diverse training and testing requirements. In view of this, we reframe the task of adversarial scenario generation as a multi-objective preference alignment problem and introduce a new framework named \textbf{S}teerable \textbf{A}dversarial scenario \textbf{GE}nerator (SAGE). SAGE enables fine-grained test-time control over the trade-off between adversariality and realism without any retraining. We first propose hierarchical group-based preference optimization, a data-efficient offline alignment method that learns to balance competing objectives by decoupling hard feasibility constraints from soft preferences. Instead of training a fixed model, SAGE fine-tunes two experts on opposing preferences and constructs a continuous spectrum of policies at inference time by linearly interpolating their weights. We provide theoretical justification for this framework through the lens of linear mode connectivity. Extensive experiments demonstrate that SAGE not only generates scenarios with a superior balance of adversariality and realism but also enables more effective closed-loop training of driving policies. Project page: https://tongnie.github.io/SAGE/.

2509.16813 2026-02-10 cs.CL

Cognitive Linguistic Identity Fusion Score (CLIFS): A Scalable Cognition-Informed Approach to Quantifying Identity Fusion from Text

Devin R. Wright, Jisun An, Yong-Yeol Ahn

Comments Authors' accepted manuscript (postprint; camera-ready). To appear in the Proceedings of EMNLP 2025. Pagination/footer layout may differ from the Version of Record

详情
英文摘要

Quantifying identity fusion -- the psychological merging of self with another entity or abstract target (e.g., a religious group, political party, ideology, value, brand, belief, etc.) -- is vital for understanding a wide range of group-based human behaviors. We introduce the Cognitive Linguistic Identity Fusion Score (CLIFS), a novel metric that integrates cognitive linguistics with large language models (LLMs), which builds on implicit metaphor detection. Unlike traditional pictorial and verbal scales, which require controlled surveys or direct field contact, CLIFS delivers fully automated, scalable assessments while maintaining strong alignment with the established verbal measure. In benchmarks, CLIFS outperforms both existing automated approaches and human annotation. As a proof of concept, we apply CLIFS to violence risk assessment to demonstrate that it can improve violence risk assessment by more than 240%. Building on our identification of a new NLP task and early success, we underscore the need to develop larger, more diverse datasets that encompass additional fusion-target domains and cultural backgrounds to enhance generalizability and further advance this emerging area. CLIFS models and code are public at https://github.com/DevinW-sudo/CLIFS.

2509.16010 2026-02-10 cs.SD cs.AI eess.AS

Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation

Qi Wang, Shituo Ma, Guoxin Yu, Hanyang Peng, Yue Yu

Comments Accepted by ICASSP 2026

详情
英文摘要

Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server, thereby minimizing parameter exchange. To harness heterogeneity, our aggregation method, inspired by collaborative filtering, is introduced to create custom models for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.

2509.15180 2026-02-10 cs.RO

Parallel Simulation of Contact and Actuation for Soft Growing Robots

Yitian Gao, Lucas Chen, Priyanka Bhovad, Sicheng Wang, Zachary Kingston, Laura H. Blumenschein

Comments 37 pages, 12 figures, 2 tables. To appear in Soft Robotics

详情
英文摘要

Soft growing robots, commonly referred to as vine robots, have demonstrated remarkable ability to interact safely and robustly with unstructured and dynamic environments. It is therefore natural to exploit contact with the environment for planning and design optimization tasks. Previous research has focused on planning under contact for passively deforming robots with pre-formed bends. However, adding active steering to these soft growing robots is necessary for successful navigation in more complex environments. To this end, we develop a unified modeling framework that integrates vine robot growth, bending, actuation, and obstacle contact. We extend the beam moment model to include the effects of actuation on kinematics under growth and then use these models to develop a fast parallel simulation framework. We validate our model and simulator with real robot experiments. To showcase the capabilities of our framework, we apply our model in a design optimization task to find designs for vine robots navigating through cluttered environments, identifying designs that minimize the number of required actuators by exploiting environmental contacts. We show the robustness of the designs to environmental and manufacturing uncertainties. Finally, we fabricate an optimized design and successfully deploy it in an obstacle-rich environment.

2509.13213 2026-02-10 cs.LG

Density-Aware Farthest Point Sampling

Paolo Climaco, Jochen Garcke

Comments 14 pages, 2 figures

详情
英文摘要

We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set: a quantity we can estimate simply by considering the data features. We introduce ''Density-Aware Farthest Point Sampling'' (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.

2509.11433 2026-02-10 cs.RO

A Software-Only Post-Processor for Indexed Rotary Machining on GRBL-Based CNCs

Pedro Portugal, Damian D. Venghaus, Diego Lopez

Comments 13 pages, 10 figures, Python post-processor source code and web interface included

详情
英文摘要

Affordable desktop CNC routers are common in education, prototyping, and makerspaces, but most lack a rotary axis, limiting fabrication of rotationally symmetric or multi-sided parts. Existing solutions often require hardware retrofits, alternative controllers, or commercial CAM software, raising cost and complexity. This work presents a software-only framework for indexed rotary machining on GRBL-based CNCs. A custom post-processor converts planar toolpaths into discrete rotary steps, executed through a browser-based interface. While not equivalent to continuous 4-axis machining, the method enables practical rotary-axis fabrication using only standard, off-the-shelf mechanics, without firmware modification. By reducing technical and financial barriers, the framework expands access to multi-axis machining in classrooms, makerspaces, and small workshops, supporting hands-on learning and rapid prototyping.

2509.02986 2026-02-10 cs.RO

CTBC: Contact-Triggered Blind Climbing for Wheeled Bipedal Robots with Instruction Learning and Reinforcement Learning

Rankun Li, Hao Wang, Qi Li, Zhuo Han, Yifei Chu, Linqi Ye, Wende Xie, Wenlong Liao

详情
英文摘要

In recent years, wheeled bipedal robots have garnered significant attention due to their exceptional mobility on flat terrain. However, while stair climbing has been achieved in prior studies, these existing methods often suffer from a severe lack of versatility, making them difficult to adapt to varying hardware specifications or diverse complex terrains. To overcome these limitations, we propose a generalized Contact-Triggered Blind Climbing (CTBC) framework. Upon detecting wheel-obstacle contact, the framework triggers a leg-lifting motion integrated with a strongly-guided feedforward trajectory. This allows the robot to rapidly acquire agile climbing skills, significantly enhancing its capability to traverse unstructured environments. Distinct from previous approaches, CTBC demonstrates superior robustness and adaptability, having been validated across multiple wheeled bipedal platforms with different wheel radii and tire materials. Real-world experiments demonstrate that, relying solely on proprioceptive feedback, the proposed framework enables robots to achieve reliable and continuous climbing over obstacles well beyond their wheel radius.

2509.00036 2026-02-10 cs.LG cs.CV

A-FloPS: Accelerating Diffusion Models via Adaptive Flow Path Sampler

Cheng Jin, Zhenyu Xiao, Yuantao Gu

Comments published on AAAI26

详情
英文摘要

Diffusion models deliver state-of-the-art generative performance across diverse modalities but remain computationally expensive due to their inherently iterative sampling process. Existing training-free acceleration methods typically improve numerical solvers for the reverse-time ODE, yet their effectiveness is fundamentally constrained by the inefficiency of the underlying sampling trajectories. We propose A-FloPS (Adaptive Flow Path Sampler), a principled, training-free framework that reparameterizes the sampling trajectory of any pre-trained diffusion model into a flow-matching form and augments it with an adaptive velocity decomposition. The reparameterization analytically maps diffusion scores to flow-compatible velocities, yielding integration-friendly trajectories without retraining. The adaptive mechanism further factorizes the velocity field into a linear drift term and a residual component whose temporal variation is actively suppressed, restoring the accuracy benefits of high-order integration even in extremely low-NFE regimes. Extensive experiments on conditional image generation and text-to-image synthesis show that A-FloPS consistently outperforms state-of-the-art training-free samplers in both sample quality and efficiency. Notably, with as few as $5$ function evaluations, A-FloPS achieves substantially lower FID and generates sharper, more coherent images. The adaptive mechanism also improves native flow-based generative models, underscoring its generality. These results position A-FloPS as a versatile and effective solution for high-quality, low-latency generative modeling.

2508.17061 2026-02-10 cs.CV

REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework

Stefanos Pasios, Nikos Nikolaidis

Comments 8 pages

Journal ref IEEE Transactions on Games (2026)

详情
英文摘要

Photorealism is an important aspect of modern video games since it can shape player experience and impact immersion, narrative engagement, and visual fidelity. To achieve photorealism, beyond traditional rendering pipelines, generative models have been increasingly adopted as an effective approach for bridging the gap between the visual realism of synthetic and real worlds. However, under real-time constraints of video games, existing generative approaches continue to face a tradeoff between visual quality and runtime efficiency. In this work, we present a framework for enhancing the photorealism of rendered game frames using generative networks. We propose REGEN, which first employs a robust unpaired image-to-image translation model to generate semantically consistent photorealistic frames. These generated frames are then used to create a paired dataset, which transforms the problem to a simpler unpaired image-to-image translation. This enables training with a lightweight method, achieving real-time inference without compromising visual quality. We evaluate REGEN on Unreal Engine, showing, by employing the CMMD metric, that it achieves comparable or slightly improved visual quality compared to the robust method, while improving the frame rate by 12x. Additional experiments also validate that REGEN adheres to the semantic preservation of the initial robust image-to-image translation method and maintains temporal consistency. Code, pre-trained models, and demos for this work are available at: https://github.com/stefanos50/REGEN

2508.15476 2026-02-10 cs.CV cs.AI

LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion

Chengqi Dong, Fenghe Tang, Rongge Mao, Xinpei Gao, S. Kevin Zhou

Comments Accepted by ECAI 2025

Journal ref Frontiers in Artificial Intelligence and Applications, 413, 739-746 (2025)

详情
英文摘要

Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual multiscale that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The whole project code is in https://github.com/cq-dong/LGMSNet.

2508.14313 2026-02-10 cs.LG cs.AI

Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Zihan Dong, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas

详情
英文摘要

Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9 % on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.

2508.14042 2026-02-10 cs.RO

Sim-to-Real Dynamic Object Manipulation on Conveyor Systems via Optimization Path Shaping

Zhuoling Li, Jinrong Yang, Yong Zhao, Liangliang Ren, Xiaoyang Wu, Zhenhua Xu, Hengshuang Zhao

详情
英文摘要

Realizing generalizable dynamic object manipulation on conveyor systems is important for enhancing manufacturing efficiency, as it eliminates specialized engineering for different scenarios. To this end, imitation learning emerges as a promising paradigm, leveraging expert demonstrations to teach a policy manipulation skills. Although the generalization of an imitation learning policy can be improved by increasing demonstrations, demonstration collection is labor-intensive. Besides, public dynamic object manipulation data is scarce. In this work, we address this data scarcity problem via generating demonstrations in a simulator. A significant challenge of using simulated data lies in the appearance gap between simulated and real-world observations. To tackle this challenge, we propose Geometry-Enhanced Model (GEM), which employs our designed appearance noise annealing strategy to shape the policy optimization path, thereby prioritizing the geometry information in observations. Extensive experiments in simulated and real-world tasks demonstrate that GEM can generalize across environment backgrounds, robot embodiments, motion dynamics, and object geometries. Notably, GEM is deployed in a real canteen for tableware collection. Without test-scene data, GEM achieves a success rate of over 97% across more than 10,000 operations.

2508.13333 2026-02-10 cs.AI cs.NE math.OC

HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

Chentong Chen, Mengyuan Zhong, Ye Fan, Jialong Shi, Jianyong Sun

Comments Accepted at ICLR 2026

详情
英文摘要

LLM-based Automatic Heuristic Design (AHD) within Evolutionary Computation (EC) frameworks has shown promising results. However, its effectiveness is hindered by the use of static operators and the lack of knowledge accumulation mechanisms. We introduce HiFo-Prompt, a framework that guides LLMs with two synergistic prompting strategies: Foresight and Hindsight. Foresight-based prompts adaptively steer the search based on population dynamics, managing the exploration-exploitation trade-off. In addition, hindsight-based prompts mimic human expertise by distilling successful heuristics from past generations into fundamental, reusable design principles. This dual mechanism transforms transient discoveries into a persistent knowledge base, enabling the LLM to learn from its own experience. Empirical results demonstrate that HiFo-Prompt significantly outperforms state-of-the-art LLM-based AHD methods, generating higher-quality heuristics while achieving substantially faster convergence and superior query efficiency. Our code is available at https://github.com/Challenger-XJTU/HiFo-Prompt.

2508.08266 2026-02-10 cs.LG cs.AI cs.CL cs.CV cs.IR

Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants

Ryan Mioduski

Journal ref Journal of Spatial Information Science, No. 31 (2025), pp. 57-96

详情
英文摘要

Virginia's seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) in converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695-1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures-o-series, GPT-4-class, and GPT-3.5-were tested under two paradigms: direct-to-coordinate and tool-augmented chain-of-thought invoking external geocoding APIs. Results were compared against a GIS analyst baseline, Stanford NER geoparser, Mordecai-3 neural geoparser, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced errors to 19.2 km (median 12.2 km) at minimal additional cost (~USD 0.20 per grant), outperforming the median LLM by 48.7%. A patentee-name redaction ablation slightly increased error (~7%), showing reliance on textual landmark and adjacency descriptions rather than memorization. The cost-effective gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark. External geocoding tools offer no measurable benefit in this evaluation. These findings demonstrate LLMs' potential for scalable, accurate, cost-effective historical georeferencing.

2508.08107 2026-02-10 cs.CV cs.AI

Hyperspectral Imaging

Danfeng Hong, Chenyu Li, Naoto Yokoya, Bing Zhang, Xiuping Jia, Antonio Plaza, Paolo Gamba, Jon Atli Benediktsson, Jocelyn Chanussot

Comments Accepted by Nature Reviews Methods Primers

详情
英文摘要

Hyperspectral imaging (HSI) is an advanced sensing modality that simultaneously captures spatial and spectral information, enabling non-invasive, label-free analysis of material, chemical, and biological properties. This Primer presents a comprehensive overview of HSI, from the underlying physical principles and sensor architectures to key steps in data acquisition, calibration, and correction. We summarize common data structures and highlight classical and modern analysis methods, including dimensionality reduction, classification, spectral unmixing, and AI-driven techniques such as deep learning. Representative applications across Earth observation, precision agriculture, biomedicine, industrial inspection, cultural heritage, and security are also discussed, emphasizing HSI's ability to uncover sub-visual features for advanced monitoring, diagnostics, and decision-making. Persistent challenges, such as hardware trade-offs, acquisition variability, and the complexity of high-dimensional data, are examined alongside emerging solutions, including computational imaging, physics-informed modeling, cross-modal fusion, and self-supervised learning. Best practices for dataset sharing, reproducibility, and metadata documentation are further highlighted to support transparency and reuse. Looking ahead, we explore future directions toward scalable, real-time, and embedded HSI systems, driven by sensor miniaturization, self-supervised learning, and foundation models. As HSI evolves into a general-purpose, cross-disciplinary platform, it holds promise for transformative applications in science, technology, and society.

2508.03442 2026-02-10 cs.CV

MAMBO-G: Magnitude-Aware Mitigation for Boosted Guidance

Shangwen Zhu, Qianyu Peng, Zhilei Shu, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Andy Zheng, Xinyu Cui, Jian Zhao, Ruili Feng, Fan Cheng

详情
英文摘要

High-fidelity text-to-image and text-to-video generation typically relies on Classifier-Free Guidance (CFG), but achieving optimal results often demands computationally expensive sampling schedules. In this work, we propose MAMBO-G, a training-free acceleration framework that significantly reduces computational cost by dynamically optimizing guidance magnitudes. We observe that standard CFG schedules are inefficient, applying disproportionately large updates in early steps that hinder convergence speed. MAMBO-G mitigates this by modulating the guidance scale based on the update-to-prediction magnitude ratio, effectively stabilizing the trajectory and enabling rapid convergence. This efficiency is particularly vital for resource-intensive tasks like video generation. Our method serves as a universal plug-and-play accelerator, achieving up to 3x speedup on Stable Diffusion v3.5 (SD3.5) and 4x on Lumina. Most notably, MAMBO-G accelerates the 14B-parameter Wan2.1 video model by 2x while preserving visual fidelity, offering a practical solution for efficient large-scale video synthesis. Our implementation follows a mainstream open-source diffusion framework and is plug-and-play with existing pipelines.

2507.23601 2026-02-10 cs.CV

Mamba-based Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

Xin Li, Keren Fu, Qijun Zhao

Comments 13 pages, 12 figures

详情
英文摘要

Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearances for motion perception. However, the high foreground-background similarity in VCOD limits the discriminability of such features (e.g. color and texture). Recent studies demonstrate that frequency features can not only compensate for appearance limitations, but also perceive motion through dynamic variations in spectral energy. Meanwhile, the emerging state space model called Mamba enables efficient motion perception in frame sequences with its linear-time long-sequence modeling capability. Motivated by this, we propose Vcamba, a visual camouflage Mamba based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, by analyzing the spatial representations of frequency components, we reveal a structural evolution pattern that emerges from the ordered superposition of components. Based on this observation, we propose a unique frequency-domain sequential scanning (FSS) strategy to unfold the spectrum. Utilizing FSS, the adaptive frequency enhancement (AFE) module employs Mamba to model the causal dependencies within sequences, enabling effective frequency learning. Furthermore, we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features into unified motion representation. Experiments show that Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming its superiority. Code is available at: https://github.com/BoydeLi/Vcamba.

2507.19595 2026-02-10 cs.CL cs.AI

Efficient Attention Mechanisms for Large Language Models: A Survey

Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

Comments work in progress

详情
英文摘要

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fastweight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into largescale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.

2507.17307 2026-02-10 cs.LG cs.AI cs.CL

R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang

详情
英文摘要

Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency--accuracy trade-offs that can be tailored to diverse computational budgets without retraining.

2507.12709 2026-02-10 cs.LG

From SGD to Spectra: A Theory of Neural Network Weight Dynamics

Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula

详情
英文摘要

Deep neural networks have revolutionized machine learning, yet their training dynamics remain theoretically unclear-we develop a continuous-time, matrix-valued stochastic differential equation (SDE) framework that rigorously connects the microscopic dynamics of SGD to the macroscopic evolution of singular-value spectra in weight matrices. We derive exact SDEs showing that squared singular values follow Dyson Brownian motion with eigenvalue repulsion, and characterize stationary distributions as gamma-type densities with power-law tails, providing the first theoretical explanation for the empirically observed 'bulk+tail' spectral structure in trained networks. Through controlled experiments on transformer and MLP architectures, we validate our theoretical predictions and demonstrate quantitative agreement between SDE-based forecasts and observed spectral evolution, providing a rigorous foundation for understanding why deep learning works.

2507.09202 2026-02-10 cs.LG cs.AI physics.ao-ph

XiChen: A global weather observation-to-forecast machine learning system via four-dimensional variational gradient-guided flexible assimilation

Wuxin Wang, Weicheng Ni, Lilan Huang, Tao Hao, Ben Fei, Shuo Ma, Taikang Yuan, Yanlai Zhao, Kefeng Deng, Xiaoyong Li, Hongze Leng, Boheng Duan, Lei Bai, Weimin Zhang, Junqiang Song, Kaijun Ren

Comments 24 pages, 6 figures

详情
英文摘要

Machine Learning (ML) has shown great promise in revolutionizing weather forecasting, yet most ML systems still rely on initial conditions generated by Numerical Weather Prediction (NWP) systems. End-to-end ML models aim to eliminate this dependency, but they often rely on observation-specific encoders and require redesign or retraining when observation sources change, thereby limiting their operational robustness. Here, we introduce XiChen, a global weather observation-to-forecast ML system via four-dimensional variational (4DVar) gradient-guided flexible assimilation. We demonstrate that the gradient of the 4DVar cost function serves as a physically grounded interface that maps heterogeneous observations into a common state space. This novel formulation enables XiChen to flexibly assimilate diverse conventional and raw satellite observations while preserving physical consistency. Experiments show that the system achieves forecasting metrics competitive with operational NWP systems. This work provides a practical and physically consistent route toward operational ML-based global weather forecasting systems with heterogeneous and evolving observations.

2507.08249 2026-02-10 cs.AI cs.CR

Giving AI Agents Access to Cryptocurrency and Smart Contracts Creates New Vectors of AI Harm

Bill Marino, Ari Juels

详情
英文摘要

There is growing interest in giving AI agents access to cryptocurrencies as well as to the smart contracts that transact them. But doing so, this position paper argues, could lead to formidable new vectors of AI harm. To support this argument, we first examine the unique properties of cryptocurrencies and smart contracts that could give rise to these new vectors of AI harm. Next, we describe each of these new vectors of AI harm in detail, providing a first-of-its-kind taxonomy. Finally, we conclude with a call for more technical research aimed at preventing and mitigating these new vectors of AI , thereby making it safer to endow AI agents with cryptocurrencies and smart contracts.

2507.03006 2026-02-10 cs.CV cs.LG

Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification

Faisal Ahmed

Comments 18 pages, 12 figures

详情
英文摘要

This work presents a comparative evaluation of two fundamentally different feature extraction paradigms--Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA)--for medical image classification, with a focus on retinal fundus imagery. HOG captures local structural information by modeling gradient orientation distributions within spatial regions, effectively encoding texture and edge patterns. In contrast, TDA, implemented through cubical persistent homology, extracts global topological descriptors that characterize shape, connectivity, and intensity-based structure across images. We evaluate both approaches on the publicly available APTOS retinal fundus dataset for two classification tasks: binary classification (normal vs. diabetic retinopathy (DR)) and five-class DR severity grading. From each image, 26,244 HOG features and 800 TDA features are extracted and independently used to train seven classical machine learning models, including logistic regression, random forest, XGBoost, support vector machines, decision trees, k-nearest neighbors, and Extra Trees, using 10-fold cross-validation. Experimental results show that XGBoost achieves the best performance across both feature types. For binary classification, accuracies of 94.29% (HOG) and 94.18% (TDA) are obtained, while multi-class classification yields accuracies of 74.41% and 74.69%, respectively. These results demonstrate that gradient-based and topological features provide complementary representations of retinal image structure and highlight the potential of integrating both approaches for interpretable and robust medical image classification.

2506.21910 2026-02-10 cs.CL

AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

Ernie Chang, Yang Li, Patrick Huber, Vish Vogeti, David Kant, Yangyang Shi, Vikas Chandra

Comments Accepted at ACL 2025

详情
英文摘要

In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities as the relationship between data and tasks is difficult to be modeled. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrated on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.

2506.21550 2026-02-10 cs.LG cs.AI

mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Xiaona Zhou, Constantin Brif, Ismini Lourentzou

Journal ref Transactions on Machine Learning Research, 2026

详情
英文摘要

Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, the presence of cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remain far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at https://plan-lab.github.io/mtsbench to encourage future research.

2506.13564 2026-02-10 cs.CV

State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

Geewook Kim, Minjoon Seo

Comments AAAI 2026 (Oral). Project page: https://github.com/naver-ai/mambamia

详情
英文摘要

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging hour-long video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing overall token budget. Notably, replacing our state-space model with conventional modules results in substantial performance degradation, highlighting the advantages of the proposed state-space modeling for effectively compressing multi-frame video information. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.

2506.10371 2026-02-10 cs.CV cs.LG

Revisiting Transformers with Insights from Image Filtering and Boosting

Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen

详情
英文摘要

The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks as well as better long sequence understanding.