arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 3188
专题追踪
2509.21029 2026-03-03 cs.LG

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu

Comments Accepted by CVPR 2026

详情
英文摘要

The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.

2509.20323 2026-03-03 cs.LG math.OC stat.ML

A Recovery Guarantee for Sparse Neural Networks

Sara Fridovich-Keil, Mert Pilanci

Comments ICLR 2026

详情
英文摘要

We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning. Code is available at https://github.com/voilalab/MLP-IHT.

2509.19877 2026-03-03 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph physics.comp-ph

Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials

Shi Yin, Zujian Dai, Xinyang Pan, Lixin He

详情
英文摘要

Deep learning methods for electronic-structure Hamiltonian prediction has offered significant computational efficiency advantages over traditional DFT methods, yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to the generalization performance. In this work, we contribute on both the methodology and dataset sides to advance universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose NextHAM, a neural E(3)-symmetry and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed by the initial charge density of DFT, as informative descriptors of neural regression model in the input level and initial estimates of the target Hamiltonian in the output level, so that the regression model directly predicts the correction terms to the target ground truths, thereby significantly simplifying the input-output mapping for learning. Second, we present a neural Transformer architecture with strict E(3)-Symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy performance of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of "ghost states" caused by the large condition number of the overlap matrix. On the dataset side, we curate a high-quality broad-coverage large benchmark, namely Materials-HAM-SOC, comprising 17,000 material structures spanning 68 elements from six rows of the periodic table and explicitly incorporating SOC effects. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.

2509.18487 2026-03-03 cs.CL

Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference

Ben Finkelshtein, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen White

详情
英文摘要

Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in their interaction with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes contrasting homophilic and heterophilic graphs; feature characteristics involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze dependencies by methodically truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation is able to flexibly adapt its reliance between structure, features, or labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.

2509.17050 2026-03-03 cs.CV

Geodesic Prototype Matching via Diffusion Maps for Interpretable Fine-Grained Recognition

Junhao Jia, Yunyou Liu, Yifei Sun, Huangwei Chen, Feiwei Qin, Changmiao Wang, Yong Peng

Comments The paper has been accepted by ICASSP 2026

详情
英文摘要

Nonlinear manifolds are pervasive in deep visual features, where Euclidean distances can misrepresent true similarity. This mismatch is particularly detrimental to prototype-based interpretable fine-grained recognition, where even subtle semantic distinctions are crucial. To mitigate this issue, this work presents a novel paradigm for prototype-based recognition by grounding similarity in the intrinsic geometry of deep features. Concretely, we distill the latent manifold structure of each class into a diffusion space and, critically, devise a differentiable Nyström interpolation to make this geometry accessible to both unseen samples and learnable prototypes. To maintain efficiency, we employ compact per-class landmark sets with periodic updates. This strategy keeps the embedding synchronized with the evolving backbone, enabling fast inference at scale. Comprehensive experiments on two benchmark datasets demonstrate that our GeoProto yields prototypes focusing on semantically corresponding parts, significantly outperforming Euclidean prototype networks.

2509.16557 2026-03-03 cs.CV cs.ET cs.HC cs.LG

Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose

Muhammad Hamza, Danish Hamid, Muhammad Tahir Akram

Comments 21 pages, 8 figures, 7 tables. Preprint of a manuscript to appear in CCF Trans. Pervasive Comp. Interact. (2026)

详情
英文摘要

Human-Object Interaction Recognition (HOIR) and user identification play a crucial role in advancing augmented reality (AR)-based personalized assistive technologies. These systems are increasingly being deployed in high-stakes, human-centric environments such as aircraft cockpits, aerospace maintenance, and surgical procedures. This research introduces I2S (Interact2Sign), a multi stage framework designed for unobtrusive user identification through human object interaction recognition, leveraging 3D hand pose analysis in egocentric videos. I2S utilizes handcrafted features extracted from 3D hand poses and per forms sequential feature augmentation: first identifying the object class, followed by HOI recognition, and ultimately, user identification. A comprehensive feature extraction and description process was carried out for 3D hand poses, organizing the extracted features into semantically meaningful categories: Spatial, Frequency, Kinematic, Orientation, and a novel descriptor introduced in this work, the Inter-Hand Spatial Envelope (IHSE). Extensive ablation studies were conducted to determine the most effective combination of features. The optimal configuration achieved an impressive average F1-score of 97.52% for user identification, evaluated on a bimanual object manipulation dataset derived from the ARCTIC and H2O datasets. I2S demonstrates state-of-the-art performance while maintaining a lightweight model size of under 4 MB and a fast inference time of 0.1 seconds. These characteristics make the proposed framework highly suitable for real-time, on-device authentication in security-critical, AR-based systems.

2509.15888 2026-03-03 cs.CL cs.AI

Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, Yuguang Fang

Comments Accepted by NeurIPS'25

详情
英文摘要

Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVDecode), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVDecode is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVDecode thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models. Code is available at https://github.com/dl-m9/SVDecode.

2509.14965 2026-03-03 cs.CV

Brain-HGCN: A Hyperbolic Graph Convolutional Network for Brain Functional Network Analysis

Junhao Jia, Yunyou Liu, Cheng Yang, Yifei Sun, Feiwei Qin, Changmiao Wang, Yong Peng

Comments The paper has been accepted by ICASSP 2026

详情
英文摘要

Functional magnetic resonance imaging (fMRI) reveals complex brain functional networks with hierarchical topologies crucial for cognitive processing. Standard Euclidean Graph Neural Networks (GNNs) often struggle to represent these hierarchical structures without high distortion due to inherent spatial constraints. We propose Brain-HGCN, a geometric deep learning framework based on hyperbolic geometry, which leverages negatively curved space to model brain network hierarchy with high fidelity. Grounded in the Lorentz model, our framework employs a novel hyperbolic graph attention layer with a signed aggregation mechanism to distinctly process excitatory and inhibitory connections. Furthermore, we learn robust graph-level representations via a geometrically principled Fréchet mean for graph readout. Experiments on two large-scale fMRI datasets for psychiatric disorder classification demonstrate that Brain-HGCN significantly outperforms state-of-the-art Euclidean baselines. This work highlights the potential of hyperbolic GNNs in computational psychiatry by pioneering a new geometric paradigm for fMRI analysis.

2509.13789 2026-03-03 cs.CV cs.AI

BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia

详情
英文摘要

Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 6$\times$ speedup with comparable visual quality.

2509.12813 2026-03-03 cs.RO cs.SY eess.SY

Bridging Perception and Planning: Towards End-to-End Planning for Signal Temporal Logic Tasks

Bowen Ye, Junyue Huang, Yang Liu, Xiaozhen Qiao, Xiang Yin

详情
英文摘要

We investigate the task and motion planning problem for Signal Temporal Logic (STL) specifications in robotics. Existing STL methods rely on pre-defined maps or mobility representations, which are ineffective in unstructured real-world environments. We propose the \emph{Structured-MoE STL Planner} (\textbf{S-MSP}), a differentiable framework that maps synchronized multi-view camera observations and an STL specification directly to a feasible trajectory. S-MSP integrates STL constraints within a unified pipeline, trained with a composite loss that combines trajectory reconstruction and STL robustness. A \emph{structure-aware} Mixture-of-Experts (MoE) model enables horizon-aware specialization by projecting sub-tasks into temporally anchored embeddings. We evaluate S-MSP using a high-fidelity simulation of factory-logistics scenarios with temporally constrained tasks. Experiments show that S-MSP outperforms single-expert baselines in STL satisfaction and trajectory feasibility. A rule-based \emph{safety filter} at inference improves physical executability without compromising logical correctness, showcasing the practicality of the approach.

2509.12282 2026-03-03 cs.AI cs.LG

AISSISTANT: Human-AI Collaborative Review and Perspective Research Workflows in Data Science

Sasi Kiran Gaddipati, Farhana Keya, Gollam Rabby, Sören Auer

详情
英文摘要

High-quality scientific review and perspective papers require substantial time and effort, limiting researchers' ability to synthesize emerging knowledge. While Large Language Models (LLMs) leverage AI Scientists for scientific workflows, existing frameworks focus primarily on autonomous workflows with very limited human intervention. We introduce AIssistant, the first open-source agentic framework for Human--AI collaborative generation of scientific perspectives and review research in data science. AIssistant employs specialized LLM-driven agents augmented with external scholarly tools and allows human intervention throughout the workflow. The framework consists of two main multi-agent systems: Research Workflow with seven agents and a Paper Writing Workflow with eight agents. We conducted a comprehensive evaluation with both human expert reviewers and LLM-based assessment following NeurIPS standards. Our experiments show that OpenAI o1 achieves the highest quality scores on chain-of-thought prompting with augmented Literature Search tools. We also conducted a Human--AI interaction survey with results showing a 65.7\% time savings. We believe that our work establishes a baseline for Human--AI collaborative scientific workflow for review and perspective research in data science, demonstrating that agent-augmented pipelines substantially reduce effort while maintaining research integrity through strategic human oversight.

2509.11617 2026-03-03 cs.RO

AssemMate: Graph-Based LLM for Robotic Assembly Assistance

Qi Zheng, Chaoran Zhang, Zijian Liang, EnTe Lin, Shubo Cui, Qinghongbing Xie, Zhaobo Xu, Long Zeng

详情
英文摘要

Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention. It requires the injection of domain-specific knowledge to guide the assembly process through natural language interaction with humans. Despite some progress, existing methods represent knowledge in the form of natural language text. Due to the long context and redundant content, they struggle to meet the robots' requirements for real-time and precise reasoning. In order to bridge this gap, we present AssemMate, which utilizes the graph\textemdash a concise and accurate form of knowledge representation\textemdash as input. This graph-based LLM enables knowledge graph question answering (KGQA), supporting human-robot interaction and assembly task planning for specific products. Beyond interactive QA, AssemMate also supports sensing stacked scenes and executing grasping to assist with assembly. Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with LLM's representation, enabling the LLM to understand graph information. In addition, a vision-enhanced strategy is employed to address stacked scenes in grasping. Through training and evaluation, AssemMate outperforms existing methods, achieving 6.4\% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs. And our approach further demonstrates superiority through robotic grasping experiments in both simulated and real-world settings. More details can be found on the project page: https://github.com/cristina304/AssemMate.git

2509.10980 2026-03-03 cs.CV

TrueSkin: Towards Fair and Accurate Skin Tone Recognition and Generation

Haoming Lu

Comments The dataset is available for download at https://drive.google.com/file/d/1_ndw5uyY4h4DLL5iGTL4bVDKdE_g_H4B/view?usp=sharing

详情
英文摘要

Skin tone recognition and generation play important roles in model fairness, healthcare, and generative AI, yet they remain challenging due to the lack of comprehensive datasets and robust methodologies. Compared to other human image analysis tasks, state-of-the-art large multimodal models (LMMs) and image generation models struggle to recognize and synthesize skin tones accurately. To address this, we introduce TrueSkin, a dataset with 7299 images systematically categorized into 6 classes, collected under diverse lighting conditions, camera angles, and capture settings. Using TrueSkin, we benchmark existing recognition and generation approaches, revealing substantial biases: LMMs tend to misclassify intermediate skin tones as lighter ones, whereas generative models struggle to accurately produce specified skin tones when influenced by inherent biases from unrelated attributes in the prompts, such as hairstyle or environmental context. We further demonstrate that training a recognition model on TrueSkin improves classification accuracy by more than 20\% compared to LMMs and conventional approaches, and fine-tuning with TrueSkin significantly improves skin tone fidelity in image generation models. Our findings highlight the need for comprehensive datasets like TrueSkin, which not only serves as a benchmark for evaluating existing models but also provides a valuable training resource to enhance fairness and accuracy in skin tone recognition and generation tasks.

2509.08628 2026-03-03 cs.CV

LADB: Latent Aligned Diffusion Bridges for Semi-Supervised Domain Translation

Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Dong Wang, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers

Journal ref Pattern Recognition. DAGM GCPR 2025

详情
英文摘要

Diffusion models excel at generating high-quality outputs but face challenges in data-scarce domains, where exhaustive retraining or costly paired data are often required. To address these limitations, we propose Latent Aligned Diffusion Bridges (LADB), a semi-supervised framework for sample-to-sample translation that effectively bridges domain gaps using partially paired data. By aligning source and target distributions within a shared latent space, LADB seamlessly integrates pretrained source-domain diffusion models with a target-domain Latent Aligned Diffusion Model (LADM), trained on partially paired latent representations. This approach enables deterministic domain mapping without the need for full supervision. Compared to unpaired methods, which often lack controllability, and fully paired approaches that require large, domain-specific datasets, LADB strikes a balance between fidelity and diversity by leveraging a mixture of paired and unpaired latent-target couplings. Our experimental results demonstrate superior performance in depth-to-image translation under partial supervision. Furthermore, we extend LADB to handle multi-source translation (from depth maps and segmentation masks) and multi-target translation in a class-conditioned style transfer task, showcasing its versatility in handling diverse and heterogeneous use cases. Ultimately, we present LADB as a scalable and versatile solution for real-world domain translation, particularly in scenarios where data annotation is costly or incomplete.

2509.07413 2026-03-03 cs.RO

DA-VPC: Disturbance-Aware Visual Predictive Control Scheme of Docking Maneuvers for Autonomous Trolley Collection

Yuhan Pang, Bingyi Xia, Zhe Zhang, Zhirui Sun, Peijia Xie, Bike Zhu, Wenjun Xu, Jiankun Wang

详情
英文摘要

Service robots have demonstrated significant potential for autonomous trolley collection and redistribution in public spaces like airports or warehouses to improve efficiency and reduce cost. Usually, a fully autonomous system for the collection and transportation of multiple trolleys is based on a Leader-Follower formation of mobile manipulators, where reliable docking maneuvers of the mobile base are essential to align trolleys into organized queues. However, developing a vision-based robotic docking system faces significant challenges: high precision requirements, environmental disturbances, and inherent robot constraints. To address these challenges, we propose a Disturbance-Aware Visual Predictive Control (DA-VPC) scheme that incorporates active infrared markers for robust feature extraction across diverse lighting conditions. This framework explicitly models nonholonomic kinematics and visibility constraints for image-based visual servoing (IBVS), solving the predictive control problem through optimization. It is augmented with an extended state observer (ESO) designed to counteract disturbances during trolley pushing, ensuring precise and stable docking. Experimental results across diverse environments demonstrate the robustness of this system, with quantitative evaluations confirming high docking accuracy.

2509.04932 2026-03-03 cs.CV

UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

Haowang Cui, Rui Chen, Jiaze Wang, Tao Guo, Zheng Qin

详情
英文摘要

The task of synthesizing novel views from a single image is highly ill-posed due to multiple explanations for unobserved areas. Most current methods tend to generate unseen regions from ambiguity priors and interpolation near input views, which often lead to severe distortions. To address this limitation, we propose a novel model dubbed as UniView, which can leverage reference images from a similar object to provide strong prior information during view synthesis. More specifically, we construct a retrieval and augmentation system and employ a multimodal large language model (MLLM) to assist in selecting reference images that meet our requirements. Additionally, a plug-and-play adapter module with multi-level isolation layers is introduced to dynamically generate reference features for the target views. Moreover, in order to preserve the details of an original input image, we design a decoupled triple attention mechanism, which can effectively align and integrate multi-branch features into the synthesis process. Extensive experiments have demonstrated that our UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on the challenging datasets.

2509.03516 2026-03-03 cs.CV

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Fuli Feng

Comments Accepted to ICLR 2026. Project Page: https://t2i-corebench.github.io/

详情
英文摘要

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, which thus correspond to two core capabilities: \textbf{\textit{composition}} and \textbf{\textit{reasoning}}. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in evaluation. They not only fail to provide comprehensive coverage across and within both capabilities, but also largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose \textbf{\textsc{T2I-CoReBench}}, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (\textit{instance}, \textit{attribute}, and \textit{relation}) and reasoning around the philosophical framework of inference (\textit{deductive}, \textit{inductive}, and \textit{abductive}), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent real-world complexities, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we also pair each evaluation prompt with a checklist that specifies individual \textit{yes/no} questions to assess each intended element independently. In statistics, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 38 current T2I models reveal that their composition capability still remains limited in high compositional scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.

2509.03214 2026-03-03 cs.CV

RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min

Comments The paper has been accepted by BIBM 2025

Journal ref 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2025, pp. 2301-2308

详情
英文摘要

Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation deterministically condenses each subject's activation, connectivity, age, and sex into reproducible text tokens; (ii) Hybrid frequency-spatial encoder fuses a hierarchical wavelet-mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) Adaptive semantic alignment module embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at https://github.com/BeistMedAI/RTGMFF.

2508.19754 2026-03-03 cs.CV

FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers

Yue Wu, Xuanhong Chen, Yufan Wu, Wen Li, Yuxi Lu, Kairui Feng

详情
英文摘要

Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. The core of FastAvatar is a Large Gaussian Reconstruction Transformer (LGRT) featuring three key designs: First, a 3DGS transformer aggregating multi-frame cues while injecting initial 3D prompt to predict the corresponding registered canonical 3DGS representations; Second, multi-granular guidance encoding (camera pose, expression coefficient, head pose) mitigating animation-induced misalignment for variable-length inputs; Third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., improving quality with more observations without wasting input data as in previous works. This yields a quality-speed-tunable paradigm for highly usable 3D avatar modeling. Extensive experiments show that FastAvatar has a higher quality and highly competitive speed compared to existing methods.

2508.18672 2026-03-03 cs.LG cs.AI cs.CL

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

Comments Accepted as an oral at ICLR 2026

详情
英文摘要

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.

2508.12811 2026-03-03 cs.CV cs.AI cs.LG

Next Visual Granularity Generation

Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy

Comments ICLR 2026

详情
英文摘要

We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different level of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 $\rightarrow$ 3.03, 2.57 $\rightarrow$ 2.44, 2.09 $\rightarrow$ 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models are released at https://yikai-wang.github.io/nvg.

2508.11999 2026-03-03 cs.CV cs.AI cs.IR cs.LG

MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

Daoze Zhang, Chenghan Fu, Zhanheng Nie, Jianyu Liu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng

Comments Accepted by WSDM 2026 (oral). 11 pages, 9 figures

详情
英文摘要

With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding. The data of our MBE benchmark is given in https://huggingface.co/datasets/Daoze/MM-Bench-E-Commerce.

2508.11484 2026-03-03 cs.CV

CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, Xinyuan Chen

Comments ICLR2026 Accept; Project Page:https://uknowsth.github.io/CineTrans/

详情
英文摘要

Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.

2508.11428 2026-03-03 cs.CV

ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang

Comments Accepted for publication in 2026 IEEE International Conference on Robotics and Automation (ICRA)

详情
英文摘要

Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent's planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.

2508.09003 2026-03-03 cs.RO cs.SY eess.SY

Large Scale Robotic Material Handling: Learning, Planning, and Control

Filippo A. Spinelli, Yifan Zhai, Fang Nan, Pascal Egli, Julian Nubert, Thilo Bleumer, Lukas Miller, Ferdinand Hofmann, Marco Hutter

Comments Final version published in IEEE Transactions on Field Robotics. It includes additional experiments and comparisons with classical methods

详情
英文摘要

Bulk material handling involves the efficient and precise moving of large quantities of materials, a core operation in many industries, including cargo ship unloading, waste sorting, construction, and demolition. These repetitive, labor-intensive, and safety-critical operations are typically performed using large hydraulic material handlers equipped with underactuated grippers. In this work, we present a comprehensive framework for the autonomous execution of large-scale material handling tasks. The system integrates specialized modules for environment perception, pile attack point selection, path planning, and motion control. The main contributions of this work are two reinforcement learning-based modules: an attack point planner that selects optimal grasping locations on the material pile to maximize removal efficiency and minimize the number of scoops, and a robust trajectory following controller that addresses the precision and safety challenges associated with underactuated grippers in movement, while utilizing their free-swinging nature to release material through dynamic throwing. We validate our framework through real-world experiments on a 40 t material handler in a representative worksite, focusing on two key tasks: high-throughput bulk pile management and high-precision truck loading. Comparative evaluations against human operators demonstrate the system's effectiveness in terms of precision, repeatability, and operational safety. To the best of our knowledge, this is the first complete automation of material handling tasks on a full scale.

2508.03926 2026-03-03 cs.LG cs.NA math.DS math.NA

Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning

Hector Vargas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos

Comments 43 pages (16 pages of Appendix), 16 figures (11 in Appendix)

详情
英文摘要

Bridging the microscopic and macroscopic modelling scales in crowd dynamics constitutes an open challenge for systematic numerical analysis, optimization, and control. Here, we propose a manifold-informed machine learning approach to learn the discrete evolution operator for the emergent/collective crowd dynamics in latent spaces from high-fidelity individual/agent-based simulations. The proposed framework is a four-stage one, \textit{explicitly conserving the mass} of the reconstructed dynamics in the high-dimensional space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians' positions) using Kernel Density Estimation. In the second step, we construct a map from the density-field space into an appropriate latent space parametrized by a few coordinates based on Proper-Orthogonal Decomposition (POD) of the corresponding density distributions. The third step involves learning reduced-order surrogate models in the latent space using machine learning techniques, particularly Long Short-Term Memory networks and Multivariate Autoregressive models. Finally, we reconstruct the crowd dynamics in the high-dimensional space with POD, demonstrating that the POD reconstruction conserves the mass. Thus, with this ``embed -> learn in latent space -> lift back to the high-dimensional space'' pipeline, we create an effective solution operator of the unavailable (at the macroscopic scale) PDE for the evolution of the density distribution. For our illustrations, we used the Social Force Model to generate data in a corridor with an obstacle, imposing periodic boundary conditions in two scenarios: (i) a unidirectional flow, and (ii) a counterflow. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling/simulation of crowd dynamics from agent-based simulations.

2507.20034 2026-03-03 cs.RO cs.CV

Digital and Robotic Twinning for Validation of Proximity Operations and Formation Flying

Z. Ahmed, E. Bates, P. Francesch Huc, S. Y. W. Low, A. Golan, T. Bell, A. Rizza, S. D'Amico

Journal ref 2026 Rocky Mountain AAS GN&C Conference, Breckenridge, Colorado

详情
英文摘要

Spacecraft Rendezvous, Proximity Operations (RPO), and Formation Flying (FF) rely on safety-critical guidance, navigation and control (GNC) that must satisfy stringent performance and robustness requirements. However, verifying GNC performance is challenging due to the complexity and inaccessibility of the space environment, necessitating a verification and validation (V\&V) process that bridges simulation and real-world behavior. This paper contributes a unified, closed-loop, end-to-end digital and robotic twinning framework that enables software- and hardware-in-the-loop testing of spacecraft GNC systems. The framework is designed for modularity and flexibility, supporting interchangeable sensing modalities, control algorithms, and operational regimes. The digital twin includes an event-driven faster-than-real-time simulation environment to support rapid prototyping. The architecture is augmented with hardware-based robotic testbeds from Stanford's Space Rendezvous Laboratory (SLAB): the GNSS and Radiofrequency Autonomous Navigation Testbed for Distributed Space Systems (GRAND) to validate RF-based navigation techniques, and the Testbed for Rendezvous and Optical Navigation (TRON) and Optical Stimulator (OS) to validate vision-based methods. The test article for this work is an integrated multi-modal GNC software stack developed at SLAB. This paper introduces the hybrid twinning framework, summarizes calibration and error characterization of the robotic testbeds, and evaluates GNC performance across multiple operational modes in a full-range RPO scenario in LEO. The results demonstrate consistency between software- and hardware-in-the-loop tests with clear explainability for deviations in performance, thus validating the hybrid twinning pipeline as a reliable framework for realistic assessment and verification of GNC systems.

2507.15852 2026-03-03 cs.CV cs.AI

Advancing Complex Video Object Segmentation via Progressive Concept Construction

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

Comments project page: https://rookiexiong7.github.io/projects/SeC/ ; code: https://github.com/OpenIXCLab/SeC ; dataset: https://huggingface.co/datasets/OpenIXCLab/SeCVOS

详情
英文摘要

We propose Segment Concept (SeC), a concept-driven video object segmentation (VOS) framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. To balance semantic reasoning with computational overhead, SeC forwards the LVLMs only when a new scene appears, injecting concept-level features at those points. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Empirical evaluations demonstrate that SeC substantially outperforms state-of-the-art approaches, including SAM 2 and its advanced variants, on both SeCVOS and standard VOS benchmarks. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware VOS.

2507.14894 2026-03-03 cs.CL

SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng

Comments ICLR 2026

详情
英文摘要

Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose $\textbf{S}$parse $\textbf{A}$utoencoder-guided $\textbf{S}$upervised $\textbf{F}$ine$\textbf{t}$uning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50\% compared to standard supervised fine-tuning, with complete elimination in one case. Moreover, SASFT maintains or even improves the models' performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities. The code and data are available at https://github.com/Aatrox103/SASFT.

2507.08776 2026-03-03 cs.CV

CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa

Comments Project page: https://clift-nvs.github.io

详情
英文摘要

This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering by compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, given a set of images, multi-view encoder tokenizes the images with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view ``condenser'' compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs of data size, rendering quality, and rendering speed.