arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 3188
专题追踪
2505.21366 2026-03-03 cs.LG

PLANETALIGN: A Comprehensive Python Library for Benchmarking Network Alignment

Qi Yu, Zhichen Zeng, Yuchen Yan, Zhining Liu, Baoyu Jing, Ruizhong Qiu, Ariful Azad, Hanghang Tong

Comments Published as a conference paper at ICLR 2026

详情
英文摘要

Network alignment (NA) aims to identify node correspondence across different networks and serves as a critical cornerstone behind various downstream multi-network learning tasks. Despite growing research in NA, there lacks a comprehensive library that facilitates the systematic development and benchmarking of NA methods. In this work, we introduce PLANETALIGN, a comprehensive Python library for network alignment that features a rich collection of built-in datasets, methods, and evaluation pipelines with easy-to-use APIs. Specifically, PLANETALIGN integrates 18 datasets and 14 NA methods with extensible APIs for easy use and development of NA methods. Our standardized evaluation pipeline encompasses a wide range of metrics, enabling a systematic assessment of the effectiveness, scalability, and robustness of NA methods. Through extensive comparative studies, we reveal practical insights into the strengths and limitations of existing NA methods. We hope that PLANETALIGN can foster a deeper understanding of the NA problem and facilitate the development and benchmarking of more effective, scalable, and robust methods in the future. The source code of PLANETALIGN is available at https://github.com/yq-leo/PlanetAlign.

2505.21082 2026-03-03 cs.CL

RPM: Reasoning-Level Personalization for Black-Box Large Language Models

Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, Dongha Lee

详情
英文摘要

While black-box large language models are widely deployed, they produce generic outputs that overlook individual user preferences. Current personalization methods are fundamentally limited to response-level personalization; they only match final outputs, failing to model the underlying reasoning that connects user behavior to responses. To address this, this work introduces reasoning-level personalization as a new paradigm and proposes RPM, the first systematic framework that automatically discovers user-specific reasoning structures from raw behavioral data to guide the model's personalized inference. RPM constructs a structured model of user behavior-built from response-influential features and statistical factors-to create personalized reasoning paths and retrieve beneficial examples for guiding inference through a feature-based retrieval mechanism. Extensive experiments across four diverse tasks demonstrate that RPM consistently outperforms existing response-level methods while simultaneously enhancing both personalization performance and interpretability, providing a promising direction for black-box LLM personalization.

2505.19193 2026-03-03 cs.LG

SuperMAN: Interpretable and Expressive Networks over Temporally Sparse Heterogeneous Data

Maya Bechler-Speicher, Andrea Zerio, Maor Huri, Marie Vibeke Vestergaard, Ran Gilad-Bachrach, Tine Jess, Samir Bhatt, Aleksejs Sazonovs

详情
英文摘要

Real-world temporal data often consists of multiple signal types recorded at irregular, asynchronous intervals. For instance, in the medical domain, different types of blood tests can be measured at different times and frequencies, resulting in fragmented and unevenly scattered temporal data. Similar issues of irregular sampling occur in other domains, such as the monitoring of large systems using event log files. Effectively learning from such data requires handling sets of temporal sparse and heterogeneous signals. In this work, we propose Super Mixing Additive Networks (SuperMAN), a novel and interpretable-by-design framework for learning directly from such heterogeneous signals, by modeling them as sets of implicit graphs. SuperMAN provides diverse interpretability capabilities, including node-level, graph-level, and subset-level importance, and enables practitioners to trade finer-grained interpretability for greater expressivity when domain priors are available. SuperMAN achieves state-of-the-art performance in real-world high-stakes tasks, including predicting Crohn's disease onset and hospital length of stay from routine blood test measurements and detecting fake news. Furthermore, we demonstrate how SuperMAN's interpretability properties assist in revealing disease development phase transitions and provide crucial insights in the healthcare domain.

2505.17702 2026-03-03 cs.CV cs.AI

Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

Xueyang Li, Jiahao Li, Yu Song, Yunzhong Lou, Xiangdong Zhou

Comments Accepted to ICLR 2026. The datatset has been released publicly and can be acessed in https://github.com/Sunny-Hack/Seek-CAD

详情
英文摘要

The advent of Computer-Aided Design (CAD) generative modeling will significantly transform the design of industrial products. The recent research endeavor has extended into the realm of Large Language Models (LLMs). In contrast to fine-tuning methods, training-free approaches typically utilize the advanced closed-source LLMs, thereby offering enhanced flexibility and efficiency in the development of AI agents for generating CAD parametric models. However, the substantial cost and limitations of local deployment of the top-tier closed-source LLMs pose challenges in practical applications. The Seek-CAD is the pioneer exploration of locally deployed open-source inference LLM DeepSeek-R1 for CAD parametric model generation with a training-free methodology. This study is the first investigation to incorporate both visual and Chain-of-Thought (CoT) feedback within the self-refinement mechanism for generating CAD models. Specifically, the initial generated parametric CAD model is rendered into a sequence of step-wise perspective images, which are subsequently processed by a Vision Language Model (VLM) alongside the corresponding CoTs derived from DeepSeek-R1 to assess the CAD model generation. Then, the feedback is utilized by DeepSeek-R1 to refine the initial generated model for the next round of generation. Moreover, we present an innovative 3D CAD model dataset structured around the SSR (Sketch, Sketch-based feature, and Refinements) triple design paradigm. This dataset encompasses a wide range of CAD commands, thereby aligning effectively with industrial application requirements and proving suitable for the generation of LLMs. Extensive experiments validate the effectiveness of Seek-CAD under various metrics.

2505.17132 2026-03-03 cs.CV cs.CL

Dynamic Token Reweighting for Robust Vision-Language Models

Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma, Ting Wang

Comments CVPR 2026

详情
英文摘要

Large vision-language models (VLMs) are highly vulnerable to multimodal jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model's key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model's general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that DTR outperforms existing defenses in both attack robustness and benign-task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. The code for replicating DTR is available at: https://github.com/TanqiuJiang/DTR.

2505.16448 2026-03-03 cs.AI

The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models

Renfei Dang, Zhening Li, Shujian Huang, Jiajun Chen

Comments ICLR 2026 poster

详情
英文摘要

Reasoning models often exhibit overthinking, characterized by redundant reasoning steps. We identify \emph{internal bias} elicited by the input question as a key trigger of such behavior. Upon encountering a problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it may not be explicitly generated, and it arises without systematic reasoning. When this guess conflicts with its subsequent reasoning, the model tends to engage in excessive reflection, resulting in wasted computation. We validate the association between internal bias and overthinking across multiple models and diverse reasoning tasks. To demonstrate the causal relationship more rigorously, we conduct two counterfactual interventions, showing that removing the input question after the model reduces the redundant reasoning across various complex reasoning tasks, and manually injecting bias affects overthinking accordingly. Further interpretability experiments suggest that excessive attention to the input question serves as a key mechanism through which internal bias influences subsequent reasoning trajectories. Finally, we evaluated several methods aimed at mitigating overthinking, yet the influence of internal bias persisted under all conditions.

2505.16056 2026-03-03 cs.LG cs.AI

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei

详情
英文摘要

Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the hit rate of an expert cache utilizing a length of future information under a cache limit. We analyze 20 MoE LLMs with diverse sizes and architectures and use toy models to verify key factors related to local routing consistency. We find a strong trade-off between local routing consistency and *local* load balance, while showing that *global* load balance can coexist with local routing consistency. Meanwhile, settings like shared experts that decrease expert combination space can lead to low local routing consistency. We further reveal that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models balance between cache effectiveness and efficiency with cache sizes approximately twice the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc .

2505.16017 2026-03-03 cs.LG cs.CV

GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection

Mariia Seleznova, Hung-Hsu Chou, Claudio Mayrink Verdun, Gitta Kutyniok

Journal ref In Proceedings of International Conference on Learning Representations (ICLR), 2026

详情
英文摘要

We introduce GradPCA, an Out-of-Distribution (OOD) detection method that exploits the low-rank structure of neural network gradients induced by Neural Tangent Kernel (NTK) alignment. GradPCA applies Principal Component Analysis (PCA) to gradient class-means, achieving more consistent performance than existing methods across standard image classification benchmarks. We provide a theoretical perspective on spectral OOD detection in neural networks to support GradPCA, highlighting feature-space properties that enable effective detection and naturally emerge from NTK alignment. Our analysis further reveals that feature quality -- particularly the use of pretrained versus non-pretrained representations -- plays a crucial role in determining which detectors will succeed. Extensive experiments validate the strong performance of GradPCA, and our theoretical framework offers guidance for designing more principled spectral OOD detectors.

2505.14362 2026-03-03 cs.CV

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu

Comments Accepted by ICLR2026. Ziwei, Michael, Jack, and Chenxiao are equal-contribution. The list order is random

详情
英文摘要

Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning without requiring pre-collected reasoning data for cold-start supervised fine-tuning (SFT). Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.

2505.14218 2026-03-03 cs.CV

Flexible-weighted Chamfer Distance: Enhanced Objective Function for Point Cloud Completion

Jie Li, Shengwei Tian, Long Yu, Xin Ning

Comments Accepted by IEEE TPAMI 2026. This is the author's version of the work. \c{opyright} 2026 IEEE. Personal use of this material is permitted. Code is available at this https URL [https://github.com/Carroll-Li/FCD]

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

详情
英文摘要

The Chamfer Distance (CD) is a cornerstone objective function for point cloud completion, yet its inherent symmetric weighting mechanism limits the quality of the generated results. By penalizing local detail deviations and global coverage deficiencies equally, standard CD often causes structural defects such as point aggregation and incomplete spatial structures. We introduce the Flexible-weighted Chamfer Distance (FCD), which decouples CD into local precision and global completeness sub-objectives. FCD employs an asymmetric weighting strategy that prioritizes global structural integrity, steering the optimization away from sub-optimal solutions. As a plug-and-play module with negligible overhead, extensive experiments on state-of-the-art networks demonstrate that FCD significantly enhances global distribution metrics while preserving local precision. Specifically, on the ShapeNet55 benchmark using AdaPoinTr, FCD reduces the Density-aware Chamfer Distance (DCD) by approximately 12.4% (from 0.613 to 0.537), effectively mitigating point clustering. Similarly, on the PCN dataset, the proposed method reduces the Earth Mover's Distance (EMD) from 23.79 to 21.40, demonstrating superior global uniformity compared to the standard CD baseline. Furthermore, FCD demonstrates excellent generalization. When applied to diverse tasks and datasets, including real-world scans (KITTI), industrial components (ABC), and point cloud upsampling (PU-GAN), it yields significant quantitative gains and produces visually more uniform and structurally complete point clouds. These results underscore FCD's potential as a versatile objective function for the broader point cloud generation domain.

2505.12734 2026-03-03 cs.SD cs.AI cs.GR cs.HC eess.AS

SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

Junbo Wang, Haofeng Tan, Bowen Liao, Albert Jiang, Teng Fei, Qixing Huang, Bing Zhou, Zhengzhong Tu, Shan Ye, Yuhao Kang

Comments 12 pages, 4 figures

详情
英文摘要

Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in the GeoS2L, while the PSS effectively captures multi-level generation consistency across element, scene,and human perception. Project page: https://gisense.github.io/SounDiT-Page/

2505.12096 2026-03-03 cs.LG cs.AI stat.ML

When Bias Meets Trainability: Connecting Theories of Initialization

Alberto Bassi, Marco Baity-Jesi, Aurelien Lucchi, Carlo Albert, Emanuele Francazi

详情
英文摘要

The statistical properties of deep neural networks (DNNs) at initialization play an important role to comprehend their trainability and the intrinsic architectural biases they possess before data exposure Well established mean field (MF) theories have uncovered that the distribution of parameters of randomly initialized networks strongly influences the behavior of the gradients, dictating whether they explode or vanish. Recent work has showed that untrained DNNs also manifest an initial guessing bias (IGB), in which large regions of the input space are assigned to a single class. In this work, we provide a theoretical proof that links IGB to previous MF theories for a vast class of DNNs, showing that efficient learning is tightly connected to a network prejudice towards a specific class. This connection leads to a counterintuitive conclusion: the initialization that optimizes trainability is systematically biased rather than neutral.

2505.11076 2026-03-03 cs.LG

Addition is almost all you need: Compressing large language models with double binary factorization

Vladimír Boža, Vladimír Macko

详情
英文摘要

Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. The code is available at: https://github.com/usamec/double_binary

2505.09662 2026-03-03 cs.CL

When Large Language Models are More PersuasiveThan Incentivized Humans, and Why

Philipp Schoenegger, Francesco Salvi, Jiacheng Liu, Xiaoli Nan, Ramit Debnath, Barbara Fasolo, Evelina Leivada, Gabriel Recchia, Fritz Günther, Ali Zarifhonarvar, Joe Kwon, Zahoor Ul Islam, Marco Dehnert, Daryl Y. H. Lee, Madeline G. Reinecke, David G. Kamper, Mert Kobaş, Adam Sandford, Jonas Kgomo, Luke Hewitt, Shreya Kapoor, Kerem Oktar, Eyup Engin Kucuk, Bo Feng, Cameron R. Jones, Izzy Gainsburg, Sebastian Olschewski, Nora Heinzelmann, Francisco Cruz, Ben M. Tappin, Tao Ma, Peter S. Park, Rayan Onyonka, Arthur Hjorth, Peter Slattery, Qingcheng Zeng, Lennart Finke, Igor Grossmann, Alessandro Salatiello, Ezra Karger

详情
英文摘要

Large Language Models (LLMs) have been shown to be highly persuasive, but when and why they outperform humans is still an open question. We compare the persuasiveness of two LLMs (Claude 3.5 Sonnet and DeepSeek v3) against humans who had incentives to persuade, using an interactive, real-time conversational setting. We demonstrate that LLMs persuasive superiority is context-dependent: it depends on whether the persuasion attempt is truthful (towards the right answer) or deceptive (towards the wrong answer) and on the LLM model, and wanes over repeated interactions (unlike human persuasiveness). In our first large-scale experiment, humans vs LLMs (Claude 3.5 Sonnet) interacted with other humans who were completing an online quiz for a reward, attempting to persuade them toward a given (either correct or incorrect) answer. Claude was more persuasive than incentivized human persuaders both in truthful and deceptive contexts and it significantly increased accuracy if persuasion was truthful, but decreased it if persuasion was deceptive. In a follow-up experiment with Deepseek v3, we replicated the findings about accuracy but found greater LLM persuasiveness only if the persuasion was deceptive. Linguistic analyses of the persuaders texts suggest that these effects may be due to LLMs expressing higher conviction than humans.

2505.09305 2026-03-03 cs.RO

Embodied intelligent industrial robotics: Framework and techniques

Chaoran Zhang, Chenhao Zhang, Zhaobo Xu, Qinghongbing Xie, Jinliang Hou, Pingfa Feng, Long Zeng

Comments 71 pages, 13 figures. The associated project can be found at https://github.com/jackyzengl/EIIR

详情
英文摘要

The combination of embodied intelligence and robots has great prospects and is becoming increasingly common. In order to work more efficiently, accurately, reliably, and safely in industrial scenarios, robots should have at least general knowledge, working-environment knowledge, and operating-object knowledge. These pose significant challenges to existing embodied intelligent robotics (EIR) techniques. Thus, this paper first briefly reviews the history of industrial robotics and analyzes the limitations of mainstream EIR frameworks. Then, a new knowledge-driven technical framework of embodied intelligent industrial robotics (EIIR) is proposed for various industrial environments. It has five modules: a world model, a high-level task planner, a low-level skill controller, a simulator, and a physical system. The development of techniques related to each module are also thoroughly reviewed, and recent progress regarding their adaption to industrial applications are discussed. A case study of real-world assembly system is given to demonstrate the newly proposed EIIR framework's applicability and potentiality. Finally, the key challenges that EIIR encounters in industrial scenarios are summarized and future research directions are suggested. The authors believe that EIIR technology is shaping the next generation of industrial robotics and EIIR-based industrial systems supply a new technological paradigm for intelligent manufacturing. It is expected that this review could serve as a valuable reference for scholars and engineers that are interested in industrial embodied intelligence. Together, scholars can use this research to drive their rapid advancement and application of EIIR techniques. The authors would continue to track and contribute new studies in the project page https://github.com/jackyzengl/EIIR

2505.06566 2026-03-03 cs.CV

Dynamic Uncertainty Learning with Noisy Correspondence for Text-Based Person Search

Zequn Xie, Haoming Ji, Chengxuan Li, Lingwei Meng

详情
英文摘要

Text-to-image person search aims to identify an individual based on a text description. To reduce data collection costs, large-scale text-image datasets are created from co-occurrence pairs found online. However, this can introduce noise, particularly mismatched pairs, which degrade retrieval performance. Existing methods often focus on negative samples, which amplify this noise. To address these issues, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework, which includes the Key Feature Selector (KFS) and a new loss function, Dynamic Softmax Hinge Loss (DSH-Loss). KFS captures and models noise uncertainty, improving retrieval reliability. The bidirectional evidence from cross-modal similarity is modeled as a Dirichlet distribution, enhancing adaptability to noisy data. DSH adjusts the difficulty of negative samples to improve robustness in noisy environments. Our experiments on three datasets show that the method offers strong noise resistance and improves retrieval performance in both low- and high-noise scenarios.

2505.02881 2026-03-03 cs.LG cs.AI

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki

详情
英文摘要

The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed pre-training datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach refines low-quality code, maximizing data utility. SwallowMath ($\approx$2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +16.1 on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting yielding the largest gains. By releasing datasets, prompts, checkpoints, and pipeline code, we ensure reproducibility and provide a transferable transform-and-retain methodology that can be adapted to other base models and LLM rewriting setups.

2505.02825 2026-03-03 cs.CV

Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology

Alex Hoi Hang Chan, Otto Brookes, Urs Waldmann, Hemal Naik, Iain D. Couzin, Majid Mirmehdi, Noël Adiko Houa, Emmanuelle Normand, Christophe Boesch, Lukas Boesch, Mimi Arandjelovic, Hjalmar Kühl, Tilo Burghardt, Fumihiro Kano

Comments Accepted at CVPR Workshops, CV4Animals 2025

详情
英文摘要

Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of its final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that leads to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets, allowing for models to be benchmarked in the context of their downstream application and to facilitate better integration of models into application workflows.

2504.21464 2026-03-03 cs.CV

VR-FuseNet: A Fusion of Heterogeneous Fundus Data and Explainable Deep Network for Diabetic Retinopathy Classification

Shamim Rahim Refat, Ziyan Shirin Raha, Shuvashis Sarker, Faika Fairuj Preotee, MD. Musfikur Rahman, Tashreef Muhammad, Mohammad Shafiul Alam

Comments Published in Biomedical Materials & Devices (Springer)

Journal ref Biomedical Materials & Devices (2026)

详情
英文摘要

Diabetic retinopathy is a severe eye condition caused by diabetes where the retinal blood vessels get damaged and can lead to vision loss and blindness if not treated. Early and accurate detection is key to intervention and stopping the disease progressing. For addressing this disease properly, this paper presents a comprehensive approach for automated diabetic retinopathy detection by proposing a new hybrid deep learning model called VR-FuseNet. Diabetic retinopathy is a major eye disease and leading cause of blindness especially among diabetic patients so accurate and efficient automated detection methods are required. To address the limitations of existing methods including dataset imbalance, diversity and generalization issues this paper presents a hybrid dataset created from five publicly available diabetic retinopathy datasets. Essential preprocessing techniques such as SMOTE for class balancing and CLAHE for image enhancement are applied systematically to the dataset to improve the robustness and generalizability of the dataset. The proposed VR-FuseNet model combines the strengths of two state-of-the-art convolutional neural networks, VGG19 which captures fine-grained spatial features and ResNet50V2 which is known for its deep hierarchical feature extraction. This fusion improves the diagnostic performance and achieves an accuracy of 91.824%. The model outperforms individual architectures on all performance metrics demonstrating the effectiveness of hybrid feature extraction in Diabetic Retinopathy classification tasks. To make the proposed model more clinically useful and interpretable this paper incorporates multiple XAI techniques. These techniques generate visual explanations that clearly indicate the retinal features affecting the model's prediction such as microaneurysms, hemorrhages and exudates so that clinicians can interpret and validate.

2504.12419 2026-03-03 cs.LG math.OC quant-ph

Standardization of Multi-Objective QUBOs

Loong Kuan Lee, Thore Gerlach, Nico Piatkowski

Comments 7 pages, 3 figures; Published in the 2025 IEEE International Conference on Quantum Computing and Engineering (QCE); For associated code, see https://gitlab.com/lklee/qubo-standardization

Journal ref 2025 IEEE International Conference on Quantum Computing and Engineering (QCE) (Vol. 1, pp. 58-64)

详情
英文摘要

Multi-objective optimization involving Quadratic Unconstrained Binary Optimization (QUBO) problems arises in various domains. A fundamental challenge in this context is the effective balancing of multiple objectives, each potentially operating on very different scales. This imbalance introduces complications such as the selection of appropriate weights when scalarizing multiple objectives into a single objective function. In this paper, we propose a novel technique for scaling QUBO objectives that uses an exact computation of the variance of each individual QUBO objective. By scaling each objective to have unit variance, we align all objectives onto a common scale, thereby allowing for more balanced solutions to be found when scalarizing the objectives with equal weights, as well as potentially assisting in the search or choice of weights during scalarization. Finally, we demonstrate its advantages through empirical evaluations on various multi-objective optimization problems. Our results are noteworthy since manually selecting scalarization weights is cumbersome, and reliable, efficient solutions are scarce.

2504.03889 2026-03-03 cs.LG

Identifying and Evaluating Inactive Heads in Pretrained LLMs

Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs

Comments Accepted to ICLR 2026. Code available at https://github.com/psandovalsegura/inactive-heads

详情
英文摘要

Attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives the most attention despite limited semantic importance, suggest some heads may be inactive, and point to a significant source of computational redundancy. To analyze this phenomenon, we evaluate 12 score functions that measure different ways a head can be inactive. Thresholding these scores allows us to analyze different sets of potentially inactive attention heads. We evaluate whether identified heads are inactive through model interventions, finding that more than 12% of attention heads are inactive on average, and can be ablated in specific contexts while maintaining MMLU accuracy to within 1% of the pretrained LLM. Across 3 model families, our score functions that measure the average norm of a head's output consistently identify inactive heads that would not have been found by score functions that rely solely on attention weights. We establish that relying on a score function that measures a first token attention sink would underestimate the prevalence of inactive heads, failing to identify more than 7% of inactive heads on average. We also show how measuring score distributions can provide insights into attention behavior. For instance, we find evidence that finetuning causes little to no change in attention behavior, and that even within the same model family, large model scales present different attention behaviors.

2504.01519 2026-03-03 cs.CL eess.AS

Chain of Correction for Full-text Speech Recognition with Large Language Models

Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang

Comments ICASSP 2026

详情
英文摘要

Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) is attracting increased attention for its ability to address a wide range of error types, such as punctuation restoration and inverse text normalization, across long context. However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC's performance. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. We also analyze correction thresholds to balance under-correction and over-rephrasing, extrapolate CoC on extra-long ASR outputs, and explore using other types of information to guide error correction.

2503.24378 2026-03-03 cs.AI

ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning

Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi

Comments Accepted at Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026), see https://openreview.net/forum?id=WIXohR7mEo

详情
英文摘要

The ACPBench dataset provides atomic reasoning tasks required for efficient planning. The dataset is aimed at distilling the complex plan generation task into separate atomic reasoning tasks in their easiest possible form, boolean or multiple-choice questions, where the model has to choose the right answer from the provided options. While the aim of ACPBench is to test the simplest form of reasoning about action and change, when tasked with planning, a model does not typically have options to choose from and thus the reasoning required for planning dictates an open-ended, generative form for these tasks. To that end, we introduce ACPBench Hard, a generative version of ACPBench, with open-ended questions which the model needs to answer. Models that perform well on these tasks could in principle be integrated into a planner or be used directly as a policy. We discuss the complexity of these tasks as well as the complexity of validating the correctness of their answers and present validation algorithms for each task. Equipped with these validators, we test the performance of a variety of models on our tasks and find that for most of these tasks the performance of even the largest models is still subpar. Our experiments show that no model outperforms another in these tasks and with a few exceptions all tested language models score below 65%, indicating that even the current frontier language models have a long way to go before they can reliably reason about planning. In fact, even the so-called reasoning models struggle with solving these reasoning tasks. ACPBench Hard collection is available at the following link: https://ibm.github.io/ACPBench

2503.22178 2026-03-03 cs.LG cs.AI cs.CV

AdaRank: Adaptive Rank Pruning for Enhanced Model Merging

Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong

Comments ICLR 2026. Code available at: github.com/david3684/AdaRank

详情
英文摘要

Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on such manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks during test-time via entropy minimization. Our analysis demonstrates that such method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and number of tasks, reducing the performance gap between fine-tuned models to nearly 1%.

2503.18950 2026-03-03 cs.CV

Target-Aware Video Diffusion Models

Taeksoo Kim, Hanbyul Joo

Comments ICLR 2026. The project page is available at https://taeksuu.github.io/tavid/

详情
英文摘要

We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human-object interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.

2503.16553 2026-03-03 cs.CL

A Foundational Individual Mobility Prediction Model based on Open-Source Large Language Models

Zhenlin Qin, Leizhen Wang, Yancheng Ling, Francisco Camara Pereira, Zhenliang Ma

Journal ref Transportation Research Part C: Emerging Technologies, Vol. 185, 105562 (2026)

详情
英文摘要

Individual mobility prediction plays a key role in urban transport, enabling personalized service recommendations and effective travel management. It is widely modeled by data-driven methods such as machine learning, deep learning, as well as classical econometric methods to capture key features of mobility patterns. However, such methods are hindered in promoting further transferability and robustness due to limited capacity to learn mobility patterns from different data sources, predict in out-of-distribution settings (a.k.a ``zero-shot"). To address this challenge, this paper introduces MoBLLM, a foundational model for individual mobility prediction that aims to learn a shared and transferable representation of mobility behavior across heterogeneous data sources. Based on a lightweight open-source large language model (LLM), MoBLLM employs Parameter-Efficient Fine-Tuning (PEFT) techniques to create a cost-effective training pipeline, avoiding the need for large-scale GPU clusters while maintaining strong performance. We conduct extensive experiments on six real-world mobility datasets to evaluate its accuracy, robustness, and transferability across varying temporal scales (years), spatial contexts (cities), and situational conditions (e.g., disruptions and interventions). MoBLLM achieves the best F1 score and accuracy across all datasets compared with state-of-the-art deep learning models and shows better transferability and cost efficiency than commercial LLMs. Further experiments reveal its robustness under network changes, policy interventions, special events, and incidents. These results indicate that MoBLLM provides a generalizable modeling foundation for individual mobility behavior, enabling more reliable and adaptive personalized information services for transportation management.

2503.11120 2026-03-03 cs.LG cs.CV

A Multi-Objective Evaluation Framework for Analyzing Utility-Fairness Trade-Offs in Machine Learning Systems

Gökhan Özbulak, Oscar Jimenez-del-Toro, Maíra Fatoretto, Lilian Berton, André Anjos

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:050

Journal ref Machine.Learning.for.Biomedical.Imaging. 2025 (3)

详情
英文摘要

The evaluation of fairness models in Machine Learning involves complex challenges, such as defining appropriate metrics, balancing trade-offs between utility and fairness, and there are still gaps in this stage. This work presents a novel multi-objective evaluation framework that enables the analysis of utility-fairness trade-offs in Machine Learning systems. The framework was developed using criteria from Multi-Objective Optimization that collect comprehensive information regarding this complex evaluation task. The assessment of multiple Machine Learning systems is summarized, both quantitatively and qualitatively, in a straightforward manner through a radar chart and a measurement table encompassing various aspects such as convergence, system capacity, and diversity. The framework's compact representation of performance facilitates the comparative analysis of different Machine Learning strategies for decision-makers, in real-world applications, with single or multiple fairness requirements. In particular, this study focuses on the medical imaging domain, where fairness considerations are crucial due to the potential impact of biased diagnostic systems on patient outcomes. The proposed framework enables a systematic evaluation of multiple fairness constraints helping to identify and mitigate disparities among demographic groups while maintaining diagnostic performance. The framework is model-agnostic and flexible to be adapted to any kind of Machine Learning systems, that is, black- or white-box, any kind and quantity of evaluation metrics, including multidimensional fairness criteria. The functionality and effectiveness of the proposed framework is shown with different simulations, and an empirical study conducted on three real-world medical imaging datasets with various Machine Learning systems. Our evaluation framework is publicly available at https://pypi.org/project/fairical.

2503.08980 2026-03-03 cs.LG cs.CL

I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi

详情
英文摘要

Recent empirical evidence shows that LLM representations encode human-interpretable concepts. Nevertheless, the mechanisms by which these representations emerge remain largely unexplored. To shed further light on this, we introduce a novel generative model that generates tokens on the basis of such concepts formulated as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish rigorous identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts given input context, up to an linear transformation. This theoretical finding: 1) provides evidence that LLMs capture essential underlying generative factors, 2) offers a unified and principled perspective for understanding the linear representation hypothesis, and 3) motivates a theoretically grounded approach for evaluating sparse autoencoders. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.

2503.07392 2026-03-03 cs.CV

SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Yanbin Hao, Fuli Feng

Comments Accepted to ICLR 2026

详情
英文摘要

Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. In scalable applications, fine-tuning-based methods are time-consuming to precisely erase multiple target concepts, while real-time editing-based methods often degrade the generation quality of non-target concepts due to conflicting optimization objectives. To address this dilemma, we introduce SPEED, an efficient concept erasure approach that directly edits model parameters. SPEED searches for a null space, a model editing space where parameter updates do not affect non-target concepts, to achieve scalable and precise erasure. To facilitate accurate null space optimization, we incorporate three complementary strategies: Influence-based Prior Filtering (IPF) to selectively retain the most affected non-target concepts, Directed Prior Augmentation (DPA) to enrich the filtered retain set with semantically consistent variations, and Invariant Equality Constraints (IEC) to preserve key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure, successfully erasing 100 concepts within only 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.

2503.06764 2026-03-03 cs.CV cs.AI

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Zisheng Chen, Chunwei Wang, Runhui Huang, Hongbin Xu, Xiuwei Chen, Jun Zhou, Jianhua Han, Hang Xu, Xiaodan Liang

Comments ICLR 2026

详情
英文摘要

In this paper, we introduce SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete representations for multimodal understanding and generation. Recently, unified image tokenizers have sparked exploration within the research community, which is designed to capture high-level semantic features for understanding and retaining low-level pixel features for generation. Previous works attempt to train a unified image tokenizer by combining loss for semantic distillation and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a novel semantic-guided hierarchical codebook, which builds pixel sub-codebooks on a pretrained semantic codebook. This design decouples the semantic and pixel in terms of structure and training strategy, enabling the tokenizer to capture pixel features while retaining its ability to comprehend high-level semantic information. Our experiments demonstrate that SemHiTok achieves leading performance in image reconstruction and multimodal understanding under the LLaVA-v1.5 setting. Further, we develop a unified MLLM with SemHiTok, which exhibits superior performance across multimodal understanding and generation tasks. Extensive experiments confirm our analysis, showing that our unified image tokenizer architecture achieves a better trade-off.