arXivDaily: arXiv daily academic digest, updated Monday through Friday
2509.00084 2026-04-28 cs.LG cs.AI cs.CL

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, Furu Wei, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Abstract

Test-time scaling (TTS) has gained widespread attention for enhancing LLM reasoning. Existing approaches such as Best-of-N and majority voting are limited as their performance depends on the quality of candidate responses, making them unable to produce a correct solution when all candidates are incorrect. Parallel self-refinement, generating multiple candidates and synthesizing a refined answer conditioned on them, offers a promising alternative, but the underlying mechanism driving its effectiveness remains obscure. To bridge this gap in understanding, we introduce a new metric, the Refinement Gap, designed to quantify the relative improvement of self-refinement beyond majority voting. We show that the Refinement Gap exhibits a clear scaling trend with model size and is only weakly correlated with the base capability. Based on this discovery, we propose Generative Self-Refinement (GSR), a parallel test-time scaling framework that transfers the refinement policy from larger teacher models with higher refinement gap into smaller students. Crucially, GSR jointly trains a single model to generate strong candidates and refine a better final answer based on these candidates. Experimental results demonstrate that our method achieves state-of-the-art performance across five mathematical benchmarks over other parallel aggregation methods, while the learned refinement skill transfers across multiple model scales and families and exhibits robust generalization to an out-of-distribution domain.
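
The limitation of majority voting described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give a formula for the Refinement Gap, so `refinement_gap` below is an assumed reading of "relative improvement of self-refinement beyond majority voting", and all function names and numbers are hypothetical.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(candidates).most_common(1)[0][0]


def refinement_gap(refined_acc, majority_acc):
    """Assumed definition: relative accuracy gain of refinement over voting."""
    return (refined_acc - majority_acc) / majority_acc


# Hypothetical candidates for one problem whose correct answer is "14".
candidates = ["12", "15", "12", "9"]
voted = majority_vote(candidates)

# Voting can only select among the candidates, so it cannot return "14" here.
can_recover = "14" in candidates

# Hypothetical benchmark accuracies, used only to exercise the formula.
gap = refinement_gap(refined_acc=0.6, majority_acc=0.5)
```

A refinement step that synthesizes a new answer conditioned on the candidates is not restricted to the candidate set, which is the property the abstract highlights.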

2508.19652 2026-04-28 cs.CV

Self-Rewarding Vision-Language Model via Reasoning Decomposition

Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu

Comments 16 pages, two figures

Abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations, where they generate content that is inconsistent with the visual input, and from language shortcuts, where they skip the visual part and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning: the model is first prompted to produce self-contained visual descriptions sufficient to answer the question without referring back to the input image, and both visual and language reasoning are then jointly optimized through our multi-reward loss objective. To validate this self-containment, the same VLM is re-prompted to perform language reasoning using only the generated visual reasoning as input, which yields the visual reward. The final reward is computed through a decoupled reward-advantage framework, where the visual reward and the language reasoning reward each have their advantages calculated separately. Our experiments show that Vision SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks, while being more efficient than methods that rely on external visual reward models, which require additional GPUs to host. In contrast, Vision SR1 introduces no extra GPU overhead beyond that of standard training.

2507.20088 2026-04-28 cs.LG math-ph math.MP math.OC stat.ML

Learning Latent Graph Geometry via Fixed-Point Schrödinger-Type Activation: A Theoretical Study

Dmitry Pasechnyuk-Vilensky, Martin Takáč

Comments 50 pages

Abstract

We study neural architectures in which each hidden layer is defined by the stationary state of a dissipative Schrödinger-type dynamics on a learned latent graph. On stable branches, the local stationary problem defines a differentiable implicit graph layer. To learn the graph itself, we optimize over the stratified moduli space of weighted graphs and equip each stratum with a non-degenerate Kähler-Hessian metric that keeps natural-gradient descent and face crossing well posed. We then show that a multilayer stationary network is equivalent to an exact global stationary problem on a supra-graph, and that it admits a penalized global relaxation whose stationary states converge to the exact one as the penalty parameter tends to infinity. Reverse-mode differentiation is recovered as the adjoint of the exact global system, and the penalized adjoint converges to it in the same limit. Finally, under finite-dimensional strong-monotonicity and admissible-lift assumptions, the corresponding represented hypothesis classes coincide among resolvent feed-forward networks, graph-stationary networks, supra-graph stationary systems, and sheaf-based architectures with unitary connection. The resulting structural identifications yield complexity bounds controlled by sparse graph or supra-graph geometry rather than dense ambient connectivity.

2507.09245 2026-04-28 cs.CL

Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

Deshan Sumanathilaka, Sameera Perera, Sachithya Dharmasiri, Maneesha Athukorala, Anuja Dilrukshi Herath, Rukshan Dias, Pasindu Gamage, Ruvan Weerasinghe, Y. H. P. P. Priyadarshana

Comments 15 pages, 5 Tables, 3 figures

Abstract

The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.

2507.06542 2026-04-28 cs.LG cs.DC cs.MA stat.ML

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang

Comments We discover and theoretically explain why and when a single global parameter merging in decentralized learning can recover the performance of federated learning, even in highly heterogeneous and communication-constrained environments

Journal ref ICLR 2026 (Oral Presentation)
Abstract

Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.
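
The contrast between limited peer-to-peer communication and a single global merging can be sketched in a few lines of Python. This sketch assumes that "merging" means uniform parameter averaging across all devices and that the peer-to-peer topology is a ring; both are illustrative assumptions, as are the function names and toy parameter values.

```python
def ring_gossip_step(local_models):
    """Each device averages its parameters with its two ring neighbors."""
    n = len(local_models)
    mixed = []
    for k in range(n):
        group = [local_models[(k - 1) % n], local_models[k], local_models[(k + 1) % n]]
        mixed.append([sum(values) / 3 for values in zip(*group)])
    return mixed


def global_merge(local_models):
    """Uniformly average the parameters of all devices (fully connected step)."""
    n = len(local_models)
    return [sum(values) / n for values in zip(*local_models)]


# Four devices, each holding a one-parameter toy model (hypothetical values).
local_models = [[0.0], [4.0], [8.0], [12.0]]

after_gossip = ring_gossip_step(local_models)  # devices still disagree
merged = global_merge(local_models)            # one shared model for everyone
```

After one gossip step the devices still hold different parameters, whereas the global merge produces a single consensus model in one communication round.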

2505.16518 2026-04-28 cs.CL cs.AI

CUB: Benchmarking Context Utilisation Techniques for Language Models

Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, Isabelle Augenstein

Comments Accepted at ACL 2026, 33 pages

Abstract

Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) - the first comprehensive benchmark designed to help diagnose CMTs under diverse noisy context conditions within retrieval-augmented generation (RAG). With this benchmark, we conduct the most extensive evaluation to date of seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to 11 LMs. Our findings expose critical gaps in current CMT evaluation practices, demonstrating the need for holistic testing. We reveal that most existing CMTs struggle to handle the full spectrum of context types encountered in real-world RAG scenarios. We also find that many CMTs display inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples.

2505.12009 2026-04-28 cs.CV

LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation

Zhiying Li, Guanggang Geng, Yeying Jin, Shuyuan Lin, Fengyuan Ma, Zhaoxin Fan, Lili Wang

Comments 10 pages, 6 figures

Abstract

Expressive human pose and shape estimation (EHPS) plays a central role in digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention to potential security vulnerabilities, such as generating inappropriate content, violent actions, or racially offensive gestures and expressions. Current adversarial attacks on EHPS models often generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address this limitation, we propose an unnoticeable adversarial method, termed LatentStealth, specifically tailored for EHPS models. The key idea is to exploit the structured latent representations of natural images as the medium for crafting perturbations. Instead of injecting noise directly into the pixel space, our method projects inputs into the latent space, where adversarial patterns are generated and progressively refined along optimized directions. This latent-space manipulation enables the attack to maintain high imperceptibility while preserving its effectiveness. Furthermore, as the optimization process is guided by only a small number of model output queries, the framework achieves competitive attack performance with low computational overhead, making it both practical and efficient for real-world scenarios. Extensive experiments on the 3DPW and UBody datasets demonstrate the superiority of LatentStealth, revealing critical vulnerabilities in current systems. These findings highlight the urgent need to address and mitigate security risks in digital human generation technologies.

2505.02922 2026-04-28 cs.LG

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, Mao Yang

Comments 16 pages; Accepted by VLDB 2026

Journal ref PVLDB, 19(5): 1016-1031, 2026
Abstract

Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse attention baselines at 1 million tokens -- all while preserving full-attention-level accuracy.

2505.01595 2026-04-28 cs.CL cs.AI cs.LG

Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

Liaoyaqi Wang, Zhengping Jiang, Anqi Liu, Benjamin Van Durme

Abstract

We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.

2504.10527 2026-04-28 cs.AI cs.CY

Explainable Artificial Intelligence Techniques for Interpretation of Food Models: a Review

Leonardo Arrighi, Ingrid Alves de Moraes, Marco Zullich, Michele Simonato, Douglas Fernandes Barbin, Sylvio Barbon Junior

Comments 47 pages, 10 figures, 7 tables

Journal ref Artificial Intelligence Review (2026)
Abstract

Artificial Intelligence (AI) has become essential for analyzing complex data and solving highly-challenging tasks. It is being applied across numerous disciplines beyond computer science, including Food Engineering, where there is a growing demand for accurate and reliable predictions to meet stringent food quality standards. However, this requires increasingly complex AI models, raising concerns. In response, eXplainable AI (XAI) has emerged to provide insights into AI decision-making, aiding model interpretation by developers and users. Nevertheless, XAI remains underutilized in Food Engineering, limiting model reliability. For instance, in food quality control, AI models using spectral imaging can detect contaminants or assess freshness levels, but their opaque decision-making process hinders adoption. XAI techniques such as SHAP (Shapley Additive Explanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) can pinpoint which spectral wavelengths or image regions contribute most to a prediction, enhancing transparency and aiding quality control inspectors in verifying AI-generated assessments. This survey presents a taxonomy for classifying food quality research using XAI techniques, organized by data types and explanation methods, to guide researchers in choosing suitable approaches. We also highlight trends, challenges, and opportunities to encourage the adoption of XAI in Food Engineering.

2504.09499 2026-04-28 cs.LG cs.AI

Decoding the mechanisms of the Hattrick football manager game using Bayesian network structure learning

Anthony C. Constantinou, Nicholas Higgins, Neville K. Kitson

Abstract

Hattrick is a free web-based probabilistic football manager game with over 200,000 users competing for titles at national and international levels. Launched in Sweden in 1997 as part of an MSc project, the game's slow-paced design has fostered a loyal community, with users remaining active for decades. Hattrick's game-engine mechanics are partially hidden, and users have attempted to decode them with incremental success over the years. Rule-based, statistical and machine learning models have been developed to aid this effort and are widely used by the community, but have not been formally evaluated in the scientific literature. This study is the first to explore Hattrick using structure learning techniques and Bayesian networks, integrating expert knowledge with data to develop models that simulate and explain the game-engine. We assess the effectiveness of structure learning algorithms in relation to knowledge-based structures, and publicly share a fully specified Bayesian network model that matches the performance of top models used by the Hattrick community. We further demonstrate how analysis extends beyond prediction by providing a visual representation of dependencies between features, and using the optimal model for in-game decision-making. To support future research, we make all data, graphical structures, and models publicly available online.

2503.10666 2026-04-28 cs.CL cs.AI cs.LG

Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference

Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan

Comments 9 pages, 5 figures

Abstract

Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this study, we empirically study how different prompt and response characteristics directly impact LLM inference energy cost. We conduct experiments leveraging three open-source transformer-based LLMs across three task types: question answering, sentiment analysis, and text generation. For each inference, we analyzed prompt and response characteristics (length, semantic meaning, time taken, energy consumption). Our results demonstrate that even when presented with identical tasks, models generate responses with varying characteristics and subsequently exhibit distinct energy consumption patterns. We found that prompt length is less significant than the semantic meaning of the task itself. In addition, we identified specific keywords associated with higher or lower energy usage that vary between associated tasks. These findings highlight the importance of prompt design in optimizing inference efficiency. We conclude that the semantic meaning of prompts and certain task-related keywords significantly impact inference costs, leading the way for deeper exploration towards creating energy-adaptive LLMs.

2501.07237 2026-04-28 cs.LG cs.AI

GWT: Scalable Optimizer State Compression for Large Language Model Training

Ziqing Wen, Ping Luo, Jiahuan Wang, Kun Yuan, Dongsheng Li, Tao Sun

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing benchmarks. However, the escalating scale of model parameters imposes prohibitive memory overheads during training, especially when employing stateful optimizers such as Adam. Conventional memory-efficient strategies, typically involving singular value decomposition (SVD) or weight freezing, often incur non-negligible performance degradation relative to full-rank updates. To address these limitations, this paper explores memory-efficient optimization beyond low-rank constraints and proposes the Gradient Wavelet Transform (GWT). GWT characterizes a novel compression framework that projects gradients into wavelet subspaces, effectively compacting optimizer states while preserving essential update information. We theoretically and empirically demonstrate that GWT can be seamlessly integrated into existing optimization protocols, facilitating resource-efficient training without compromising model fidelity. Rigorous evaluations encompassing both large-scale pre-training and task-specific fine-tuning reveal that GWT yields performance parity with advanced memory-efficient optimizers and full-rank updates. Furthermore, GWT provides a scalable and robust solution for managing the memory-intensive pipelines inherent in modern large-scale data engineering and knowledge discovery systems.
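
The idea of projecting a gradient into a wavelet subspace can be illustrated with a small Python sketch. The abstract does not say which wavelet or how many levels GWT uses, so the one-level Haar transform below is an assumption chosen for simplicity, and the function names and toy gradient are hypothetical; this is not the authors' implementation.

```python
import math


def haar_forward(x):
    """One-level Haar transform of an even-length list: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward; passing zero details gives the low-pass reconstruction."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x


grad = [4.0, 2.0, -1.0, -3.0]  # toy gradient
approx, detail = haar_forward(grad)

# Keeping per-coordinate optimizer state only for the approximation band
# halves the number of stored values in this sketch.
state_size = len(approx)

# Mapping back with the detail band dropped yields pairwise averages.
restored = haar_inverse(approx, [0.0] * len(approx))
exact = haar_inverse(approx, detail)
```

In this sketch the compressed band has half the length of the gradient, and the full transform remains exactly invertible when the detail band is kept.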

2410.05970 2026-04-28 cs.CV cs.AI cs.CL

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

Xudong Xie, Hao Yan, Liang Yin, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai

Comments Accepted by International Journal of Computer Vision (IJCV)

Abstract

Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially for academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) that is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of English and Chinese academic papers. Multiple strategies are proposed to build 1.1 million high-quality QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal document understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.

2408.00923 2026-04-28 cs.CV cs.AI

Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization

Róisín Luo, Alexandru Drimbarean, James McDermott, Colm O'Riordan

Comments Accepted by The 35th British Machine Vision Conference (BMVC 2024)

Abstract

This paper explores a novel paradigm in low-bit (i.e. 4 bits or lower) quantization, differing from existing state-of-the-art methods, by framing optimal quantization as an architecture search problem within convolutional neural networks (ConvNets). Our framework, dubbed CoRa (Optimal Quantization Residual Convolutional Operator Low-Rank Adaptation), is motivated by two key aspects. Firstly, quantization residual knowledge, i.e. the lost information between floating-point weights and quantized weights, has long been neglected by the research community. Reclaiming the critical residual knowledge, with an infinitesimal extra parameter cost, can reverse performance degradation without training. Secondly, state-of-the-art quantization frameworks search for optimal quantized weights to address the performance degradation. Yet, the vast search spaces in weight optimization pose a challenge for the efficient optimization in large models. For example, state-of-the-art BRECQ necessitates $2 \times 10^4$ iterations to quantize models. Fundamentally differing from existing methods, CoRa searches for the optimal architectures of low-rank adapters, reclaiming critical quantization residual knowledge, within search spaces that are smaller than the weight spaces by many orders of magnitude. The low-rank adapters approximate the quantization residual weights, discarded in previous methods. We evaluate our approach over multiple pre-trained ConvNets on ImageNet. CoRa achieves comparable performance against both state-of-the-art quantization-aware training and post-training quantization baselines, in 4-bit and 3-bit quantization, by using less than 250 iterations on a small calibration set with 1600 images. Thus, CoRa establishes a new state-of-the-art in terms of the optimization efficiency in low-bit quantization.

2407.14974 2026-04-28 cs.LG cs.AI

Out of Spuriousity: Improving Robustness to Spurious Correlations without Group Annotations

Phuong Quynh Le, Jörg Schlötterer, Christin Seifert

Comments Accepted to TMLR

Abstract

Machine learning models are known to learn spurious correlations, i.e., features having strong relations with class labels but no causal relation. Relying on those correlations leads to poor performance in the data groups without these correlations and poor generalization ability. To improve the robustness of machine learning models to spurious correlations, we propose an approach to extract a subnetwork from a fully trained network that does not rely on spurious correlations. To find the subnetwork, we assume that data points with the same spurious attribute lie close to each other in the representation space when training with empirical risk minimization (ERM); we then employ a supervised contrastive loss in a novel way to force models to unlearn the spurious connections. The increase in the worst-group performance of our approach strengthens the hypothesis that there exists a subnetwork in a fully trained dense network that is responsible for using only invariant features in classification tasks, thereby erasing the influence of spurious features even in settings with multiple spurious attributes and no prior knowledge of attribute labels.

2405.20642 2026-04-28 cs.LG stat.ML

Learning Under Moral Hazard with Instrumental Regression and Generalized Method of Moments

Shiliang Zuo

Abstract

Machine learning has become increasingly popular in informing data-driven policy-making. Policies influence behavior in individuals or populations, and ideally, through observational signals, policy-makers learn which policies are effective. However, in many settings, individual actions cannot be perfectly observed. This issue, known in economics as moral hazard, poses a significant challenge. In this work, we study the foundational multitasking principal-agent contract design problem and demonstrate how instrumental regression and the generalized method of moments (GMM) estimator can be used to estimate or learn a good contract. As a bonus result, we also give a uniformity characterization of the shape of the optimal contract.

2405.04211 2026-04-28 cs.CV

Leveraging Medical Foundation Model Features in Graph Neural Network-Based Retrieval of Breast Histopathology Images

Nematollah Saeidi, Hossein Karshenas, Bijan Shoushtarian, Sepideh Hatamikia, Ramona Woitek, Amirreza Mahbod

Comments 29 pages

Journal ref International Journal of Imaging Systems and Technology, 2026
Abstract

Breast cancer is the most common cancer type in women worldwide. Early detection and appropriate treatment can significantly reduce its impact. While histopathology examinations play a vital role in rapid and accurate diagnosis, they often require experienced medical experts for proper recognition and cancer grading. Automated image retrieval systems have the potential to assist pathologists in identifying cancerous tissues, thereby accelerating the diagnostic process. Nevertheless, proposing an accurate image retrieval model is challenging due to considerable variability among the tissue and cell patterns in histological images. In this work, we leverage the features from foundation models in a novel attention-based adversarially regularized variational graph autoencoder model for breast histological image retrieval. Our results confirm the superior performance of models trained with foundation model features compared to those using pre-trained convolutional neural networks (up to 7.7% and 15.5% for mAP and mMV, respectively), with the pre-trained general-purpose self-supervised model for computational pathology (UNI) delivering the best overall performance. By evaluating two publicly available histology image datasets of breast cancer, our top-performing model, trained with UNI features, achieved average mAP/mMV scores of 96.7%/91.5% and 97.6%/94.2% for the BreakHis and BACH datasets, respectively. Our proposed retrieval model has the potential to be used in clinical settings to enhance diagnostic performance and ultimately benefit patients.

2403.16958 2026-04-28 cs.CV

TwinLiteNet+: An Enhanced Multi-Task Segmentation Model for Autonomous Driving

Quang-Huy Che, Duc-Tri Le, Minh-Quan Pham, Vinh-Tiep Nguyen, Duc-Khai Lam

Journal ref Computers and Electrical Engineering 128 (2025) 110694
Abstract

Semantic segmentation is a fundamental perception task in autonomous driving, particularly for identifying drivable areas and lane markings to enable safe navigation. However, most state-of-the-art (SOTA) models are computationally intensive and unsuitable for real-time deployment on resource-constrained embedded devices. In this paper, we introduce TwinLiteNet+, an enhanced multi-task segmentation model designed for real-time drivable area and lane segmentation with high efficiency. TwinLiteNet+ employs a hybrid encoder architecture that integrates stride-based dilated convolutions and depthwise separable dilated convolutions, balancing representational capacity and computational cost. To improve task-specific decoding, we propose two lightweight upsampling modules, the Upper Convolution Block (UCB) and the Upper Simple Block (USB), alongside a Partial Class Activation Attention (PCAA) mechanism that enhances segmentation precision. The model is available in four configurations, ranging from the ultra-compact TwinLiteNet+ Nano (34K parameters) to the high-performance TwinLiteNet+ Large (1.94M parameters). On the BDD100K dataset, TwinLiteNet+ Large achieves 92.9% mIoU for drivable area segmentation and 34.2% IoU for lane segmentation, surpassing existing state-of-the-art models while requiring 11x fewer floating-point operations (FLOPs) for computation. Extensive evaluations on embedded devices demonstrate superior inference speed, quantization robustness (INT8/FP16), and energy efficiency, validating TwinLiteNet+ as a compelling solution for real-world autonomous driving systems. Code is available at https://github.com/chequanghuy/TwinLiteNetPlus.

2401.13568 2026-04-28 cs.RO

Investigating the Performance of Soft Robotic Adaptive Feet with Longitudinal and Transverse Arches

Anna Pace, Giorgio Grioli, Alice Ghezzi, Antonio Bicchi, Manuel G. Catalano

Comments Submitted to Frontiers in Robotics and AI

Abstract

Biped robots usually adopt feet with a rigid structure that simplifies walking on flat grounds and yet hinders ground adaptation in unstructured environments, thus jeopardizing stability. We recently explored in the SoftFoot the idea of adapting a robotic foot to ground irregularities along the sagittal plane. Building on the previous results, we propose in this paper a novel robotic foot able to adapt both in the sagittal and frontal planes, similarly to the human foot. It features five parallel modules with intrinsic longitudinal adaptability that can be combined in many possible designs through optional rigid or elastic connections. By following a methodological design approach, we narrow down the design space to five candidate foot designs and implement them on a modular system. Prototypes are tested experimentally via controlled application of force, through a robotic arm, onto a sensorized plate endowed with different obstacles. Their performance is compared, using also a rigid foot and the previous SoftFoot as a baseline. Analysis of footprint stability shows that the introduction of the transverse arch, by elastically connecting the five parallel modules, is advantageous for obstacle negotiation, especially when obstacles are located under the forefoot. In addition to biped robots' locomotion, this finding might also benefit lower-limb prostheses design.

2401.03563 2026-04-28 cs.CL cs.IR

Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning

Yingqian Min, Kun Zhou, Dawei Gao, Wayne Xin Zhao, He Hu, Yaliang Li

Comments: 14 pages, work in progress

Details
Abstract (English)

Recently, multi-task instruction tuning has been applied to sentence representation learning, endowing models with the capability to generate task-specific representations under the guidance of task instructions and yielding strong generalization to new tasks. However, these methods mostly neglect the potential interference across different tasks and instances, which may affect the training and convergence of the model. To address this, we propose a data curriculum method, namely Data-CUBE, that arranges the order of all the multi-task training data to minimize the interference risks from these two views. At the task level, we aim to find the optimal task order that minimizes the total cross-task interference risk, which is exactly the traveling salesman problem; hence we use a simulated annealing algorithm to find a solution. At the instance level, we measure the difficulty of all instances per task, then divide them into easy-to-difficult mini-batches for training. Experiments on MTEB sentence representation evaluation tasks show that our approach can boost the performance of state-of-the-art methods. Our code and data are publicly available at https://github.com/RUCAIBox/Data-CUBE.
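
The two curriculum levels described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the interference matrix `RISK`, the cooling schedule, the segment-reversal move, and both function names are assumptions made for the example, and the order is scored as a path (the sum of risks between consecutive tasks).

```python
import math
import random


def order_cost(order, risk):
    """Total interference between consecutive tasks in the given order."""
    return sum(risk[a][b] for a, b in zip(order, order[1:]))


def anneal_task_order(risk, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over task permutations (hypothetical settings)."""
    rng = random.Random(seed)
    n = len(risk)
    order = list(range(n))
    rng.shuffle(order)
    cost = order_cost(order, risk)
    best_order, best_cost = order[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        i, j = sorted(rng.sample(range(n), 2))
        # Candidate move: reverse the segment between positions i and j.
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        cand_cost = order_cost(candidate, risk)
        delta = cand_cost - cost
        # Always accept improvements; accept worse orders with
        # Boltzmann probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            order, cost = candidate, cand_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
    return best_order, best_cost


def easy_to_hard_batches(difficulties, batch_size):
    """Sort instance indices by difficulty and cut them into mini-batches."""
    ranked = sorted(range(len(difficulties)), key=lambda k: difficulties[k])
    return [ranked[s:s + batch_size] for s in range(0, len(ranked), batch_size)]


# Hypothetical symmetric interference risks between four tasks.
RISK = [
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.7, 0.2],
    [0.1, 0.7, 0.0, 0.6],
    [0.8, 0.2, 0.6, 0.0],
]
BEST_ORDER, BEST_COST = anneal_task_order(RISK)
BATCHES = easy_to_hard_batches([0.5, 0.1, 0.9, 0.3], batch_size=2)
```

With only four tasks the best order could be found by enumeration; annealing becomes worthwhile when the number of tasks makes exhaustive search impractical.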

2312.08410 2026-04-28 cs.LG math.PR stat.ML

Universal approximation property of Banach space-valued random feature models including random neural networks

Ariel Neufeld, Philipp Schmocker

Comments: 52 pages, 4 figures, 4 tables

Details
Abstract (English)

We introduce a Banach space-valued extension of random feature learning, a data-driven supervised machine learning technique for large-scale kernel approximation. By randomly initializing the feature maps, only the linear readout needs to be trained, which reduces the computational complexity substantially. Viewing random feature models as Banach space-valued random variables, we prove a universal approximation result in the corresponding Bochner space. Moreover, we derive approximation rates and an explicit algorithm to learn an element of the given Banach space by such models. The framework of this paper includes random trigonometric/Fourier regression and in particular random neural networks which are single-hidden-layer feedforward neural networks whose weights and biases are randomly initialized, whence only the linear readout needs to be trained. For the latter, we can then lift the universal approximation property of deterministic neural networks to random neural networks, even within function spaces over non-compact domains, e.g., weighted spaces, $L^p$-spaces, and (weighted) Sobolev spaces, where the latter includes the approximation of the (weak) derivatives. In addition, we analyze when the training costs for approximating a given function grow polynomially in both the input/output dimension and the reciprocal of a pre-specified tolerated approximation error. Furthermore, we demonstrate in a numerical example the empirical advantages of random feature models over their deterministic counterparts.
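
The core idea of training only the linear readout can be illustrated with a minimal scalar-valued sketch. It is far simpler than the Banach space-valued setting of the paper: the tanh activation, the sampling distributions, the ridge penalty, and the target sin(x) are all assumptions chosen for the example.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def random_features(x, weights, biases):
    """Hidden layer with fixed random weights and biases (never trained)."""
    return [math.tanh(w * x + b) for w, b in zip(weights, biases)]


def fit_readout(xs, ys, weights, biases, ridge=1e-3):
    """Ridge-regularized least squares for the linear readout only."""
    phi = [random_features(x, weights, biases) for x in xs]
    k = len(weights)
    gram = [[sum(p[i] * p[j] for p in phi) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(p[i] * y for p, y in zip(phi, ys)) for i in range(k)]
    return solve_linear(gram, rhs)


def predict(x, weights, biases, readout):
    feats = random_features(x, weights, biases)
    return sum(c * f for c, f in zip(readout, feats))


rng = random.Random(0)
WEIGHTS = [rng.gauss(0.0, 2.0) for _ in range(10)]   # random, then frozen
BIASES = [rng.uniform(-3.0, 3.0) for _ in range(10)]  # random, then frozen
XS = [-3.0 + 6.0 * i / 59 for i in range(60)]
YS = [math.sin(x) for x in XS]
READOUT = fit_readout(XS, YS, WEIGHTS, BIASES)
MSE = sum((predict(x, WEIGHTS, BIASES, READOUT) - y) ** 2
          for x, y in zip(XS, YS)) / len(XS)
```

Because the hidden parameters stay fixed, fitting reduces to one linear solve, which is the source of the reduced computational cost mentioned in the abstract.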

2105.12708 2026-04-28 cs.CL cs.SD eess.AS

Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition

Julia Pritzen, Michael Gref, Dietlind Zühlke, Christoph Schmidt

Comments: Submitted to LREC 2022

Details
Journal ref
Proceedings of the 13th Language Resources and Evaluation Conference (2022) 3242-3249
Abstract (English)

Anglicisms are a challenge in German speech recognition. Due to their irregular pronunciation compared to native German words, automatically generated pronunciation dictionaries often include faulty phoneme sequences for Anglicisms. In this work, we propose a multitask sequence-to-sequence approach for grapheme-to-phoneme conversion to improve the phonetization of Anglicisms. We extended a grapheme-to-phoneme model with a classifier to distinguish Anglicisms from native German words. With this approach, the model learns to generate pronunciations differently depending on the classification result. We used our model to create supplementary Anglicism pronunciation dictionaries that are added to an existing German speech recognition model. Tested on a dedicated Anglicism evaluation set, we improved the recognition of Anglicisms compared to a baseline model, reducing the word error rate by 1% and the Anglicism error rate by 3%. We show that multitask learning can help solve the challenge of Anglicisms in German speech recognition.

1912.13213 2026-04-28 cs.LG math.OC stat.ML

A Modern Introduction to Online Learning

Francesco Orabona

Comments: Major update: One new chapter (Online Learning to X); massive tightening of all the math; simplification of the betting algorithm that loses a constant fraction of money; exp-concave functions are now for extended-real-valued functions; new layout for publication; added index

Details
Abstract (English)

In this book, I introduce the basic concepts of Online Learning through the modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are addressed through convex surrogate losses and randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. Finally, I also cover advanced topics, including black-box reductions, saddle-point optimization, sequential investment, and non-stationary forms of regret analysis. The book concludes with a selection of applications of online learning to domains far from it, such as generalization theory and concentration inequalities. I tried to maintain an informal, but mathematically serious, tone throughout the book. No prior knowledge of convex analysis is required. Moreover, all the included proofs have been carefully chosen to be as simple and as short as possible. This also means that sometimes I have added one or two additional assumptions, just to simplify the proofs.
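
Regret minimization can be made concrete with a minimal sketch of projected online gradient descent, which coincides with Online Mirror Descent under a squared Euclidean regularizer. The one-dimensional domain [-1, 1], the squared losses, the random targets, and the step size D/(G sqrt(t)) are assumptions chosen for illustration and are not taken from the book.

```python
import math
import random


def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi]."""
    return min(hi, max(lo, x))


def online_gradient_descent(targets, diameter=2.0, grad_bound=4.0):
    """Play x_t, suffer (x_t - z_t)^2, then take a projected gradient step."""
    x, total_loss = 0.0, 0.0
    for t, z in enumerate(targets, start=1):
        total_loss += (x - z) ** 2          # loss is revealed after playing x
        grad = 2.0 * (x - z)                # |grad| <= 4 on [-1, 1]
        eta = diameter / (grad_bound * math.sqrt(t))
        x = project(x - eta * grad)
    return total_loss


rng = random.Random(0)
TARGETS = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ALG_LOSS = online_gradient_descent(TARGETS)

# Best fixed point in hindsight: the mean minimizes the sum of squares.
best_fixed = sum(TARGETS) / len(TARGETS)
BEST_LOSS = sum((best_fixed - z) ** 2 for z in TARGETS)
REGRET = ALG_LOSS - BEST_LOSS

# Classical guarantee for this step size: regret <= 1.5 * G * D * sqrt(T).
BOUND = 1.5 * 4.0 * 2.0 * math.sqrt(len(TARGETS))
```

Regret compares the learner's cumulative loss with that of the best fixed decision chosen in hindsight; a bound growing like sqrt(T) means the average regret per round vanishes as T grows.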

2604.24242 2026-04-28 cs.RO

OpenPodcar2: a robust, ROS2 vehicle for self-driving research

Rakshit Soni, Chris Waltham, Md Umar Ibrahim, Mark Crampton, Charles Fox

Details
Abstract (English)

OpenPodcar2 is a robust, ROS2-interfaced, low-cost, open source hardware and software autonomous vehicle platform based on an off-the-shelf, hard-canopy mobility scooter donor vehicle. It is a modification of the previous OpenPodcar design, extending it with robust electronics and ROS2 interfacing to enable both research and potential deployment use cases. The platform consists of (a) hardware components, documented as a bill of materials and build instructions; (b) integration with the general-purpose OSH R4 mechatronics board and a Gazebo simulation of the vehicle, both presenting a common ROS2 interface; and (c) higher-level ROS2 software implementations and configurations of standard robot autonomous planning and control, including the nav2 stack, which performs SLAM and enacts commands to drive the vehicle from a current pose to a desired pose around obstacles. OpenPodcar2 can transport a human passenger or similar load at speeds up to 15 km/h, for example as a last-mile autonomous taxi service or to transport delivery containers around a city center. It is small and safe enough to be parked in a standard research lab, yet robust enough for some deployment cases. Total build cost was around 7,000 USD from new components, or 2,000 USD with a used donor vehicle. OpenPodcar2 thus provides a balance between real-world utility, safety, cost and robustness for research use.

2604.24238 2026-04-28 cs.LG

GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models

Yiming Zhang, Sitong Liu, Ke Li, Zhihong Wu, Alex Cloninger, Melvin Leok

Details
Abstract (English)

Diffusion models are a leading paradigm for data generation, but training-free editing typically re-runs the full denoising trajectory for every edit strength, making iterative refinement expensive. To address this issue, we instead edit near the data manifold, where small local updates can replace repeated re-synthesis. To enable this, we estimate a local manifold tangent space directly from perturbed samples and prove that this sample-based estimator closely approximates the true tangent. Building on this guarantee, we devise a Jacobian-free algorithm that constructs a tangent frame via small perturbations to the initial noise and alternates small tangent moves with diffusion-based projections. Updates within this frame follow principled on-manifold directions while suppressing off-manifold drift, enabling fine-grained edits without full re-diffusion or additional training. Edit strength is controlled by the number of steps for rapid, continuous adjustments that preserve fidelity and plug into existing samplers. Empirically, the resulting tangent directions yield smooth, semantic unsupervised traversals and effective CLIP-guided optimization, demonstrating practical interactive continuous editing.

2604.24235 2026-04-28 cs.CV

Touchless Intraoperative Image Access System Based on Vision-Based Hand Tracking

Yin Lin, Domenico Aquino, Alberto Redaelli, Massimiliano Del Bene, Riccardo Barbieri, Simona Ferrante

Details
Journal ref
2026 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
Abstract (English)

Touchless interaction with medical images is becoming increasingly important in the surgical field, where sterility and continuity of the operational workflow are essential requirements. This work presents a vision-based system for intraoperative navigation of medical images through hand gestures acquired using a single RGB camera. Unlike many existing solutions, the system does not require additional hardware or user-specific training. Hand tracking is performed in real time using MediaPipe Hands, which provides a 2.5D estimation of hand landmarks. Simple and intuitive gestures are then mapped to translation, rotation, and zoom commands, enabling continuous and natural interaction with the image viewer. The system architecture is independent of the visualization software; for simplicity of implementation, it was integrated with PyVista in this study. Performance was evaluated through frame-level logging and quantitative analysis of latency, stability, and interaction robustness metrics. Experimental results highlight real-time behavior, with reduced latencies and stable control, in line with the requirements of fluid interaction. The system demonstrates the feasibility of a low-cost touchless solution for intraoperative access to medical images, laying the groundwork for future clinical evaluations.

2604.24234 2026-04-28 cs.CV

Graph-augmented Segmentation of Complex Shapes in Laser Powder bed Fusion for Enhanced In Situ Inspection

Stefano Raimondo, Matteo Bugatti, Marco Grasso

Comments: Submitted to IEEE Transactions on Automation Science and Engineering (T-ASE)

Details
Abstract (English)

The technological maturity of in situ inspection and monitoring methods in additive manufacturing is steadily increasing, enabling more efficient and practical qualification procedures. In this context, image segmentation of powder bed images in Laser Powder Bed Fusion (L-PBF) has been investigated by various authors, leveraging both edge detection and machine learning approaches to identify deviations from nominal geometry. Despite these developments, several challenges remain, including the sensitivity of segmentation performance to industrial illumination conditions and layer-to-layer variability in pixel intensity patterns. The study addresses these limitations by proposing a graph-augmented segmentation approach. The underlying principle consists of preserving the geometrical information at a global level rather than at pixel-wise level, modeling dependencies and relational information among spatial regions with a Graph Neural Network bottleneck embedded into a U-Net architecture. This allows enhancing the consistency and accuracy of the geometry reconstruction in the presence of spatial and layer-wise photometric variability systematically faced in real data. The method is evaluated against benchmark techniques for the in situ reconstruction of lattice structures produced by L-PBF, demonstrating its potential as a scalable solution for robust in situ inspection and geometric verification in industrial environments.

2604.24230 2026-04-28 cs.CV

Radiomics- and Clinical Feature-Driven Prediction of Volumetric Response in Skull-Base Meningioma after CyberKnife Radiosurgery

Yin Lin, Elena De Martin, Giacomo Conte, Domenico Aquino, Cristiana Pedone, Alberto Redaelli, Riccardo Barbieri, Laura Fariselli, Simona Ferrante

Details
Journal ref
2026 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
Abstract (English)

Skull-base meningiomas are often characterized by favorable long-term prognosis, yet their anatomical complexity and proximity to critical neurovascular structures make treatment selection challenging. Stereotactic radiosurgery with CyberKnife represents an effective therapeutic option when surgical resection is not feasible; however, not all patients benefit equally from this treatment. Early identification of patients likely to respond to radiosurgery remains an open clinical problem. In this study, we propose a radiomics- and clinical feature-driven framework for predicting volumetric response in skull-base meningiomas treated with CyberKnife. Unlike most existing approaches that focus on progression-free survival or recurrence, our method targets volumetric response as an indicator of treatment efficacy. Pre-treatment MRI images from 104 patients were processed to extract radiomic features, which were combined with clinical variables and analyzed using six models. To ensure methodological rigor, the entire modeling process was implemented within a nested cross-validation scheme. Among the evaluated models, TabPFN achieved the best overall performance, with an AUC of 0.81 and consistently favorable classification metrics. These results suggest that advanced machine learning architectures, when combined with robust validation strategies, can effectively capture patterns associated with treatment response even in small-sample, high-dimensional settings.

2604.24224 2026-04-28 cs.LG

IMPA-Net: Meteorology-Aware Multi-Scale Attention and Dynamic Loss for Extreme Convective Radar Nowcasting

Haofei Cui, Guangxin He, Juanzhen Sun, Jingjia Luo, Haonan Chen, Xiaoran Zhuang, Mingxuan Chen, Xian Xiao

Details
Abstract (English)

Short-range prediction of convective precipitation from weather radar observations is essential for severe weather warnings. However, deep learning models trained with pixel-wise error metrics tend to produce overly smooth forecasts that suppress intense echoes critical for hazard detection. This issue is exacerbated by insufficient multi-scale feature interaction and suboptimal fusion of heterogeneous geophysical inputs. We propose IMPA-Net (Integrated Multi-scale Predictive Attention Network), a deterministic 0-2 hour nowcasting framework that addresses these limitations through meteorologically-informed designs at the input, architecture, and loss function levels. A parameter-free Spatial Mixer reorganizes heterogeneous input channels at the mesoscale-$\gamma$ neighborhood (~2 km) via deterministic channel permutation, providing a structured cross-field prior. An integrated multi-scale predictive attention module serves as the spatiotemporal translator, capturing dynamics from mesoscale-$\beta$ to mesoscale-$\gamma$ scales. A Meteorologically-Aware Dynamic Loss employs three-level asymmetric weighting (adapting across training epochs, storm intensity, and forecast lead time) to counteract regression-to-the-mean. Evaluated against seven baselines on a multi-source radar dataset over eastern China, IMPA-Net raises the Heidke Skill Score at $\geq$45 dBZ from 0.049 (SimVP baseline) to 0.143 under matched settings. Relative to pySTEPS, it provides a better trade-off between severe-event detection and false-alarm control. Spectral analysis confirms preserved energy across mesoscale bands where competing methods show progressive smoothing. These improvements are shown within a single domain and convective regime; generalizability to other orographic and climatic regions remains to be tested.
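
The Heidke Skill Score reported above can be computed from a 2x2 contingency table of thresholded forecasts and observations. The sketch below uses the standard formula HSS = 2(ad - bc) / [(a + c)(c + d) + (a + b)(b + d)], with a = hits, b = false alarms, c = misses, and d = correct negatives. The function names and the six reflectivity pairs are hypothetical, and returning 0.0 for a degenerate table is a choice made for this example.

```python
def contingency(forecast, observed, threshold):
    """Count hits, false alarms, misses, and correct negatives."""
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, observed):
        f_yes, o_yes = f >= threshold, o >= threshold
        if f_yes and o_yes:
            hits += 1
        elif f_yes:
            false_alarms += 1
        elif o_yes:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def heidke_skill_score(forecast, observed, threshold):
    """HSS: 1 is a perfect forecast, 0 is no better than chance."""
    a, b, c, d = contingency(forecast, observed, threshold)
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    # Degenerate table (for example, no events at all): report 0.0 here.
    return 2.0 * (a * d - b * c) / denom if denom else 0.0


# Hypothetical reflectivity values in dBZ at six grid points.
FORECAST = [50, 47, 30, 10, 20, 5]
OBSERVED = [48, 20, 46, 12, 30, 0]
TABLE = contingency(FORECAST, OBSERVED, threshold=45)
HSS = heidke_skill_score(FORECAST, OBSERVED, threshold=45)
```

Here the table holds one hit, one false alarm, one miss, and three correct negatives, giving HSS = 2(3 - 1) / (8 + 8) = 0.25.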