arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3194
2512.24310 2026-03-17 cs.RO

World In Your Hands: A Large-Scale and Open-Source Ecosystem for Learning Human-Centric Manipulation in the Wild

Yupeng Zheng, Jichao Peng, Weize Li, Yuhang Zheng, Xiang Li, Yujie Jin, Julong Wei, Guanhua Zhang, Ruiling Zheng, Ming Cao, Songen Gu, Zhenhong Zou, Kaige Li, Ke Wu, Mingmin Yang, Jiahao Liu, Pengfei Li, Hengjie Si, Feiyu Zhu, Wang Fu, Likun Wang, Ruiwen Yao, Jieru Zhao, Yilun Chen, Wenchao Ding

Comments This dataset represents the first large-scale collection of real-world, human-centric multimodal data integrating vision, language, tactile sensing, and action (VLTA) Github: https://github.com/tars-robotics/World-In-Your-Hands

详情
英文摘要

We introduce World In Your Hands (WIYH), a large-scale open-source ecosystem comprising over 1,000 hours of human manipulation data collected in-the-wild with millimeter-scale motion accuracy. Specifically, WIYH includes (1) the Oracle Suite, a wearable data collection kit with an auto-labeling pipeline for accurate motion capture; (2) the WIYH Dataset, featuring over 1,000 hours of multimodal manipulation data across hundreds of skills in diverse real-world scenarios; and (3) extensive annotations and benchmarks supporting tasks from perception to action. Furthermore, experiments based on the WIYH ecosystem show that integrating WIYH's human-centric data improves robotic manipulation success rates from 8% to 60% in cluttered scenes. World In Your Hands provides a foundation for advancing human-centric data collection and cross-embodiment policy learning. All data and hardware design will be open-source.

2512.22955 2026-03-17 cs.CL

Diversity or Precision? A Deep Dive into Next Token Prediction

Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu

详情
英文摘要

Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.

2512.22705 2026-03-17 cs.CL cs.AI cs.LG

GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages

Ahmed Abdullah, Sana Fatima, Haroon Mahmood

Comments Accepted and presented at the 15th International Arab Conference on Information Technology (ICAIT); proceedings not yet published

详情
Journal ref
Proceedings of the 25th International Arab Conference on Information Technology (ACIT), 2025
英文摘要

Hope speech has been relatively underrepresented in Natural Language Processing (NLP). Current studies are largely focused on English, which has resulted in a lack of resources for low-resource languages such as Urdu. As a result, the creation of tools that facilitate positive online communication remains limited. Although transformer-based architectures have proven to be effective in detecting hate and offensive speech, little has been done to apply them to hope speech or, more generally, to test them across a variety of linguistic settings. This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained transformer models such as XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT, we apply simple preprocessing and train classifiers for improved results. Evaluations on the PolyHope-M 2025 benchmark demonstrate strong performance, achieving F1-scores of 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with similarly competitive results in Spanish, German, and English. These results highlight the possibility of implementing existing multilingual models in low-resource environments, thus making it easier to identify hope speech and helping to build a more constructive digital discourse.

2512.22213 2026-03-17 cs.LG cs.AI cs.CL

On the Existence and Behavior of Secondary Attention Sinks

Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao

详情
英文摘要

Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, they were found to exhibit properties analogous to the BOS token. Specifically, they emerge at the same layer, persist throughout the network and draw a large amount of attention mass. Whereas, we find the existence of secondary sinks that arise primarily in middle layers and can persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules; these MLPs map token representations to vectors that align with the direction of the primary sink of that layer. (2) The $\ell_2$-norm of these vectors determines the sink score of the secondary sink, and also the number of layers it lasts for, thereby leading to different impacts on the attention mechanisms accordingly. (3) The primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B. We open-sourced our findings at github.com/JeffreyWong20/Secondary-Attention-Sinks.

2512.20348 2026-03-17 cs.LG

Physics-guided Neural Network-based Shaft Power Prediction for Vessels

Dogan Altan, Hamza Haruna Mohammed, Glenn Terje Lines, Dusica Marijan, Arnbjørn Maressa

Comments This work has been accepted for publication in the 11th Special Session on Intelligent Data Mining at IEEE BigData 2025. The final published version of this work will be available through IEEE

详情
Journal ref
2025 IEEE International Conference on Big Data (BigData), 2025, pp. 2715-2723
英文摘要

Optimizing maritime operations, particularly fuel consumption for vessels, is crucial, considering its significant share in global trade. As fuel consumption is closely related to the shaft power of a vessel, predicting shaft power accurately is a crucial problem that requires careful consideration to minimize costs and emissions. Traditional approaches, which incorporate empirical formulas, often struggle to model dynamic conditions, such as sea conditions or fouling on vessels. In this paper, we present a hybrid, physics-guided neural network-based approach that utilizes empirical formulas within the network to combine the advantages of both neural networks and traditional techniques. We evaluate the presented method using data obtained from four similar-sized cargo vessels and compare the results with those of a baseline neural network and a traditional approach that employs empirical formulas. The experimental results demonstrate that the physics-guided neural network approach achieves lower mean absolute error, root mean square error, and mean absolute percentage error for all tested vessels compared to both the empirical formula-based method and the base neural network.

2512.20272 2026-03-17 cs.LG

HGAN-SDEs: Learning Neural Stochastic Differential Equations with Hermite-Guided Adversarial Training

Yuanjian Xu, Yuan Shuai, Jianing Hao, Guang Zhang

详情
英文摘要

Neural Stochastic Differential Equations (Neural SDEs) provide a principled framework for modeling continuous-time stochastic processes and have been widely adopted in fields ranging from physics to finance. Recent advances suggest that Generative Adversarial Networks (GANs) offer a promising solution to learning the complex path distributions induced by SDEs. However, a critical bottleneck lies in designing a discriminator that faithfully captures temporal dependencies while remaining computationally efficient. Prior works have explored Neural Controlled Differential Equations (CDEs) as discriminators due to their ability to model continuous-time dynamics, but such architectures suffer from high computational costs and exacerbate the instability of adversarial training. To address these limitations, we introduce HGAN-SDEs, a novel GAN-based framework that leverages Neural Hermite functions to construct a structured and efficient discriminator. Hermite functions provide an expressive yet lightweight basis for approximating path-level dynamics, enabling both reduced runtime complexity and improved training stability. We establish the universal approximation property of our framework for a broad class of SDE-driven distributions and theoretically characterize its convergence behavior. Extensive empirical evaluations on synthetic and real-world systems demonstrate that HGAN-SDEs achieve superior sample quality and learning efficiency compared to existing generative models for SDEs

2512.20074 2026-03-17 cs.AI cs.CL

Reason2Decide: Rationale-Driven Multi-Task Learning

H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel

Comments Uploaded the camera-version of the paper accepted to LREC 2026

详情
英文摘要

Despite the wide adoption of Large Language Models (LLM)s, clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.

2512.19554 2026-03-17 cs.LG cs.AI

CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning

Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin, Yingying Zhu, Xiaojun Chang, Xiaodan Liang

详情
英文摘要

Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.

2512.18991 2026-03-17 cs.CV cs.AI

Training-Free Global Geometric Association for 4D LiDAR Panoptic Segmentation

Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin, Sangpil Kim

详情
英文摘要

Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce \textsc{Geo-4D}, a simple yet effective training-free framework that unifies spatial and temporal reasoning, enabling holistic LiDAR perception over long time horizons. Specifically, we propose a global geometric association strategy that establishes consistent instance correspondences by estimating an optimal transformation between instance-level point sets. To mitigate instability caused by structural inconsistencies in point cloud observations, we propose a global geometry-aware soft matching mechanism that enforces spatially coherent point-wise correspondences grounded in the spatial distribution of instance point sets. Furthermore, our carefully designed pipeline, which considers three instance types-static, dynamic, and missing-offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.

2512.17945 2026-03-17 cs.LG q-fin.RM q-fin.ST

What's the Price of Monotonicity? A Multi-Dataset Benchmark of Monotone-Constrained Gradient Boosting for Credit PD

Petr Koklev

Comments 56 pages. This version: December 2025. Includes multi-dataset benchmark results and diagnostic analyses; replication code and configuration files are available via the GitHub repository referenced in the paper

详情
英文摘要

Financial institutions face a trade-off between predictive accuracy and interpretability when deploying machine learning models for credit risk. Monotonicity constraints align model behavior with domain knowledge, but their performance cost - the price of monotonicity - is not well quantified. This paper benchmarks monotone-constrained versus unconstrained gradient boosting models for credit probability of default across five public datasets and three libraries. We define the Price of Monotonicity (PoM) as the relative change in standard performance metrics when moving from unconstrained to constrained models, estimated via paired comparisons with bootstrap uncertainty. In our experiments, PoM in AUC ranges from essentially zero to about 2.9 percent: constraints are almost costless on large datasets (typically less than 0.2 percent, often indistinguishable from zero) and most costly on smaller datasets with extensive constraint coverage (around 2-3 percent). Thus, appropriately specified monotonicity constraints can often deliver interpretability with small accuracy losses, particularly in large-scale credit portfolios.

2512.16357 2026-03-17 cs.CV

GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction

Tao Hu, Weiyu Zhou, Yanjie Tu, Peng Wu, Wei Dong, Qingsen Yan, Yanning Zhang

详情
英文摘要

Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100 faster than previous LDM-based methods.

2512.15746 2026-03-17 cs.LG cond-mat.mtrl-sci

A Unified Generative-Predictive Framework for Deterministic Inverse Design

Reza T. Batley, Sourav Saha

详情
Journal ref
AIAA SciTech Forum, 2026
英文摘要

Inverse design of heterogeneous material microstructures is a fundamentally ill-posed and famously computationally expensive problem. This is exacerbated by the high-dimensional design spaces associated with finely resolved images, multimodal input property streams, and a highly nonlinear forward physics. Whilst modern generative models excel at accurately modeling such complex forward behavior, most of them are not intrinsically structured to support fast, stable \emph{deterministic} inversion with a physics-informed bias. This work introduces Janus, a unified generative-predictive framework to address this problem. Janus couples a deep encoder-decoder architecture with a predictive KHRONOS head, a separable neural architecture. Topologically speaking, Janus learns a latent manifold simultaneously isometric for generative inversion and pruned for physical prediction; the joint objective inducing \emph{disentanglement} of the latent space. Janus is first validated on the MNIST dataset, demonstrating high-fidelity reconstruction, accurate classification and diverse generative inversion of all ten target classes. It is then applied to the inverse design of heterogeneous microstructures labeled with thermal conductivity. It achieves a forward prediction accuracy $R^2=0.98$ (2\% relative error) and sub-5\% pixelwise reconstruction error. Inverse solutions satisfy target properties to within $1\%$ relative error. Inverting a sweep through properties reveal smooth traversal of the latent manifold, and UMAP visualization confirms the emergence of a low-dimensional, disentangled manifold. By unifying prediction and generation within a single latent space, Janus enables real-time, physics-informed inverse microstructure generation at a lower computational cost typically associated with classical optimization-based approaches.

2512.15206 2026-03-17 cs.LG

Chorus: Harmonizing Context and Sensing Signals for Data-Free Model Customization in IoT

Liyu Zhang, Yejia Liu, Kwun Ho Liu, Runxi Huang, Xiaomin Ouyang

详情
英文摘要

A key bottleneck toward scalable IoT sensing is how to efficiently adapt AI models to new deployment conditions. In real-world IoT systems, sensor data is collected under diverse contexts, such as sensor placements or ambient environments, which alter signal patterns and degrade downstream performance. Traditional domain adaptation and generalization methods often ignore such contextual information or incorporate it in overly simplistic ways, making them ineffective under unseen context shifts after deployment. In this paper, we propose Chorus, a context-aware, data-free model customization approach that adapts models to unseen deployment conditions without requiring target-domain data. The key idea is to learn context representations that capture how contextual factors influence sensor data, and then use these representations as structured priors for context-aware customization under unseen shifts. Specifically, Chorus learns a shared sensor-context latent space through bidirectional cross-modal reconstruction on unlabeled sensor-context pairs, and regularizes the context embedding space to obtain compact and generalizable context representations. Building on the aligned representations, Chorus trains a lightweight gated head with limited labeled data to exploit context priors during inference, and introduces a context-caching mechanism that reuses cached context representations when no context shift is detected, thereby reducing inference overhead on smartphones. Experiments on IMU, speech enhancement, and WiFi sensing tasks under diverse context shifts show that Chorus outperforms state-of-the-art baselines by up to 20.2% in unseen contexts, with cached inference latency close to sensor-only deployment, while maintaining stable performance under continuous context transitions. A video demonstration of Chorus's performance in real world is available at https://youtu.be/ZBdro0jPNkE.

2512.14086 2026-03-17 cs.LG cs.NA math.NA

Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization

Boyuan Yao, Dingcheng Luo, Lianghao Cao, Nikola Kovachki, Thomas O'Leary-Roseberry, Omar Ghattas

详情
英文摘要

We present approximation theories and efficient training methods for derivative-informed Fourier neural operators (DIFNOs) with applications to PDE-constrained optimization. A DIFNO is an FNO trained by minimizing its prediction error jointly on output and Fréchet derivative samples of a high-fidelity operator (e.g., a parametric PDE solution operator). As a result, a DIFNO can closely emulate not only the high-fidelity operator's response but also its sensitivities. To motivate the use of DIFNOs instead of conventional FNOs as surrogate models, we show that accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Fréchet derivatives. Then, we establish (i) simultaneous universal approximation of continuously differentiable operators and their Fréchet derivatives by FNOs on compact sets, and (ii) universal approximation of continuously differentiable operators by FNOs in weighted Sobolev spaces with input measures that have unbounded supports. Our theoretical results certify the capability of FNOs for accurate derivative-informed operator learning and for the solution of PDE-constrained optimization problems. Furthermore, we develop efficient training schemes that leverage dimensionality reduction and multi-resolution techniques to significantly reduce memory and computational costs in Fréchet derivative learning. Numerical examples on nonlinear diffusion--reaction, Helmholtz, and Navier--Stokes equations demonstrate that DIFNOs are superior in sample complexity for operator learning and solving infinite-dimensional PDE-constrained inverse problems, achieving high accuracy at low training sample sizes.

2512.12898 2026-03-17 cs.CV cs.GR cs.LG

Towards High-Fidelity Gaussian Splatting with Queried-Convolution Neural Networks

Abhinav Kumar, Tristan Aumentado-Armstrong, Lazar Valkov, Gopal Sharma, Alex Levinshtein, Radek Grzeszczuk, Suren Kumar

Comments 38 pages, 8 figures, Project Page: https://abhi1kumar.github.io/qonvolution/

详情
英文摘要

Gaussian Splatting has revolutionized the field of Novel View Synthesis (NVS) with faster training and real-time rendering. However, its reconstruction fidelity still trails behind the powerful radiance models such as Zip-NeRF. Motivated by our theoretical result that both queries (such as coordinates) and neighborhood are important to learn high-fidelity signals, this paper proposes Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolutions convolve a low-fidelity signal with queries to output residual and achieve high-fidelity reconstruction. We empirically demonstrate that combining Gaussian splatting with Qonvolution neural networks (QNNs) results in state-of-the-art NVS on real-world scenes, even outperforming Zip-NeRF on image fidelity. QNNs also enhance performance of 1D regression, 2D regression and 2D super-resolution tasks.

2512.12459 2026-03-17 cs.CV cs.GR

From Particles to Fields: Reframing Photon Mapping with Continuous Gaussian Photon Fields

Jiachen Tao, Benjamin Planche, Van Nguyen Nguyen, Junyi Wu, Yuchun Liu, Haoxuan Wang, Zhongpai Gao, Gengyu Zhang, Meng Zheng, Feiran Wang, Anwesa Choudhuri, Zhenghao Zhao, Weitai Kang, Terrence Chen, Yan Yan, Ziyan Wu

详情
英文摘要

Accurately modeling light transport is essential for realistic image synthesis. Photon mapping provides physically grounded estimates of complex global illumination effects such as caustics and specular-diffuse interactions, yet its per-view radiance estimation remains computationally inefficient when rendering multiple views of the same scene. The inefficiency arises from independent photon tracing and stochastic kernel estimation at each viewpoint, leading to inevitable redundant computation. To accelerate multi-view rendering, we reformulate photon mapping as a continuous and reusable radiance function. Specifically, we introduce the Gaussian Photon Field (GPF), a learnable representation that encodes photon distributions as anisotropic 3D Gaussian primitives parameterized by position, rotation, scale, and spectrum. GPF is initialized from physically traced photons in the first SPPM iteration and optimized using multi-view supervision of final radiance, distilling photon-based light transport into a continuous field. Once trained, the field enables differentiable radiance evaluation along camera rays without repeated photon tracing or iterative refinement. Extensive experiments on scenes with complex light transport, such as caustics and specular-diffuse interactions, demonstrate that GPF attains photon-level accuracy while reducing computation by orders of magnitude, unifying the physical rigor of photon-based rendering with the efficiency of neural scene representations.

2512.11542 2026-03-17 cs.CV

Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

Hossein Shahabadi, Niki Sepasian, Arash Marioriyad, Ali Sharifi-Zarchi, Mahdieh Soleymani Baghshah

Comments Accepted at the ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond

详情
英文摘要

Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$α$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$α$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of the T2I model.

2512.10793 2026-03-17 cs.CL cs.AI

LabelFusion: Fusing Large Language Models with Transformer Encoders for Robust Financial News Classification

Michael Schlee, Christoph Weisser, Timo Kivimäki, Melchizedek Mashiku, Benjamin Saefken

详情
英文摘要

Financial news plays a central role in shaping investor sentiment and short-term dynamics in commodity markets. Many downstream financial applications, such as commodity price prediction or sentiment modeling, therefore rely on the ability to automatically identify news articles relevant to specific assets. However, obtaining large labeled corpora for financial text classification is costly, and transformer-based classifiers such as RoBERTa often degrade significantly in low-data regimes. Our results show that appropriately prompted out-of-the-box Large Language Models (LLMs) achieve strong performance even in such settings. Furthermore, we propose LabelFusion, a hybrid architecture that combines the output of a prompt-engineered LLM with contextual embeddings produced by a fine-tuned RoBERTa encoder through a lightweight Multilayer Perceptron (MLP) voting layer. Evaluated on a ten-class multi-label subset of the Reuters-21578 corpus, LabelFusion achieves a macro F1 score of 96.0% and an accuracy of 92.3% when trained on the full dataset, outperforming both standalone RoBERTa (F1 94.6%) and the standalone LLM (F1 93.9%). In low- to mid-data regimes, however, the LLM alone proves surprisingly competitive, achieving an F1 score of 75.9% even in a zero-shot setting and consistently outperforming LabelFusion until approximately 80% of the training data is available. These results suggest that LLM-only prompting is the preferred strategy under annotation constraints, whereas LabelFusion becomes the most effective solution once sufficient labeled data is available to train the encoder component. The code is available in an anonymized repository.

2512.09944 2026-03-17 cs.AI cs.CV cs.LG eess.IV

Echo-CoPilot: A Multiple-Perspective Agentic Framework for Reliable Echocardiography Interpretation

Moein Heidari, Ali Mehrabian, Mohammad Amin Roohi, Wenjin Chen, David J. Foran, Jasmine Grewal, Ilker Hacihaliloglu

详情
英文摘要

Echocardiography interpretation requires integrating multi-view temporal evidence with quantitative measurements and guideline-grounded reasoning, yet existing foundation-model pipelines largely solve isolated subtasks and fail when tool outputs are noisy or values fall near clinical cutoffs. We propose Echo-CoPilot, an end-to-end agentic framework that combines a multi-perspective workflow with knowledge-graph guided measurement selection. Echo-CoPilot runs three independent ReAct-style agents, structural, pathological, and quantitative, that invoke specialized echocardiography tools to extract parameters while querying EchoKG to determine which measurements are required for the clinical question and which should be avoided. A self-contrast language model then compares the evidence-grounded perspectives, generates a discrepancy checklist, and re-queries EchoKG to apply the appropriate guideline thresholds and resolve conflicts, reducing hallucinated measurement selection and borderline flip-flops. On MIMICEchoQA, Echo-CoPilot provides higher accuracy compared to SOTA baselines and, under a stochasticity stress test, achieves higher reliability through more consistent conclusions and fewer answer changes across repeated runs. Our code is publicly available at~\href{https://github.com/moeinheidari7829/Echo-CoPilot}{\textcolor{magenta}{GitHub}}.

2512.09924 2026-03-17 cs.CV

Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo

详情
英文摘要

Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: (1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and (2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents understanding from guiding the editing process. To address this, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Aware Video Editing (RAVE) and In-Context Video-to-Video Generation (ICVG), spanning diverse reasoning dimensions across both editing and generation scenarios. Building upon this foundation, we propose ReViSE, a self-reflective learning framework that harnesses the model's internal VLM to evaluate and refine its own generation during training. Unlike prior reward-based approaches that rely on external critics, ReViSE leverages the model's internal VLM as a self-reflective evaluator, providing differentiable feedback that directly refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE enhances editing accuracy and visual fidelity, outperforming the finetuned counterpart by 10% in Overall score on the RAVE subset, demonstrating the effectiveness of self-reflective differentiable reward.

2512.04784 2026-03-17 cs.CV

PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian

详情
英文摘要

Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at https://x-gengroup.github.io/HomePage_PaCo-RL/.

2512.04305 2026-03-17 cs.CV

How (Mis)calibrated is Your Federated CLIP and What To Do About It?

Mainak Singha, Masih Aminbeidokhti, Paolo Casari, Gianni Franchi, Elisa Ricci, Subhankar Roy

Comments Preprint

详情
英文摘要

While vision-language models like CLIP have been extensively studied, their calibration, crucial for reliable predictions, has received limited attention. Although a few prior works have examined CLIP calibration in offline settings, the impact of fine-tuning CLIP in a federated learning (FL) setup remains unexplored. In this work, we investigate how FL affects CLIP calibration and propose strategies to improve reliability in this distributed setting. We first analyze Textual Prompt Tuning approaches and show that they degrade calibration metrics when operating under FL. We also evaluate existing in-training calibration techniques across four global aggregation methods, finding that they provide limited improvements. Our results suggest that the key challenge lies not only in how we aggregate or calibrate, but in which components we choose to fine-tune. Motivated by this insight, we propose $\text{FL}^2\text{oRA}$, a straightforward LoRA-based approach that naturally improves calibration in FL, and we analyze the factors behind its effectiveness. Experiments on multiple benchmarks demonstrate that $\text{FL}^2\text{oRA}$ consistently produces well-calibrated models, reducing the need for explicit calibration procedures. Codes are available at https://github.com/mainaksingha01/FL2oRA.

2512.03886 2026-03-17 cs.RO cs.SY eess.SY

A Modular Architecture Design for Autonomous Driving Racing in Controlled Environments

Brais Fontan-Costas, M. Diaz-Cacho, Ruben Fernandez-Boullon, Manuel Alonso-Carracedo, Javier Perez-Robles

详情
英文摘要

This paper presents a modular autonomous driving architecture for Formula Student Driverless competition vehicles operating in closed-circuit environments. The perception module employs YOLOv11 for real-time traffic cone detection, achieving 0.93 mAP@0.5 on the FSOCO dataset, combined with neural stereo depth estimation from a ZED 2i camera for 3D cone localization with sub-0.5 m median error at distances up to 7 m. State estimation fuses RTK-GNSS positioning and IMU measurements through an Extended Kalman Filter (EKF) based on a kinematic bicycle model, achieving centimeter-level localization accuracy with a 12 cm improvement over raw GNSS. Path planning computes the racing line via cubic spline interpolation on ordered track boundaries and assigns speed profiles constrained by curvature and vehicle dynamics. A regulated pure pursuit controller tracks the planned trajectory with a dynamic lookahead parameterized by speed error. The complete pipeline is implemented as a modular ROS 2 architecture on an NVIDIA Jetson Orin NX platform, with each subsystem deployed as independent nodes communicating through a dual-computer configuration. Experimental validation combines real-world sensor evaluation with simulation-based end-to-end testing, where realistic sensor error distributions are injected to assess system-level performance under representative conditions.

2512.01850 2026-03-17 cs.CV cs.RO

Register Any Point: Scaling 3D Point Cloud Registration by Flow Matching

Yue Pan, Tao Sun, Liyuan Zhu, Lucas Nunes, Iro Armeni, Jens Behley, Cyrill Stachniss

详情
英文摘要

Point cloud registration aligns multiple unposed point clouds into a common reference frame and is a core step for 3D reconstruction and robot localization without initial guess. In this work, we cast registration as conditional generation: a learned, continuous point-wise velocity field transports noisy points to a registered scene, from which the pose of each view is recovered. Unlike prior methods that perform correspondence matching to estimate pairwise transformations and then optimize a pose graph for multi-view registration, our model directly generates the registered point cloud, yielding both efficiency and point-level global consistency. By scaling the training data and conducting test-time rigidity enforcement, our approach achieves state-of-the-art results on existing pairwise registration benchmarks and on our proposed cross-domain multi-view registration benchmark. The superior zero-shot performance on this benchmark shows that our method generalizes across view counts, scene scales, and sensor modalities even with low overlap. Source code available at: https://github.com/PRBonn/RAP.

2511.22436 2026-03-17 cs.CV

ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection

Runzhi Deng, Yundi Hu, Xinshuang Zhang, Zhao Wang, Xixi Liu, Wang-Zhou Dai, Caifeng Shan, Fang Zhao

详情
英文摘要

Few-shot multi-class industrial anomaly detection identifies diverse defects across multiple categories using a single unified model and limited normal samples. Although vision-language models offer strong generalization, modeling multiple distinct category manifolds concurrently without actual anomalous data causes feature space collapse and cross-class interference. Consequently, existing methods often fail to balance scalability and precision, leading to either isolated single-class retraining or excessively loose decision margins. To address this limitation, we present a one-for-all learning framework called ABounD that unites semantic concept anchoring with geometric boundary optimization. This method employs two lightweight mechanisms to resolve multi-class ambiguity. First, the Dynamic Concept Fusion module generates class-adaptive semantic anchors via query-aware hierarchical calibration, disentangling overlapping category concepts. Second, using these anchors, the Adversarial Boundary Forging module constructs a tight, class-tailored decision margin by synthesizing adversarial boundary-level fence features to prevent cross-class boundary blurring. Optimized in a single stage, ABounD removes the requirement for isolated per-category retraining in few-shot settings. Experiments on seven industrial benchmarks show that the proposed method achieves state-of-the-art detection and localization performance for multi-class few-shot anomaly detection while maintaining low computational costs during training and inference.

2511.21945 2026-03-17 cs.CV

GENA3D: Generative Amodal 3D Modeling by Bridging 2D Priors and 3D Coherence

Junwei Zhou, Yu-Wing Tai

Comments 29 pages

详情
英文摘要

Generating complete 3D objects under partial occlusions (i.e., amodal scenarios) is a practically important yet challenging problem, as large portions of object geometry are unobserved in real-world scenarios. Existing approaches either operate directly in 3D, which ensures geometric consistency but often lacks generative expressiveness, or rely on 2D amodal completion, which provides strong appearance priors but does not guarantee reliable 3D structure. This raises a key question: how can we achieve both generative plausibility and geometric coherence in amodal 3D modeling? To answer this question, we introduce GENA3D (GENarative Amodal 3D), a framework that integrates learned 2D generative priors with explicit 3D geometric reasoning within a conditional 3D generation paradigm. The 2D priors enable the model to plausibly infer diverse occluded content, while the 3D representation enforces multi-view consistency and spatial validity. Our design incorporates a novel View-Wise Cross-Attention for multi-view alignment and a Stereo-Conditioned Cross-Attention to anchor generative predictions in 3D relationships. By combining generative imagination with structural constraints, GENA3D generates complete and coherent 3D objects from limited observations without sacrificing geometric fidelity. Experiments demonstrate that our method outperforms existing approaches in both synthetic and real-world amodal scenarios, highlighting the effectiveness of bridging 2D priors and 3D coherence in generating plausible and geometrically consistent 3D structures in complex environments.

2511.20629 2026-03-17 cs.CV cs.AI cs.LG

MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi

Comments CVPR 2026; Code: https://github.com/SHI-Labs/MapReduce-LoRA

详情
英文摘要

Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.

2511.18519 2026-03-17 cs.LG

CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection

Xinlin Zhuang, Yichen Li, Xiwei Liu, Haolin Yang, Yifan Lu, Ziyun Zou, Yulong Li, Huifa Li, Dongliang Chen, Qinglei Wang, Weiyang Liu, Ying Qian, Jiangming Shi, Imran Razzak

Comments Accepted by CVPR 2026

详情
英文摘要

Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by continual pre-training (CPT) on large domain-specific datasets. Yet, data itself remains an underexplored factor in this process. We revisit this task from a data-centric perspective: Can effective data selection substitute for large-scale datasets in CPT? We introduce CHIPS (Curvature-aware Hybrid Influence in Projection Subspace), which assigns each image-text pair a utility score that integrates three complementary factors aligned with three goals: faithfulness via a curvature-aware and Newton-style alignment computed in CLIP's end-point subspace; scalability via an InfoNCE-aware curvature estimator with Johnson-Lindenstrauss (JL) sketching; and retention via a selection-aware relevance weight combined with learnability to balance target adaptation against general-domain preservation. We justify this design theoretically by proving a lower-bound guarantee on the proxy's correlation with full-parameter alignment and by characterizing the bias-variance trade-offs introduced by curvature mixing and JL sketching. We evaluate CHIPS empirically across various settings: 1) CHIPS attains state-of-the-art performance among selection baselines on 17 medical benchmarks, matches full-dataset CPT with 30% of the data, and outperforms half-dataset CPT using only 10%; 2) on 31 general-domain benchmarks, CHIPS yields the least performance drop under all retention ratios.

2511.18254 2026-03-17 cs.CV

UniFlow: Zero-Shot LiDAR Scene Flow for Autonomous Vehicles

Siyi Li, Qingwen Zhang, Ishan Khatri, Kyle Vedder, Eric Eaton, Deva Ramanan, Neehar Peri

Comments Project Page: https://lisiyi777.github.io/UniFlow/

详情
英文摘要

LiDAR scene flow is the task of estimating per-point 3D motion between consecutive point clouds. Recent methods achieve centimeter-level accuracy on popular autonomous vehicle (AV) datasets, but are typically only trained and evaluated on a single sensor. In this paper, we aim to learn general motion priors that transfer to diverse and unseen LiDAR sensors. However, prior work in LiDAR semantic segmentation and 3D object detection demonstrate that naively training on multiple datasets yields worse performance than single dataset models. Interestingly, we find that this conventional wisdom does not hold for motion estimation, and that state-of-the-art scene flow methods greatly benefit from cross-dataset training without architectural modification. We posit that low-level tasks such as motion estimation may be less sensitive to sensor configuration; indeed, our analysis shows that models trained on fast-moving objects (e.g., from highway datasets) perform well on fast-moving objects, even across different datasets. Informed by our analysis, we propose UniFlow, a feedforward model that unifies and trains on multiple large-scale LiDAR scene flow datasets with diverse sensor placements and point cloud densities. Our frustratingly simple solution establishes a new state-of-the-art on Waymo and nuScenes, improving over prior work by 5.1% and 35.2% respectively. Moreover, UniFlow achieves state-of-the-art accuracy on unseen datasets like TruckScenes and AEVAScenes, outperforming prior dataset-specific models by 30.1% and 22.5% respectively.

2511.18242 2026-03-17 cs.CV

EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

Yogesh Kulkarni, Pooyan Fazli

详情
英文摘要

Egocentric video understanding requires procedural reasoning under partial observability and continuously shifting viewpoints. Current multimodal large language models (MLLMs) struggle with this setting, often generating plausible but visually inconsistent or weakly grounded responses. We introduce $\textbf{EgoVITA}$, a framework that decomposes egocentric video reasoning into a structured $\textit{plan-then-verify}$ process. The model first generates an $\textbf{egocentric plan}$: a causal sequence of anticipated actions from a first-person perspective. This plan is then evaluated by an $\textbf{exocentric verification}$ stage that validates spatiotemporal and logical consistency from a third-person viewpoint. This decomposition enables cross-perspective feedback without requiring paired ego-exo supervision. To train this reasoning process, we adopt Group Relative Policy Optimization (GRPO) with two dense reward signals: one that aligns intermediate plan steps with future visual states and another that reinforces consistent third-person verification. EgoVITA achieves state-of-the-art performance on egocentric reasoning benchmarks, outperforming Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks with only $47k$ training samples.