arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2862
专题追踪
2603.07130 2026-03-10 cs.SD eess.AS

Toward Multimodal Industrial Fault Analysis: A Single-Speed Chain Conveyor Dataset with Audio and Vibration Signals

Zhang Chen, Yucong Zhang, Xiaoxiao Miao, Ming Li

Comments Submitted to Interspeech 2026

详情
英文摘要

We introduce a multimodal industrial fault analysis dataset collected from a single-speed chain conveyor (SSCC) system, targeting system-level fault detection in production lines. The dataset consists of multimodal signals, including three audio and four vibration channels. It covers normal operation and four representative fault types under multiple speeds, loads, and both clean and realistic factory-noise conditions reproduced on-site. It is explicitly designed to support channel-wise analysis and multimodal fusion research. We establish standardized evaluation protocols for unsupervised fault detection with normal-only training and supervised fault classification with balanced dataset splits across different operating conditions and fault types. A unified channel-wise kNN baseline is provided to enable fair comparison of representation quality without task-specific training. The dataset offers a practical and extensible benchmark for robust multimodal industrial fault analysis.

2603.07122 2026-03-10 cs.LG stat.ML

Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers

Tao Shi, Liangming Chen, Long Jin, Mengchu Zhou

详情
英文摘要

In the training of neural networks, adaptive moment estimation (Adam) typically converges fast but exhibits suboptimal generalization performance. A widely accepted explanation for its defect in generalization is that it often tends to converge to sharp minima. To enhance its ability to find flat minima, we propose its new variant named inverse Adam (InvAdam). The key improvement of InvAdam lies in its parameter update mechanism, which is opposite to that of Adam. Specifically, it computes element-wise multiplication of the first-order and second-order moments, while Adam computes the element-wise division of these two moments. This modification aims to increase the step size of the parameter update when the elements in the second-order moments are large and vice versa, which helps the parameter escape sharp minima and stay at flat ones. However, InvAdam's update mechanism may face challenges in convergence. To address this challenge, we propose dual Adam (DualAdam), which integrates the update mechanisms of both Adam and InvAdam, ensuring convergence while enhancing generalization performance. Additionally, we introduce the diffusion theory to mathematically demonstrate InvAdam's ability to escape sharp minima. Extensive experiments are conducted on image classification tasks and large language model (LLM) fine-tuning. The results validate that DualAdam outperforms Adam and its state-of-the-art variants in terms of generalization performance. The code is publicly available at https://github.com/LongJin-lab/DualAdam.

2603.07120 2026-03-10 cs.CV

Inter-Image Pixel Shuffling for Multi-focus Image Fusion

Huangxing Lin, Rongrong Ma, Cheng Wang

详情
英文摘要

Multi-focus image fusion aims to combine multiple partially focused images into a single all-in-focus image. Although deep learning has shown promise in this task, its effectiveness is often limited by the scarcity of suitable training data. This paper introduces Inter-image Pixel Shuffling (IPS), a novel method that allows neural networks to learn multi-focus image fusion without requiring actual multi-focus images. IPS reformulates the task as a pixel-wise classification problem, where the goal is to identify the focused pixel from a pixel group at each spatial position. In this method, pixels from a clear optical image are treated as focused, while pixels from a low-pass filtered version of the same image are considered defocused. By randomly shuffling the focused and defocused pixels at identical spatial positions in the original and filtered images, IPS generates training data that preserves spatial structure while mixing focus-defocus information. The model is trained to select the focused pixel from each spatially aligned pixel group, thus learning to reconstruct an all-in-focus image by aggregating sharp content from the input. To further enhance fusion quality, IPS adopts a cross-image fusion network that integrates the localized representation power of convolutional neural networks with the long-range modeling capabilities of state space models. This design effectively leverages both spatial detail and contextual information to produce high-quality fused results. Experimental results indicate that IPS significantly outperforms existing multi-focus image fusion methods, even without training on multi-focus images.

2603.07119 2026-03-10 cs.CV

TIQA: Human-Aligned Text Quality Assessment in Generated Images

Kirill Koltsov, Aleksandr Gushchin, Dmitriy Vatolin, Anastasia Antsiferova

详情
英文摘要

Text rendering remains a persistent failure mode of modern text-to-image models (T2I), yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), a task that predicts a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least $\sim0.05$ on TIQA-Crops and $\sim0.08$ on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.

2603.07111 2026-03-10 cs.CL cs.AI

Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information

Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara, Zhiyang Qi, Tomoya Higuchi, Ryutaro Asahara, Michimasa Inaba

Comments Accepted to the 2nd International AIWolfDial Workshop at INLG 2024

详情
英文摘要

The Werewolf Game is a communication game where players' reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We thus develop the LLM-based agents for the Werewolf Game. This study aims to enhance the consistency of the agent's utterances by utilizing dialogue summaries generated by LLMs and manually designed personas and utterance examples. By analyzing self-match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.

2603.07110 2026-03-10 cs.RO

Learning From Failures: Efficient Reinforcement Learning Control with Episodic Memory

Chenyang Miao

详情
英文摘要

Reinforcement learning has achieved remarkable success in robot learning. However, under challenging exploration and contact-rich dynamics, early-stage training is frequently dominated by premature terminations such as collisions and falls. As a result, learning is overwhelmed by short-horizon, low-return trajectories, which hinder convergence and limit long-horizon exploration. To alleviate this issue, we propose a technique called Failure Episodic Memory Alert (FEMA). FEMA explicitly stores short-horizon failure experiences through an episodic memory module. During interactions, it retrieves similar failure experiences and prevents the robot from recurrently relapsing into unstable states, guiding the policy toward long-horizon trajectories with greater long-term value. FEMA can be combined easily with model-free reinforcement learning algorithms, and yields a substantial sample-efficiency improvement of 33.11% on MuJoCo tasks across several classical RL algorithms. Furthermore, integrating FEMA into a parallelized PPO training pipeline demonstrates its effectiveness on a real-world bipedal robot task.

2603.07098 2026-03-10 cs.CV

NuNext: Reframing Nucleus Detection as Next-Point Detection

Zhongyi Shui, Honglin Li, Xiaozhong Ji, Ye Zhang, Zijiang Yang, Chenglu Zhu, Yuxuan Sun, Kai Yao, Conghui He, Cheng Tan

详情
英文摘要

Nucleus detection in histopathology is pivotal for a wide range of clinical applications. Existing approaches either regress nuclear proxy maps that require complex post-processing, or employ dense anchors or queries that introduce severe foreground-background imbalance. In this work, we reformulate nucleus detection as next-point prediction, wherein a multimodal large language model is developed to directly output foreground nucleus centroids from the input image. The model is trained in two stages. In the supervised learning stage, we propose spatial-aware soft supervision to relax strict centroid matching and a chain-of-visual-thought strategy to incorporate visual priors that facilitate coordinate prediction. In the reinforcement fine-tuning stage, we design distribution matching reward, low-variance group filtering, and fine-grained advantage shaping to further improve the model's detection quality. Extensive experiments on nine widely used benchmarks demonstrate the superiority of our method. Code will be released soon.

2603.07096 2026-03-10 cs.RO

Towards Scalable Probabilistic Human Motion Prediction with Gaussian Processes for Safe Human-Robot Collaboration

Jinger Chong, Xiaotong Zhang, Kamal Youcef-Toumi

Comments Submitted to IROS 2026

详情
英文摘要

Accurate human motion prediction with well-calibrated uncertainty is critical for safe human-robot collaboration (HRC), where robots must anticipate and react to human movements in real time. We propose a structured multitask variational Gaussian Process (GP) framework for full-body human motion prediction that captures temporal correlations and leverages joint-dimension-level factorization for scalability, while using a continuous 6D rotation representation to preserve kinematic consistency. Evaluated on Human3.6M (H3.6M), our model achieves up to 50 lower kernel density estimate negative log-likelihood (KDE NLL) than strong baselines, a mean continuous ranked probability score (CRPS) of 0.021 m, and deterministic mean angle error (MAE) that is 3-18% higher than competitive deep learning methods. Empirical coverage analysis shows that the fraction of ground-truth outcomes contained within predicted confidence intervals gradually decreases with horizon, remaining conservative for lower-confidence intervals and near-nominal for higher-confidence intervals, with only modest calibration drift at longer horizons. Despite its probabilistic formulation, our model requires only 0.24-0.35 M parameters, roughly eight times fewer than comparable approaches, and exhibits modest inference times, indicating suitability for real-time deployment. Extensive ablation studies further validated the choice of 6D rotation representation and Matern 3/2 + Linear kernel, and guided the selection of the number of inducing points and latent dimensionality. These results demonstrate that scalable GP-based models can deliver competitive accuracy together with reliable and interpretable uncertainty estimates for downstream robotics tasks such as motion planning and collision avoidance.

2603.07095 2026-03-10 cs.RO cs.SY eess.SY

ACLM: ADMM-Based Distributed Model Predictive Control for Collaborative Loco-Manipulation

Ziyi Zhou, Pengyuan Shu, Ruize Cao, Yuntian Zhao, Ye Zhao

详情
英文摘要

Collaborative transportation of heavy payloads via loco-manipulation is a challenging yet essential capability for legged robots operating in complex, unstructured environments. Centralized planning methods, e.g., holistic trajectory optimization, capture dynamic coupling among robots and payloads but scale poorly with system size, limiting real-time applicability. In contrast, hierarchical and fully decentralized approaches often neglect force and dynamic interactions, leading to conservative behavior. This study proposes an Alternating Direction Method of Multipliers (ADMM)-based distributed model predictive control framework for collaborative loco-manipulation with a team of quadruped robots with manipulators. By exploiting the payload-induced coupling structure, the global optimal control problem is decomposed into parallel individual-robot-level subproblems with consensus constraints. The distributed planner operates in a receding-horizon fashion and achieves fast convergence, requiring only a few ADMM iterations per planning cycle. A wrench-aware whole-body controller executes the planned trajectories, tracking both motion and interaction wrenches. Extensive simulations with up to four robots demonstrate scalability, real-time performance, and robustness to model uncertainty.

2603.07093 2026-03-10 cs.CV

Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

Xu Chen, Rui Gao, Xinjie Zhang, Haoyu Zhang, Che Sun, Zhi Gao, Yuwei Wu, Yunde Jia

详情
英文摘要

Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to evolving conversational cues of the speaker. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression response with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.

2603.07078 2026-03-10 cs.AI cs.CL

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang

详情
英文摘要

Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal -- how much of a CoT is necessary versus structurally redundant -- that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.

2603.07077 2026-03-10 cs.CV

Aligning What EEG Can See: Structural Representations for Brain-Vision Matching

Jingyi Tang, Shuai Jiang, Fei Su, Zhicheng Zhao

详情
英文摘要

Visual decoding from electroencephalography (EEG) has emerged as a highly promising avenue for non-invasive brain-computer interfaces (BCIs). Existing EEG-based decoding methods predominantly align brain signals with the final-layer semantic embeddings of deep visual models. However, relying on these highly abstracted embeddings inevitably leads to severe cross-modal information mismatch. In this work, we introduce the concept of Neural Visibility and accordingly propose the EEG-Visible Layer Selection Strategy, aligning EEG signals with intermediate visual layers to minimize this mismatch. Furthermore, to accommodate the multi-stage nature of human visual processing, we propose a novel Hierarchically Complementary Fusion (HCF) framework that jointly integrates visual representations from different hierarchical levels. Extensive experiments demonstrate that our method achieves state-of-the-art performance, reaching an 84.6% accuracy (+21.4%) on zero-shot visual decoding on the THINGS-EEG dataset. Moreover, our method achieves up to a 129.8% performance gain across diverse EEG baselines, demonstrating its robust generalizability.

2603.07074 2026-03-10 cs.CV

Physics-Guided VLM Priors for All-Cloud Removal

Liying Xu, Huifang Li, Huanfeng Shen

详情
英文摘要

Cloud removal is a fundamental challenge in optical remote sensing due to the heterogeneous degradation. Thin clouds distort radiometry via partial transmission, while thick clouds occlude the surface. Existing pipelines separate thin-cloud correction from thick-cloud reconstruction, requiring explicit cloud-type decisions and often leading to error accumulation and discontinuities in mixed-cloud scenes. Therefore, a novel approach named Physical-VLM All-Cloud Removal (PhyVLM-CR) that integrates the semantic capability of Vision-Language Model (VLM) into a physical restoration model, achieving high-fidelity unified cloud removal. Specifically, the cognitive prior from a VLM (e.g., Qwen) is transformed into physical scattering parameters and a hallucination confidence map. Leveraging this confidence map as a continuous soft gate, our method achieves a unified restoration via adaptive weighting: it prioritizes physical inversion in high-transmission regions to preserve radiometric fidelity, while seamlessly transitioning to temporal reference reconstruction in low-confidence occluded areas. This mechanism eliminates the need for explicit boundary delineation, ensuring a coherent removal across heterogeneous cloud covers. Experiments on real-world Sentinel-2 surface reflectance imagery confirm that our approach achieves a remarkable balance between cloud removal and content preservation, delivering hallucination-free results with substantially improved quantitative accuracy compared to existing methods.

2603.07073 2026-03-10 cs.LG

Interpretable Maximum Margin Deep Anomaly Detection

Zhiji Yang, Mei Huang, Xinyu Li, Xianli Pan, Qi Wang, Jianhua Zhao

详情
英文摘要

Anomaly detection is a crucial machine-learning task with wide-ranging applications. Deep Support Vector Data Description (Deep SVDD) is a prominent deep one-class method, but it is vulnerable to hypersphere collapse, often relies on heuristic choices for hypersphere parameters, and provides limited interpretability. To address these issues, we propose Interpretable Maximum Margin Deep Anomaly Detection (IMD-AD), which leverages a small set of labeled anomalies and a maximum margin objective to stabilize training and improve discrimination. It is inherently resilient to hypersphere collapse. Furthermore, we prove an equivalence between hypersphere parameters and the network's final-layer weights, which allows the center and radius to be learned end-to-end as part of the model and yields intrinsic interpretability and visualizable outputs. We further develop an efficient training algorithm that jointly optimizes representation, margin, and final-layer parameters. Extensive experiments and ablation studies on image and tabular benchmarks demonstrate that IMD-AD empirically improves detection performance over several state-of-the-art baselines while providing interpretable decision diagnostics.

2603.07072 2026-03-10 cs.RO cs.LG

The Talking Robot: Distortion-Robust Acoustic Models for Robot-Robot Communication

Hanlong Li, Karishma Kamalahasan, Jiahui Li, Kazuhiro Nakadai, Shreyas Kousik

详情
英文摘要

We present Artoo, a learned acoustic communication system for robots that replaces hand-designed signal processing with end-to-end co-trained neural networks. Our system pairs a lightweight text-to-speech (TTS) transmitter (1.18M parameters) with a conformer-based automatic speech recognition (ASR) receiver (938K parameters), jointly optimized through a differentiable channel. Unlike human speech, robot-to-robot communication is paralinguistics-free: the system need not preserve timbre, prosody, or naturalness, only maximize decoding accuracy under channel distortion. Through a three-phase co-training curriculum, the TTS transmitter learns to produce distortion-robust acoustic encodings that surpass the baseline under noise, achieving 8.3% CER at 0 dB SNR. The entire system requires only 2.1M parameters (8.4 MB) and runs in under 13 ms end-to-end on a CPU, making it suitable for deployment on resource-constrained robotic platforms.

2603.07068 2026-03-10 cs.RO

Morphology-Independent Facial Expression Imitation for Human-Face Robots

Xu Chen, Rui Gao, Che Sun, Zhehang Liu, Yuwei Wu, Shuo Yang, Yunde Jia

详情
英文摘要

Accurate facial expression imitation on human-face robots is crucial for achieving natural human-robot interaction. Most existing methods have achieved photorealistic expression imitation through mapping 2D facial landmarks to a robot's actuator commands. Their imitation of landmark trajectories is susceptible to interference from facial morphology, which would lead to a performance drop. In this paper, we propose a morphology-independent expression imitation method that decouples expressions from facial morphology to eliminate morphological influence and produce more realistic expressions for human-face robots. Specifically, we construct an expression decoupling module to learn expression semantics by disentangling the expression representation from the morphology representation in a self-supervised manner. We devise an expression transfer module to map the representations to the robot's actuator commands through a learning objective of perceiving expression errors, producing accurate facial expressions based on the learned expression semantics. To support experimental validation, a custom-designed and highly expressive human-face robot, namely Pengrui, is developed to serve as an experimental platform for realistic expression imitation. Extensive experiments demonstrate that our method enables the human-face robot to reproduce a wide range of human-like expressions effectively. All code and implementation details of the robot will be released.

2603.07066 2026-03-10 cs.CV cs.AI

MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering

Trong-Thang Pham, Loc Nguyen, Anh Nguyen, Hien Nguyen, Ngan Le

详情
英文摘要

Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h-Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is at link https://github.com/phamtrongthang123/medsteer

2603.07060 2026-03-10 cs.RO cs.HC cs.SY eess.SY

GuideTWSI: A Diverse Tactile Walking Surface Indicator Dataset from Synthetic and Real-World Images for Blind and Low-Vision Navigation

Hochul Hwang, Soowan Yang, Anh N. H. Nguyen, Parth Goel, Krisha Adhikari, Sunghoon I. Lee, Joydeep Biswas, Nicholas A. Giudice, Donghyun Kim

详情
英文摘要

Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars - raised parallel strips used for continuous guidance along sidewalks. This narrow focus overlooks truncated domes - rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges. As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments.

2603.07054 2026-03-10 cs.AI eess.SP

Bi-directional digital twin prototype anchoring with multi-periodicity learning for few-shot fault diagnosis

Pengcheng Xia, Zhichao Dong, Yixiang Huang, Chengjin Qin, Qun Chao, Chengliang Liu

Comments 19 pages, 13 figures

详情
英文摘要

Intelligent fault diagnosis (IFD) has emerged as a powerful paradigm for ensuring the safety and reliability of industrial machinery. However, traditional IFD methods rely heavily on abundant labeled data for training, which is often difficult to obtain in practical industrial environments. Constructing a digital twin (DT) of the physical asset to obtain simulation data has therefore become a promising alternative. Nevertheless, existing DT-assisted diagnosis methods mainly transfer diagnostic knowledge through domain adaptation techniques, which still require a considerable amount of unlabeled data from the target asset. To address the challenges in few-shot scenarios where only extremely limited samples are available, a bi-directional DT prototype anchoring method with multi-periodicity learning is proposed. Specifically, a framework involving meta-training in the DT virtual space and test-time adaptation in the physical space is constructed for reliable few-shot model adaptation for the target asset. A bi-directional twin-domain prototype anchoring strategy with covariance-guided augmentation for adaptation is further developed to improve the robustness of prototype estimation. In addition, a multi-periodicity feature learning module is designed to capture the intrinsic periodic characteristics within current signals. A DT of an asynchronous motor is built based on finite element method, and experiments are conducted under multiple few-shot settings and three working conditions. Comparative and ablation studies demonstrate the superiority and effectiveness of the proposed method for few-shot fault diagnosis.

2603.07048 2026-03-10 cs.CV cs.AI

Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

Xiaochen Yang, Hao Fang, Jiawei Kong, Yaoxin Mao, Bin Chen, Shu-Tao Xia

详情
英文摘要

Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Inspired by this, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow; (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction and those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.

2603.07043 2026-03-10 cs.CV

Fine-Grained 3D Facial Reconstruction for Micro-Expressions

Che Sun, Xinjie Zhang, Rui Gao, Xu Chen, Yuwei Wu, Yunde Jia

详情
英文摘要

Recent advances in 3D facial expression reconstruction have demonstrated remarkable performance in capturing macro-expressions, yet the reconstruction of micro-expressions remains unexplored. This novel task is particularly challenging due to the subtle, transient, and low-intensity nature of micro-expressions, which complicate the extraction of stable and discriminative features essential for accurate reconstruction. In this paper, we propose a fine-grained micro-expression reconstruction method that integrates a global dynamic feature capturing stable facial motion patterns with a locally-enriched feature incorporating multiple informative cues from 2D motions, facial priors and 3D facial geometry. Specifically, we devise a plug-and-play dynamic-encoded module to extract micro-expression feature for global facial action, allowing it to leverage prior knowledge from abundant macro-expression data to mitigate the scarcity of micro-expression data. Subsequently, a dynamic-guided mesh deformation module is designed for extracting aggregated local features from dense optical flow, sparse landmark cues and facial mesh geometry, which adaptively refines fine-grained facial micro-expression without compromising global 3D geometry. Extensive experiments on micro-expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in both geometric accuracy and perceptual detail.

2603.07040 2026-03-10 cs.RO

TacDexGrasp: Compliant and Robust Dexterous Grasping with Tactile Feedback

Yubin Ke, Jiayi Chen, Hang Lv, Xiao Zhou, He Wang

Comments 8pages, 7 figures

详情
英文摘要

Multi-fingered hands offer great potential for compliant and robust grasping of unknown objects, yet their high-dimensional force control presents a significant challenge. This work addresses two key problems: (1) distributing forces across multiple contacts to counteract an object's weight, and (2) preventing rotational slip caused by gravitational torque when a grasp is distant from the object's center of mass. We address these challenges via tactile feedback and a Second-Order Cone Programming (SOCP)-based controller, without explicit torque modeling or slip detection. Our key insights are (1) rotational slip inevitably induces translational slip at some contact points for a multi-fingered grasp, and (2) the ratio of tangential to normal force at each contact is an effective early stability indicator. By actively constraining this ratio for each finger below the estimated friction coefficient, our controller maintains grasp stability against both translational and rotational slip. Real-world experiments on 12 diverse objects demonstrate the robustness and compliance of our approach.

2603.07039 2026-03-10 cs.AI

Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

Lance Legel, Qin Huang, Brandon Voelker, Daniel Neamati, Patrick Alan Johnson, Favyen Bastani, Jeff Rose, James Ryan Hennessy, Robert Guralnick, Douglas Soltis, Pamela Soltis, Shaowen Wang

Comments 8 pages, 5 figures, 1 table. Presented at 2026 World Modeling Workshop, Mila Quebec

详情
英文摘要

We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth

2603.07032 2026-03-10 cs.RO

SSP: Safety-guaranteed Surgical Policy via Joint Optimization of Behavioral and Spatial Constraints

Jianshu Hu, ZhiYuan Guan, Lei Song, Kantaphat Leelakunwet, Hesheng Wang, Wei Xiao, Qi Dou, Yutong Ban

详情
英文摘要

The paradigm of robot-assisted surgery is shifting toward data-driven autonomy, where policies learned via Reinforcement Learning (RL) or Imitation Learning (IL) enable the execution of complex tasks. However, these ``black-box" policies often lack formal safety guarantees, a critical requirement for clinical deployment. In this paper, we propose the Safety-guaranteed Surgical Policy (SSP) framework to bridge the gap between data-driven generality and formal safety. We utilize Neural Ordinary Differential Equations (Neural ODEs) to learn an uncertainty-aware dynamics model from demonstration data. This learned model underpins a robust Control Barrier Function (CBF) safety controller, which minimally alters the actions of a surgical policy to ensure strict safety under uncertainty. Our controller enforces two constraint categories: behavioral constraints (restricting the task space of the agent) and spatial constraints (defining surgical no-go zones). We instantiate the SSP framework with surgical policies derived from RL, IL and Control Lyapunov Functions (CLF). Validation on in both the SurRoL simulation and da Vinci Research Kit (dVRK) demonstrates that our method achieves a near-zero constraint violation rate while maintaining high task success rates compared to unconstrained baselines.

2603.07027 2026-03-10 cs.LG

Resource-Adaptive Federated Text Generation with Differential Privacy

Jiayi Wang, John Gounley, Heidi Hanson

Comments Accepted by DATA-FM workshop @ ICLR 2026

详情
英文摘要

In cross-silo federated learning (FL), sensitive text datasets remain confined to local organizations due to privacy regulations, making repeated training for each downstream task both communication-intensive and privacy-demanding. A promising alternative is to generate differentially private (DP) synthetic datasets that approximate the global distribution and can be reused across tasks. However, pretrained large language models (LLMs) often fail under domain shift, and federated finetuning is hindered by computational heterogeneity: only resource-rich clients can update the model, while weaker clients are excluded, amplifying data skew and the adverse effects of DP noise. We propose a flexible participation framework that adapts to client capacities. Strong clients perform DP federated finetuning, while weak clients contribute through a lightweight DP voting mechanism that refines synthetic text. To ensure the synthetic data mirrors the global dataset, we apply control codes (e.g., labels, topics, metadata) that represent each client's data proportions and constrain voting to semantically coherent subsets. This two-phase approach requires only a single round of communication for weak clients and integrates contributions from all participants. Experiments show that our framework improves distribution alignment and downstream robustness under DP and heterogeneity.

2603.07025 2026-03-10 cs.CL

Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision

Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng, Haoyang Li, Hexin Liu, Eng Siong Chng

Comments Submitted for Review to Interspeech 2026

详情
英文摘要

Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but are difficult to train with supervised fine-tuning, requiring large, task-specific speech corpora. While recent distillation-based approaches train performant English-only Speech LLMs using only annotated ASR data by aligning text and speech using only a lightweight projector, these models under-perform when scaled to multilingual settings due to language interference in the shared projector. We address this by introducing language-aware distillation using a query bank and a gating network that selects or mixes query tokens using a Q-Former projector. Our approach shows gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions. Our best model improves over existing Speech LLM baselines by 32% on Audio-MLQA.

2603.07024 2026-03-10 cs.AI

Enhancing Web Agents with a Hierarchical Memory Tree

Yunteng Tan, Zhi Gao, Xinxiao Wu

详情
英文摘要

Large language model-based web agents have shown strong potential in automating web interactions through advanced reasoning and instruction following. While retrieval-based memory derived from historical trajectories enables these agents to handle complex, long-horizon tasks, current methods struggle to generalize across unseen websites. We identify that this challenge arises from the flat memory structures that entangle high-level task logic with site-specific action details. This entanglement induces a workflow mismatch in new environments, where retrieved contents are conflated with current web, leading to logically inconsistent execution. To address this, we propose Hierarchical Memory Tree (HMT), a structured framework designed to explicitly decouple logical planning from action execution. HMT constructs a three-level hierarchy from raw trajectories via an automated abstraction pipeline: the Intent level maps diverse user instructions to standardized task goals; the Stage level defines reusable semantic subgoals characterized by observable pre-conditions and post-conditions; and the Action level stores action patterns paired with transferable semantic element descriptions. Leveraging this structure, we develop a stage-aware inference mechanism comprising a Planner and an Actor. By explicitly validating pre-conditions, the Planner aligns the current state with the correct logical subgoal to prevent workflow mismatch, while the Actor grounds actions by matching the stored semantic descriptions to the target page. Experimental results on Mind2Web and WebArena show that HMT significantly outperforms flat-memory methods, particularly in cross-website and cross-domain scenarios, highlighting the necessity of structured memory for robust generalization of web agents.

2603.07023 2026-03-10 cs.CL cs.AI

Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment

Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang

Comments 21 pages, 2 figures, 6 tables

详情
英文摘要

Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes critical evidence to be submerged by voluminous noise, which complicates the discernment of relevant fragments within a dense input. In this paper, we propose \textbf{Hit-RAG}, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline. Our approach systematically refines the utilization of external evidence via three distinct stages. First, Supervised Fine-tuning establishes baseline context awareness to minimize information neglect. Next, Discriminative Preference Alignment enhances robustness against misleading distractors. Finally, Group-Relative Policy Optimization stabilizes logical synthesis to prevent reasoning collapse. Extensive evaluations on eight benchmarks demonstrate that Hit-RAG consistently yields substantial performance gains, enabling models to bridge the gap between context acquisition and accurate reasoning while surpassing much larger counterparts in long-context scenarios.

2603.07022 2026-03-10 cs.CV

OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

Leilei Wang, Longfei Liu, Xi Shen, Xuanlong Yu, Ying Tiffany He, Fei Richard Yu, Yingyi Chen

详情
英文摘要

Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at https://github.com/wleilei/OV-DEIM.

2603.07020 2026-03-10 cs.LG cs.AI

RESCHED: Rethinking Flexible Job Shop Scheduling from a Transformer-based Architecture with Simplified States

Xiangjie Xiao, Cong Zhang, Wen Song, Zhiguang Cao

详情
英文摘要

Neural approaches to the Flexible Job Shop Scheduling Problem (FJSP), particularly those based on deep reinforcement learning (DRL), have gained growing attention in recent years. However, existing methods rely on complex feature-engineered state representations (i.e., often requiring more than 20 handcrafted features) and graph-biased neural architectures. To reduce modeling complexity and advance a more generalizable framework for FJSP, we introduce \textsc{ReSched}, a minimalist DRL framework that rethinks both the scheduling formulation and model design. First, by revisiting the Markov Decision Process (MDP) formulation of FJSP, we condense the state space to just four essential features, eliminating historical dependencies through a subproblem-based perspective. Second, we employ Transformer blocks with dot-product attention, augmented by three lightweight but effective architectural modifications tailored to scheduling tasks. Extensive experiments show that \textsc{ReSched} outperforms classical dispatching rules and state-of-the-art DRL methods on FJSP. Moreover, \textsc{ReSched} also generalizes well to the Job Shop Scheduling Problem (JSSP) and the Flexible Flow Shop Scheduling Problem (FFSP), achieving competitive performance against neural baselines specifically designed for these variants.