arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2393
专题追踪
2602.13571 2026-02-17 cs.CL cs.AI

LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems

Zhipeng Song, Xiangyu Kong, Xinrui Bao, Yizhi Zhou, Jiulong Jiao, Sitong Liu, Yuhang Zhou, Heng Qi

Comments Published by ESWA

Journal ref Expert Systems with Applications. (2026) 131627

详情
英文摘要

Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers demonstrate effectiveness, they frequently necessitate specialized training, impose substantial computational expenses, and fail to fully exploit the semantic capabilities of LLMs, particularly their inherent confidence signals. We propose the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that enhances reranking in RAG systems by leveraging black-box LLM confidence derived from Maximum Semantic Cluster Proportion (MSCP). LCR employs a two-stage process: confidence assessment via multinomial sampling and clustering, followed by binning and multi-level sorting based on query and document confidence thresholds. This approach prioritizes relevant documents while preserving original rankings for high-confidence queries, ensuring robustness. Evaluated on BEIR and TREC benchmarks with BM25 and Contriever retrievers, LCR--using only 7--9B-parameter pre-trained LLMs--consistently improves NDCG@5 by up to 20.6% across pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies validate the hypothesis that LLM confidence positively correlates with document relevance, elucidating LCR's mechanism. LCR offers computational efficiency, parallelism for scalability, and broad compatibility, mitigating hallucinations in applications like medical diagnosis.

2602.13568 2026-02-17 cs.AI

Who Do LLMs Trust? Human Experts Matter More Than Other LLMs

Anooshka Bajaj, Zoran Tiganj

详情
英文摘要

Large language models (LLMs) increasingly operate in environments where they encounter social information such as other agents' answers, tool outputs, or human recommendations. In humans, such inputs influence judgments in ways that depend on the source's credibility and the strength of consensus. This paper investigates whether LLMs exhibit analogous patterns of influence and whether they privilege feedback from humans over feedback from other LLMs. Across three binary decision-making tasks, reading comprehension, multi-step reasoning, and moral judgment, we present four instruction-tuned LLMs with prior responses attributed either to friends, to human experts, or to other LLMs. We manipulate whether the group is correct and vary the group size. In a second experiment, we introduce direct disagreement between a single human and a single LLM. Across tasks, models conform significantly more to responses labeled as coming from human experts, including when that signal is incorrect, and revise their answers toward experts more readily than toward other LLMs. These results reveal that expert framing acts as a strong prior for contemporary LLMs, suggesting a form of credibility-sensitive social influence that generalizes across decision domains.

2602.13567 2026-02-17 cs.CL

DistillLens: Symmetric Knowledge Distillation Through Logit Lens

Manish Dhakal, Uthman Jinadu, Anjila Budathoki, Rajshekhar Sunderraman, Yi Ding

Comments Knowledge Distillation in LLMs

详情
英文摘要

Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the teacher's intermediate layer's thought process as a black box. While feature-based distillation attempts to bridge this gap, existing methods (e.g., MSE and asymmetric KL divergence) ignore the rich uncertainty profiles required for the final output. In this paper, we introduce DistillLens, a framework that symmetrically aligns the evolving thought processes of student and teacher models. By projecting intermediate hidden states into the vocabulary space via the Logit Lens, we enforce structural alignment using a symmetric divergence objective. Our analysis proves that this constraint imposes a dual-sided penalty, preventing both overconfidence and underconfidence while preserving the high-entropy information conduits essential for final deduction. Extensive experiments on GPT-2 and Llama architectures demonstrate that DistillLens consistently outperforms standard KD and feature-transfer baselines on diverse instruction-following benchmarks. The code is available at https://github.com/manishdhakal/DistillLens.

2602.13557 2026-02-17 cs.LG

Scenario-Adaptive MU-MIMO OFDM Semantic Communication With Asymmetric Neural Network

Chongyang Li, Tianqian Zhang, Shouyin Liu

详情
英文摘要

Semantic Communication (SemCom) has emerged as a promising paradigm for 6G networks, aiming to extract and transmit task-relevant information rather than minimizing bit errors. However, applying SemCom to realistic downlink Multi-User Multi-Input Multi-Output (MU-MIMO) Orthogonal Frequency Division Multiplexing (OFDM) systems remains challenging due to severe Multi-User Interference (MUI) and frequency-selective fading. Existing Deep Joint Source-Channel Coding (DJSCC) schemes, primarily designed for point-to-point links, suffer from performance saturation in multi-user scenarios. To address these issues, we propose a scenario-adaptive MU-MIMO SemCom framework featuring an asymmetric architecture tailored for downlink transmission. At the transmitter, we introduce a scenario-aware semantic encoder that dynamically adjusts feature extraction based on Channel State Information (CSI) and Signal-to-Noise Ratio (SNR), followed by a neural precoding network designed to mitigate MUI in the semantic domain. At the receiver, a lightweight decoder equipped with a novel pilot-guided attention mechanism is employed to implicitly perform channel equalization and feature calibration using reference pilot symbols. Extensive simulation results over 3GPP channel models demonstrate that the proposed framework significantly outperforms DJSCC and traditional Separate Source-Channel Coding (SSCC) schemes in terms of Peak Signal-to-Noise Ratio (PSNR) and classification accuracy, particularly in low-SNR regimes, while maintaining low latency and computational cost on edge devices.

2602.13555 2026-02-17 cs.CV cs.AI

Privacy-Concealing Cooperative Perception for BEV Scene Segmentation

Song Wang, Lingling Li, Marcus Santos, Guanghui Wang

详情
英文摘要

Cooperative perception systems for autonomous driving aim to overcome the limited perception range of a single vehicle by communicating with adjacent agents to share sensing information. While this improves perception performance, these systems also face a significant privacy-leakage issue, as sensitive visual content can potentially be reconstructed from the shared data. In this paper, we propose a novel Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation. Based on commonly shared BEV features, we design a hiding network to prevent an image reconstruction network from recovering the input images from the shared features. An adversarial learning mechanism is employed to train the network, where the hiding network works to conceal the visual clues in the BEV features while the reconstruction network attempts to uncover these clues. To maintain segmentation performance, the perception network is integrated with the hiding network and optimized end-to-end. The experimental results demonstrate that the proposed PCC framework effectively degrades the quality of the reconstructed images with minimal impact on segmentation performance, providing privacy protection for cooperating vehicles. The source code will be made publicly available upon publication.

2602.13549 2026-02-17 cs.CV

Nighttime Autonomous Driving Scene Reconstruction with Physically-Based Gaussian Splatting

Tae-Kyeong Kim, Xingxin Chen, Guile Wu, Chengjie Huang, Dongfeng Bai, Bingbing Liu

Comments ICRA 2026

详情
英文摘要

This paper focuses on scene reconstruction under nighttime conditions in autonomous driving simulation. Recent methods based on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved photorealistic modeling in autonomous driving scene reconstruction, but they primarily focus on normal-light conditions. Low-light driving scenes are more challenging to model due to their complex lighting and appearance conditions, which often causes performance degradation of existing methods. To address this problem, this work presents a novel approach that integrates physically based rendering into 3DGS to enhance nighttime scene reconstruction for autonomous driving. Specifically, our approach integrates physically based rendering into composite scene Gaussian representations and jointly optimizes Bidirectional Reflectance Distribution Function (BRDF) based material properties. We explicitly model diffuse components through a global illumination module and specular components by anisotropic spherical Gaussians. As a result, our approach improves reconstruction quality for outdoor nighttime driving scenes, while maintaining real-time rendering. Extensive experiments across diverse nighttime scenarios on two real-world autonomous driving datasets, including nuScenes and Waymo, demonstrate that our approach outperforms the state-of-the-art methods both quantitatively and qualitatively.

2602.13540 2026-02-17 cs.CL cs.AI cs.LG

On Calibration of Large Language Models: From Response To Capability

Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun

Comments preprint

详情
英文摘要

Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.

2602.13532 2026-02-17 cs.LG eess.AS eess.IV eess.SP

Fast Swap-Based Element Selection for Multiplication-Free Dimension Reduction

Nobutaka Ono

Comments 11 pages, 4 figures

详情
英文摘要

In this paper, we propose a fast algorithm for element selection, a multiplication-free form of dimension reduction that produces a dimension-reduced vector by simply selecting a subset of elements from the input. Dimension reduction is a fundamental technique for reducing unnecessary model parameters, mitigating overfitting, and accelerating training and inference. A standard approach is principal component analysis (PCA), but PCA relies on matrix multiplications; on resource-constrained systems, the multiplication count itself can become a bottleneck. Element selection eliminates this cost because the reduction consists only of selecting elements, and thus the key challenge is to determine which elements should be retained. We evaluate a candidate subset through the minimum mean-squared error of linear regression that predicts a target vector from the selected elements, where the target may be, for example, a one-hot label vector in classification. When an explicit target is unavailable, the input itself can be used as the target, yielding a reconstruction-based criterion. The resulting optimization is combinatorial, and exhaustive search is impractical. To address this, we derive an efficient formula for the objective change caused by swapping a selected and an unselected element, using the matrix inversion lemma, and we perform a swap-based local search that repeatedly applies objective-decreasing swaps until no further improvement is possible. Experiments on MNIST handwritten-digit images demonstrate the effectiveness of the proposed method.

2602.13531 2026-02-17 cs.LG quant-ph

QuaRK: A Quantum Reservoir Kernel for Time Series Learning

Abdallah Aaraba, Soumaya Cherkaoui, Ola Ahmad, Shengrui Wang

详情
英文摘要

Quantum reservoir computing offers a promising route for time series learning by modelling sequential data via rich quantum dynamics while the only training required happens at the level of a lightweight classical readout. However, studies featuring efficient and implementable quantum reservoir architectures along with model learning guarantees remain scarce in the literature. To close this gap, we introduce QuaRK, an end-to-end framework that couples a hardware-realistic quantum reservoir featurizer with a kernel-based readout scheme. Given a sequence of sample points, the reservoir injects the points one after the other to yield a compact feature vector from efficiently measured k-local observables using classical shadow tomography, after which a classical kernel-based readout learns the target mapping with explicit regularization and fast optimization. The resulting pipeline exposes clear computational knobs -- circuit width and depth as well as the measurement budget -- while preserving the flexibility of kernel methods to model nonlinear temporal functionals and being scalable to high-dimensional data. We further provide learning-theoretic generalization guarantees for dependent temporal data, linking design and resource choices to finite-sample performance, thereby offering principled guidance for building reliable temporal learners. Empirical experiments validate QuaRK and illustrate the predicted interpolation and generalization behaviours on synthetic beta-mixing time series tasks.

2602.13517 2026-02-17 cs.CL

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng

Comments Work in progress

详情
英文摘要

Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.

2602.13516 2026-02-17 cs.AI

SPILLage: Agentic Oversharing on the Web

Jaechul Roh, Eugene Bagdasarian, Hamed Haddadi, Ali Shahin Shamsabadi

详情
英文摘要

LLM-powered agents are beginning to automate user's tasks across the open web, often with access to user resources such as emails and calendars. Unlike standard LLMs answering questions in a controlled ChatBot setting, web agents act "in the wild", interacting with third parties and leaving behind an action trace. Therefore, we ask the question: how do web agents handle user resources when accomplishing tasks on their behalf across live websites? In this paper, we formalize Natural Agentic Oversharing -- the unintentional disclosure of task-irrelevant user information through an agent trace of actions on the web. We introduce SPILLage, a framework that characterizes oversharing along two dimensions: channel (content vs. behavior) and directness (explicit vs. implicit). This taxonomy reveals a critical blind spot: while prior work focuses on text leakage, web agents also overshare behaviorally through clicks, scrolls, and navigation patterns that can be monitored. We benchmark 180 tasks on live e-commerce sites with ground-truth annotations separating task-relevant from task-irrelevant attributes. Across 1,080 runs spanning two agentic frameworks and three backbone LLMs, we demonstrate that oversharing is pervasive with behavioral oversharing dominates content oversharing by 5x. This effect persists -- and can even worsen -- under prompt-level mitigation. However, removing task-irrelevant information before execution improves task success by up to 17.9%, demonstrating that reducing oversharing improves task success. Our findings underscore that protecting privacy in web agents is a fundamental challenge, requiring a broader view of "output" that accounts for what agents do on the web, not just what they type. Our datasets and code are available at https://github.com/jrohsc/SPILLage.

2602.13515 2026-02-17 cs.CV cs.LG

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Jintao Zhang, Kai Jiang, Chendong Xiang, Weiqi Feng, Yuezhou Hu, Haocheng Xi, Jianfei Chen, Jun Zhu

详情
英文摘要

Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.

2602.13504 2026-02-17 cs.CL cs.AI

From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier

Ozancan Ozdemir

详情
英文摘要

The rapid integration of large language models into newsroom workflows has raised urgent questions about the prevalence of AI-generated content in online media. While computational studies have begun to quantify this phenomenon in English-language outlets, no empirical investigation exists for Turkish news media, where existing research remains limited to qualitative interviews with journalists or fake news detection. This study addresses that gap by fine-tuning a Turkish-specific BERT model (dbmdz/bert-base-turkish-cased) on a labeled dataset of 3,600 articles from three major Turkish outlets with distinct editorial orientations for binary classification of AI-rewritten content. The model achieves 0.9708 F1 score on the held-out test set with symmetric precision and recall across both classes. Subsequent deployment on over 3,500 unseen articles spanning between 2023 and 2026 reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96 and an estimated 2.5 percentage of examined news content rewritten or revised by LLMs on average. To the best of our knowledge, this is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.

2602.13502 2026-02-17 cs.AI q-bio.OT

Translating Dietary Standards into Healthy Meals with Minimal Substitutions

Trevor Chan, Ilias Tagkopoulos

Comments 49 pages, 4 figures

详情
英文摘要

An important goal for personalized diet systems is to improve nutritional quality without compromising convenience or affordability. We present an end-to-end framework that converts dietary standards into complete meals with minimal change. Using the What We Eat in America (WWEIA) intake data for 135,491 meals, we identify 34 interpretable meal archetypes that we then use to condition a generative model and a portion predictor to meet USDA nutritional targets. In comparisons within archetypes, generated meals are better at following recommended daily intake (RDI) targets by 47.0%, while remaining compositionally close to real meals. Our results show that by allowing one to three food substitutions, we were able to create meals that were 10% more nutritious, while reducing costs 19-32%, on average. By turning dietary guidelines into realistic, budget-aware meals and simple swaps, this framework can underpin clinical decision support, public-health programs, and consumer apps that deliver scalable, equitable improvements in everyday nutrition.

2602.13482 2026-02-17 cs.LG cs.AI

Comparing Classifiers: A Case Study Using PyCM

Sadra Sabouri, Alireza Zolanvari, Sepand Haghighi

Comments 13 pages, 3 figures, 2 tables

详情
英文摘要

Selecting an optimal classification model requires a robust and comprehensive understanding of the performance of the model. This paper provides a tutorial on the PyCM library, demonstrating its utility in conducting deep-dive evaluations of multi-class classifiers. By examining two different case scenarios, we illustrate how the choice of evaluation metrics can fundamentally shift the interpretation of a model's efficacy. Our findings emphasize that a multi-dimensional evaluation framework is essential for uncovering small but important differences in model performance. However, standard metrics may miss these subtle performance trade-offs.

2602.13479 2026-02-17 cs.CV cs.HC

GLIMPSE : Real-Time Text Recognition and Contextual Understanding for VQA in Wearables

Akhil Ramachandran, Ankit Arun, Ashish Shenoy, Abhay Harpale, Srihari Jayakumar, Debojeet Chatterjee, Mohsen Moslehpour, Pierce Chuang, Yichao Lu, Vikas Bhardwaj, Peyman Heidari

详情
英文摘要

Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements - OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.

2602.13476 2026-02-17 cs.RO cs.LG

AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge

Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine

Comments 13 pages, 9 figures, 2 tables

详情
英文摘要

Robotic foundation models achieve strong generalization by leveraging internet-scale vision-language representations, but their massive computational cost creates a fundamental bottleneck: high inference latency. In dynamic environments, this latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. Inspired by hierarchical control, AsyncVLA runs a large foundation model on a remote workstation to provide high-level guidance, while a lightweight, onboard Edge Adapter continuously refines actions at high frequency. To bridge the domain gap between these asynchronous streams, we introduce an end-to-end finetuning protocol and a trajectory re-weighting strategy that prioritizes dynamic interactions. We evaluate our approach on real-world vision-based navigation tasks with communication delays up to 6 seconds. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines, effectively bridging the gap between the semantic intelligence of large models and the reactivity required for edge robotics.

2602.13457 2026-02-17 cs.RO

Inferring Turn-Rate-Limited Engagement Zones with Sacrificial Agents for Safe Trajectory Planning

Grant Stagg, Cameron K. Peterson

Comments Submitted to the Journal of Aerospace Information Systems

详情
英文摘要

This paper presents a learning-based framework for estimating pursuer parameters in turn-rate-limited pursuit-evasion scenarios using sacrificial agents. Each sacrificial agent follows a straight-line trajectory toward an adversary and reports whether it was intercepted or survived. These binary outcomes are related to the pursuer's parameters through a geometric reachable-region (RR) model. Two formulations are introduced: a boundary-interception case, where capture occurs at the RR boundary, and an interior-interception case, which allows capture anywhere within it. The pursuer's parameters are inferred using a gradient-based multi-start optimization with custom loss functions tailored to each case. Two trajectory-selection strategies are proposed for the sacrificial agents: a geometric heuristic that maximizes the spread of expected interception points, and a Bayesian experimental-design method that maximizes the D-score of the expected Gauss-Newton information matrix, thereby selecting trajectories that yield maximal information gain. Monte Carlo experiments demonstrate accurate parameter recovery with five to twelve sacrificial agents. The learned engagement models are then used to generate safe, time-optimal paths for high-value agents that avoid all feasible pursuer engagement regions.

2602.13455 2026-02-17 cs.CL cs.AI cs.HC

Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety

Phyllis Nabangi, Abdul-Jalil Zakaria, Jema David Ndibwile

Comments Accepted at the Second IJCAI AI for Good Symposium in Africa, hosted by Deep Learning Indaba, 7 pages, 1 figure

详情
英文摘要

The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset's small size and imbalance limit our findings' generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.

2602.13452 2026-02-17 cs.CL cs.AI

LLM-Powered Automatic Translation and Urgency in Crisis Scenarios

Belu Ticona, Antonis Anastasopoulos

详情
英文摘要

Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems in crisis-domain translation, with a focus on preserving urgency, which is a critical property for effective crisis communication and triaging. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering over 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely depending on the language of the prompt and input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.

2602.13444 2026-02-17 cs.RO cs.AI cs.CV

FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

Huajian Zeng, Lingyun Chen, Jiaqi Yang, Yuantai Zhang, Fan Shi, Peidong Liu, Xingxing Zuo

Comments Project Page: https://huajian-zeng.github.io/projects/flowhoi/

详情
英文摘要

Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.

2602.13440 2026-02-17 cs.CV cs.RO

Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones

Sebastian-Ion Nae, Mihai-Eugen Barbu, Sebastian Mocanu, Marius Leordeanu

Comments Accepted at European Robotics Forum (ERF) 2026

详情
英文摘要

Autonomous agents such as indoor drones must learn new object classes in real-time while limiting catastrophic forgetting, motivating Class-Incremental Learning (CIL). However, most unmanned aerial vehicle (UAV) datasets focus on outdoor scenes and offer limited temporally coherent indoor videos. We introduce an indoor dataset of $14,400$ frames capturing inter-drone and ground vehicle footage, annotated via a semi-automatic workflow with a $98.6\%$ first-pass labeling agreement before final manual verification. Using this dataset, we benchmark 3 replay-based CIL strategies: Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), using YOLOv11-nano as a resource-efficient detector for deployment-constrained UAV platforms. Under tight memory budgets ($5-10\%$ replay), FAR performs better than the rest, achieving an average accuracy (ACC, $mAP_{50-95}$ across increments) of $82.96\%$ with $5\%$ replay. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, which is associated with reduced localization quality for drones. The experiments further demonstrate that replay-based continual learning can be effectively applied to edge aerial systems. Overall, this work contributes an indoor UAV video dataset with preserved temporal coherence and an evaluation of replay-based CIL under limited replay budgets. Project page: https://spacetime-vision-robotics-laboratory.github.io/learning-on-the-fly-cl

2602.13418 2026-02-17 cs.LG

Text Has Curvature

Karish Grover, Hanqing Zeng, Yinglong Xia, Christos Faloutsos, Geoffrey J. Gordon

详情
英文摘要

Does text have an intrinsic curvature? Language is increasingly modeled in curved geometries - hyperbolic spaces for hierarchy, mixed-curvature manifolds for compositional structure - yet a basic scientific question remains unresolved: what does curvature mean for text itself, in a way that is native to language rather than an artifact of the embedding space we choose? We argue that text does indeed have curvature, and show how to detect it, define it, and use it. To this end, we propose Texture, a text-native, word-level discrete curvature signal, and make three contributions. (a) Existence: We provide empirical and theoretical certificates that semantic inference in natural corpora is non-flat, i.e. language has inherent curvature. (b) Definition: We define Texture by reconciling left- and right-context beliefs around a masked word through a Schrodinger bridge, yielding a curvature field that is positive where context focuses meaning and negative where it fans out into competing continuations. (c) Utility: Texture is actionable: it serves as a general-purpose measurement and control primitive enabling geometry without geometric training; we instantiate it on two representative tasks, improving long-context inference through curvature-guided compression and retrieval-augmented generation through curvature-guided routing. Together, our results establish a text-native curvature paradigm, making curvature measurable and practically useful.

2602.13416 2026-02-17 cs.LG

High-Resolution Climate Projections Using Diffusion-Based Downscaling of a Lightweight Climate Emulator

Haiwen Guan, Moein Darman, Dibyajyoti Chakraborty, Troy Arcomano, Ashesh Chattopadhyay, Romit Maulik

详情
英文摘要

The proliferation of data-driven models in weather and climate sciences has marked a significant paradigm shift, with advanced models demonstrating exceptional skill in medium-range forecasting. However, these models are often limited by long-term instabilities, climatological drift, and substantial computational costs during training and inference, restricting their broader application for climate studies. Addressing these limitations, Guan et al. (2024) introduced LUCIE, a lightweight, physically consistent climate emulator utilizing a Spherical Fourier Neural Operator (SFNO) architecture. This model is able to reproduce accurate long-term statistics including climatological mean and seasonal variability. However, LUCIE's native resolution (~300 km) is inadequate for detailed regional impact assessments. To overcome this limitation, we introduce a deep learning-based downscaling framework, leveraging probabilistic diffusion-based generative models with conditional and posterior sampling frameworks. These models downscale coarse LUCIE outputs to 25 km resolution. They are trained on approximately 14,000 ERA5 timesteps spanning 2000-2009 and evaluated on LUCIE predictions from 2010 to 2020. Model performance is assessed through diverse metrics, including latitude-averaged RMSE, power spectrum, probability density functions and First Empirical Orthogonal Function of the zonal wind. We observe that the proposed approach is able to preserve the coarse-grained dynamics from LUCIE while generating fine-scaled climatological statistics at ~28km resolution.

2602.13413 2026-02-17 cs.LG cs.NA math.NA math.OC stat.ML

Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise

Yuchen Fang, James Demmel, Javad Lavaei

详情
英文摘要

We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1,2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing training of SGD under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown, matching the optimal rates for normalized SGD, respectively. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the stochastic preconditioner and the gradient estimates. To enable the analysis, we develop a novel vector-valued Burkholder-type inequality that may be of independent interest. These results provide a theoretical explanation for the empirical preference for normalization over clipping in large-scale model training.

2602.13407 2026-02-17 cs.AI

On-Policy Supervised Fine-Tuning for Efficient Reasoning

Anhao Zhao, Ziyang Chen, Junlong Tong, Yingqi Fan, Fanghua Ye, Shuhao Li, Yunpu Ma, Wenjie Li, Xiaoyu Shen

详情
英文摘要

Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficiency Pareto frontier. It reduces CoT length by up to 80 while maintaining original accuracy, surpassing more complex RL-based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at https://github.com/EIT-NLP/On-Policy-SFT.

2602.13398 2026-02-17 cs.LG q-bio.QM

Accelerated Discovery of Cryoprotectant Cocktails via Multi-Objective Bayesian Optimization

Daniel Emerson, Nora Gaby-Biegel, Purva Joshi, Yoed Rabin, Rebecca D. Sandlin, Levent Burak Kara

详情
英文摘要

Designing cryoprotectant agent (CPA) cocktails for vitrification is challenging because formulations must be concentrated enough to suppress ice formation yet non-toxic enough to preserve cell viability. This tradeoff creates a large, multi-objective design space in which traditional discovery is slow, often relying on expert intuition or exhaustive experimentation. We present a data-efficient framework that accelerates CPA cocktail design by combining high-throughput screening with an active-learning loop based on multi-objective Bayesian optimization. From an initial set of measured cocktails, we train probabilistic surrogate models to predict concentration and viability and quantify uncertainty across candidate formulations. We then iteratively select the next experiments by prioritizing cocktails expected to improve the Pareto front, maximizing expected Pareto improvement under uncertainty, and update the models as new assay results are collected. Wet-lab validation shows that our approach efficiently discovers cocktails that simultaneously achieve high CPA concentrations and high post-exposure viability. Relative to a naive strategy and a strong baseline, our method improves dominated hypervolume by 9.5\% and 4.5\%, respectively, while reducing the number of experiments needed to reach high-quality solutions. In complementary synthetic studies, it recovers a comparably strong set of Pareto-optimal solutions using only 30\% of the evaluations required by the prior state-of-the-art multi-objective approach, which amounts to saving approximately 10 weeks of experimental time. Because the framework assumes only a suitable assay and defined formulation space, it can be adapted to different CPA libraries, objective definitions, and cell lines to accelerate cryopreservation development.

2602.13378 2026-02-17 cs.CV cs.LG

LAF-YOLOv10 with Partial Convolution Backbone, Attention-Guided Feature Pyramid, Auxiliary P2 Head, and Wise-IoU Loss for Small Object Detection in Drone Aerial Imagery

Sohail Ali Farooqui, Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam

详情
英文摘要

Unmanned aerial vehicles serve as primary sensing platforms for surveillance, traffic monitoring, and disaster response, making aerial object detection a central problem in applied computer vision. Current detectors struggle with UAV-specific challenges: targets spanning only a few pixels, cluttered backgrounds, heavy occlusion, and strict onboard computational budgets. This study introduces LAF-YOLOv10, built on YOLOv10n, integrating four complementary techniques to improve small-object detection in drone imagery. A Partial Convolution C2f (PC-C2f) module restricts spatial convolution to one quarter of backbone channels, reducing redundant computation while preserving discriminative capacity. An Attention-Guided Feature Pyramid Network (AG-FPN) inserts Squeeze-and-Excitation channel gates before multi-scale fusion and replaces nearest-neighbor upsampling with DySample for content-aware interpolation. An auxiliary P2 detection head at 160$\times$160 resolution extends localization to objects below 8$\times$8 pixels, while the P5 head is removed to redistribute parameters. Wise-IoU v3 replaces CIoU for bounding box regression, attenuating gradients from noisy annotations in crowded aerial scenes. The four modules address non-overlapping bottlenecks: PC-C2f compresses backbone computation, AG-FPN refines cross-scale fusion, the P2 head recovers spatial resolution, and Wise-IoU stabilizes regression under label noise. No individual component is novel; the contribution is the joint integration within a single YOLOv10 framework. Across three training runs (seeds 42, 123, 256), LAF-YOLOv10 achieves 35.1$\pm$0.3\% mAP@0.5 on VisDrone-DET2019 with 2.3\,M parameters, exceeding YOLOv10n by 3.3 points. Cross-dataset evaluation on UAVDT yields 35.8$\pm$0.4\% mAP@0.5. Benchmarks on NVIDIA Jetson Orin Nano confirm 24.3 FPS at FP16, demonstrating viability for embedded UAV deployment.

2602.13376 2026-02-17 cs.CV cs.AI cs.CL

An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation

Giang Son Nguyen, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Wenya Wang

Comments 9 pages, 4 tables. Under review

详情
英文摘要

Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.

2602.13367 2026-02-17 cs.AI cs.CL

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen

详情
英文摘要

We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.