Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning
Comments 19 pages, 6 figures; accepted by ACL 2026 Findings
Zhiyuan Chang, Mingyang Li, Yuekai Huang, Ziyou Jiang, Xiaojun Jia, Qian Xiong, Junjie Wang, Zhaoyang Li, Qing Wang
Comments 19 pages, 6 figures; accepted by ACL 2026 Findings
Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation
Lang Cao, Hui Ruan, Yongqian Li, Peng Chao, Wu Ning, Haonan Song, Renhong Chen, Yitong Li
Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) based on an entropy-driven sampling method where each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. Then, TreeAdv aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (all leaf nodes), and TreeAdv can easily apply to group-based objectives such as GRPO or GSPO. Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO, while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
Devang Kulshreshtha, Hang Su, Haibo Jin, Chinmay Hegde, Haohan Wang
We introduce \emph{self-jailbreaking}, a threat model in which an aligned LLM guides its own compromise. Unlike most jailbreak techniques, which often rely on handcrafted prompts or separate attacker models, self-jailbreaking requires no external red-team LLM: the target model's own internal knowledge suffices. We operationalize this via \textbf{Self-Jailbreaking via Lexical Insertion Prompting (\textsc{SLIP})}, a black-box algorithm that casts jailbreaking as breadth-first tree search over multi-turn dialogues, incrementally inserting missing content words from the attack goal into benign prompts using the target model as its own guide. Evaluations on AdvBench and HarmBench show \textsc{SLIP} achieves 90--100\% Attack Success Rate (ASR) (avg.\ 94.7\%) across most of the eleven tested models (including GPT-5.1, Claude-Sonnet-4.5, Gemini-2.5-Pro, and DeepSeek-V3), with only ${\sim}7.9$ LLM calls on average, 3--6$\times$ fewer than prior methods. We evaluate existing defenses, show that regex-based approaches are evaded by prompt paraphrasing, and propose the Semantic Drift Monitor (SDM) defense that tracks \textsc{SLIP}'s embedding-space trajectory, achieving 76\% detection at 5\% FPR. However, SDM remains insufficient against adaptive attack strategies, underscoring the need for more advanced defense mechanisms tailored to the self-jailbreaking threat surface. We release our code for reproducibility.
Fabian Retkowski, Alexander Waibel
Comments Accepted at LREC 2026
Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
Kanghee Lee, Injae Lee, Minseok Kwak, Jungi Hong, Kwonyoung Ryu, Jaesik Park
The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling MLLMs to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under complex and diverse scenarios, consisting of 1M QA pairs across 6 tasks. Our proposed dataset spans both indoor and outdoor scenes, enabling comprehensive evaluation in diverse real-world scenarios. In addition, we introduce a new baseline for multi-view settings, SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset effectively enhances spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and challenging QAs. Code and dataset will be available soon.
Chenggong Zhang, Haopeng Wang, Hexi Meng
Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy \(82.2\%\) and TPR \(78.9\%\). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
Chihiro Noguchi, Takaki Yamamoto
Comments Accepted to CVPR Findings 2026
End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code is available at https://github.com/ToyotaInfoTech/PEBC.
Yun He, Francesco Pittaluga, Ziyu Jiang, Matthias Zwicker, Manmohan Chandraker, Zaid Tasneem
Comments Project Page: https://yunhe24.github.io/langdrivectrl/
LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It represents each video as an explicit 3D scene graph, decomposing the scene into a static background and dynamic object nodes. To enable fine-grained editing and realism, it introduces a feedback-driven agentic pipeline. An Orchestrator converts user instructions into executable graphs that coordinate specialized multi-modal agents and tools. An Object Grounding Agent aligns free-form text with target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and harmonized using a video diffusion tool, and then further refined by a Video Reviewer Agent to ensure photorealism and appearance alignment. LangDriveCTRL supports both object node editing (removal, insertion, and replacement) and multi-object behavior editing from natural-language instructions. Quantitatively, it achieves nearly $2\times$ higher instruction alignment than the previous SoTA, with superior photorealism, structural preservation, and traffic realism. Project page is available at: https://yunhe24.github.io/langdrivectrl/.
Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira, Gabriel Kronberger
Comments 8 pages, no figures; accepted in Royal Society Philosophical Transactions A special issue "Symbolic regression in the physical sciences"
Symbolic regression (SR) has emerged as a powerful method for uncovering interpretable mathematical relationships from data, offering a novel route to both scientific discovery and efficient empirical modelling. This article introduces the Special Issue on Symbolic Regression for the Physical Sciences, motivated by the Royal Society discussion meeting held in April 2025. The contributions collected here span applications from automated equation discovery and emergent-phenomena modelling to the construction of compact emulators for computationally expensive simulations. The introductory review outlines the conceptual foundations of SR, contrasts it with conventional regression approaches, and surveys its main use cases in the physical sciences, including the derivation of effective theories, empirical functional forms and surrogate models. We summarise methodological considerations such as search-space design, operator selection, complexity control, feature selection, and integration with modern AI approaches. We also highlight ongoing challenges, including scalability, robustness to noise, overfitting and computational complexity. Finally we emphasise emerging directions, particularly the incorporation of symmetry constraints, asymptotic behaviour and other theoretical information. Taken together, the papers in this Special Issue illustrate the accelerating progress of SR and its growing relevance across the physical sciences.
Joyjit Roy, Samaresh Kumar Singh
Comments 6 pages, 2 figures. Published in the 4th IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI 2026), IEEE
Financial sentiment analysis enhances market understanding. However, standard Natural Language Processing (NLP) approaches encounter significant challenges when applied to small datasets. This study presents a comparative evaluation of embedding-based techniques for financial news sentiment classification in resource-constrained environments. Word2Vec, GloVe, and sentence transformer representations are evaluated in combination with gradient boosting on a manually labeled dataset of 349 financial news headlines. Experimental results identify a substantial gap between validation and test performance. Despite strong validation metrics, models underperform relative to trivial baselines. The analysis indicates that pretrained embeddings yield diminishing returns below a critical data sufficiency threshold. Small validation sets contribute to overfitting during model selection. Practical application is illustrated through weekly sentiment aggregation and narrative summarization for market monitoring. Overall, the findings indicate that embedding quality alone cannot address fundamental data scarcity in sentiment classification. Practitioners with limited labeled data should consider alternative strategies, including few-shot learning, data augmentation, or lexicon-enhanced hybrid methods.
Xuwei Tan, Yao Ma, Xueru Zhang
Comments Accepted to ACL 2026 Main Conference
Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.
Hao Tang, Hao Chen, Chao Li
Comments ICLR 2026 (International Conference on Learning Representations)
Neural operators offer powerful approaches for solving parametric partial differential equations, but extending them to spherical domains remains challenging due to the need to preserve intrinsic geometry while avoiding distortions that break rotational consistency. Existing spherical operators rely on rotational equivariance but often lack the flexibility for real-world complexity. We propose a generalized operator-design framework based on the designable spherical Green's function and its harmonic expansion, establishing a solid operator-theoretic foundation for spherical learning. Based on this, we propose an absolute and relative position-dependent Green's function that enables flexible balance of equivariance and invariance for real-world modeling. The resulting operator, Green's-function Spherical Neural Operator (GSNO) with a novel spectral learning method, can adapt to non-equivariant systems while retaining spectral efficiency and grid invariance. To exploit GSNO, we develop SHNet, a hierarchical architecture that combines multi-scale spectral modeling with spherical up-down sampling, enhancing global feature representation. Evaluations on diffusion MRI, shallow water dynamics, and global weather forecasting, GSNO and SHNet consistently outperform state-of-the-art methods. The theoretical and experimental results position GSNO as a principled and generalized framework for spherical operator design and learning, bridging rigorous theory with real-world complexity. The code is available at: https://github.com/haot2025/GSNO.
Tao Chen, Shaobo Ju, Qiong Wu, Chenxin Fang, Kun Zhang, Jun Peng, Hui Li, Yiyi Zhou, Rongrong Ji
Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into three recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting Qwen3-VL 8B to the level of GPT-5 on MLVU, but also show its superior efficiency in handling long videos. e.g., enabling LLaVA-Video understand up to an hour of videos in less than 1.2 minutes on a single 4090 GPU.
Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu
Agents, language model-based systems capable of reasoning, planning, and acting are widely adopted in real-world tasks, yet how their performance changes as these systems scale across key dimensions remains underexplored. We introduce quantitative scaling principles for agent systems as a predictive model, capturing how performance varies with coordination, model capability, and measurable system and task factors. Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single-Agent and four Multi-Agent: Independent, Centralized, Decentralized, Hybrid), and three LLM families, we perform controlled evaluations, standardizing tools, prompts, and compute to isolate architectural effects. The resulting model achieves a cross-validated R^2=0.373 across all six benchmarks (R^2=0.413 with a task-grounded capability metric). We identify a robust capability-saturation effect and additional patterns: (1) a coordination yields diminishing returns once single-agent baselines exceed certain performance; (2) tool-heavy tasks appear to incur multi-agent overhead; and (3) architectures without centralized verification tend to propagate errors more than those with centralized coordination. Relative performance change compared to single-agent baseline ranges from +80.8% on decomposable financial reasoning to -70.0% on sequential planning, demonstrating that architecture-task alignment determines collaborative success. The framework identifies the best-performing architecture for 87% of held-out configurations and shows consistent relative architecture preferences on unseen frontier models. Agent effectiveness depends on alignment between coordination and task structure, and that mismatched coordination degrades the performance.
Longjie Zhao, Ziming Hong, Zhenyang Ren, Runnan Chen, Mingming Gong, Tongliang Liu
3D Gaussian Splatting (3DGS) has become a leading representation for high-fidelity 3D assets, yet protecting these assets via digital watermarking remains an open challenge. Existing 3DGS watermarking methods are robust only to classical distortions and fail under diffusion editing, which operates at both the 2D image level and the 3D scene level, covertly erasing embedded watermarks while preserving visual plausibility. We present RDSplat, the first 3DGS watermarking framework designed to withstand both 2D and 3D diffusion editing. Our key observation is that diffusion models act as low-pass filters that preserve low-frequency structures while regenerating high-frequency details. RDSplat exploits this by embedding 100-bit watermarks exclusively into low-frequency Gaussian primitives identified through Frequency-Aware Primitive Selection (FAPS), which combines the Mip score and directional balance score, while freezing all other primitives. Training efficiency is achieved through a surrogate strategy that replaces costly diffusion forward passes with Gaussian blur augmentation. A dedicated decoder, GeoMark, built on ViT-S/16 with spatially periodic secret embedding, jointly resists diffusion editing and the geometric transformations inherent to novel-view rendering. Extensive experiments on four benchmarks under seven 2D diffusion attacks and iterative 3D editing demonstrate strong classical robustness (bit accuracy 0.811) and competitive diffusion robustness (bit accuracy 0.701) at 100-bit capacity, while completing fine-tuning in 3 to 7 minutes on a single RTX 4090 GPU.
Jiaze Li, Yan Lu, Bin Liu, Guojun Yin, Mang Ye
Two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then operates cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned in the single-modality training naturally propagate into the following cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced, leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL could enable modality-invariant feature learning and a more generalized model. The code is available at https://github.com/priester3/DMDL.
Shulei Wang, Longhui Wei, Xin He, Jianbo Ouyang, Hui Lu, Zhou Zhao, Qi Tian
Comments Accepted by CVPR 2026
Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. Github Link: https://github.com/wang-shulei/PSR
Colin Doumont, Donney Fan, Natalie Maus, Jacob R. Gardner, Henry Moss, Geoff Pleiss
Existing high-dimensional Bayesian optimization (BO) methods aim to overcome the curse of dimensionality by carefully encoding structural assumptions, from locality to sparsity to smoothness, into the optimization procedure. Surprisingly, we demonstrate that these approaches are outperformed by arguably the simplest method imaginable: Bayesian linear regression. After applying a geometric transformation to avoid boundary-seeking behavior, Gaussian processes with linear kernels match state-of-the-art performance on tasks with 60- to 6,000-dimensional search spaces. Linear models offer numerous advantages over their non-parametric counterparts: they afford closed-form sampling and their computation scales linearly with data, a fact we exploit on molecular optimization tasks with >20,000 observations. Coupled with empirical analyses, our results suggest the need to depart from past intuitions about BO methods in high-dimensions.
Bhuvan Sachdeva, Karan Uppal, Abhinav Java, Vineeth N. Balasubramanian
Comments CVPR 2026 (Oral)
Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. We introduce Perfection Gap Factor (PGF), a normalized metric that measures change in performance as a result of task transfer. We utilize PGF to compute Task Transferability, which captures both the breadth and the magnitude of transfer induced by a source task. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.
Zhiling Zheng
Comments 17 pages, 3 figures
Artificial intelligence (AI) and data science are transforming chemical research, yet few formal courses are tailored to synthetic and experimental chemists, who often face steep entry barriers due to limited coding experience and lack of chemistry-specific examples. We present the design and implementation of AI4CHEM, an introductory data-driven chem-istry course created for students on the synthetic chemistry track with no prior programming background. The curricu-lum emphasizes chemical context over abstract algorithms, using an accessible web-based platform to ensure zero-install machine learning (ML) workflow development practice and in-class active learning. Assessment combines code-guided homework, literature-based mini-reviews, and collaborative projects in which students build AI-assisted workflows for real experimental problems. Learning gains include increased confidence with Python, molecular property prediction, reaction optimization, and data mining, and improved skills in evaluating AI tools in chemistry. All course materials are openly available, offering a discipline-specific, beginner-accessible framework for integrating AI into synthetic chemistry training.
Maria Pilligua, David Serrano-Lozano, Pai Peng, Ramon Baldrich, Michael S. Brown, Javier Vazquez-Corral
Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement methods rely on paired training data captured under a single low-light condition and a well-lit reference. The lack of radiance diversity limits our understanding of how enhancement techniques perform across varying illumination intensities. We introduce the Multi-Illumination Low-Light (MILL) dataset, containing images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. MILL enables comprehensive evaluation of enhancement algorithms across variable lighting conditions. We benchmark several state-of-the-art methods and reveal significant performance variations across intensity levels. Leveraging the unique multi-illumination structure of our dataset, we propose improvements that enhance robustness across diverse illumination scenarios. Our modifications achieve up to 10 dB PSNR improvement for DSLR and 2 dB for the smartphone on Full HD images.
Hangshuo Tian
Particle filters (PFs) are often combined with swarm intelligence (SI) algorithms, such as Chicken Swarm Optimization (CSO), for particle rejuvenation. Separately, Kullback--Leibler divergence (KLD) sampling is a common strategy for adaptively sizing the particle set. However, the theoretical interaction between SI-based rejuvenation kernels and KLD-based adaptive sampling is not yet fully understood. This paper investigates this specific interaction. We analyze, under a simplified modeling framework, the effect of the CSO rejuvenation step on the particle set distribution. We propose that the fitness-driven updates inherent in CSO can be approximated as a form of mean-square contraction. This contraction tends to produce a particle distribution that is more concentrated than that of a baseline PF, or in mathematical terms, a distribution that is plausibly more ``peaked'' in a majorization sense. By applying Karamata's inequality to the concave function that governs the expected bin occupancy in KLD-sampling, our analysis suggests a connection: under the stated assumptions, the CSO-enhanced PF (CPF) is expected to require a lower \emph{expected} particle count than the standard PF to satisfy the same statistical error bound. The goal of this study is not to provide a fully general proof, but rather to offer a tractable theoretical framework that helps to interpret the computational efficiency empirically observed when combining these techniques, and to provide a starting point for designing more efficient adaptive filters.
Maria-Teresa De Rosa Palmini, Eva Cetinic
The ambiguity between generalization and memorization in TTI diffusion models becomes pronounced when prompts invoke culturally shared visual references, a phenomenon we term multimodal iconicity. These are instances in which images and texts reflect established cultural associations, such as when a title recalls a familiar artwork or film scene. Such cases challenge existing approaches to evaluating memorization, as they define a setting in which instance-level memorization and culturally grounded generalization are structurally intertwined. To address this challenge, we propose an evaluation framework to assess a model's ability to remain culturally grounded without relying on visual replication. Specifically, we introduce the Cultural Reference Transformation (CRT) metric, which separates two dimensions of model behavior: Recognition, whether a model evokes a reference, from Realization, how it depicts it through replication or reinterpretation. We evaluate five diffusion models on 767 Wikidata-derived cultural references, covering both still and moving imagery, and find differences in how they respond to multimodal iconicity: some show weaker recognition, while others rely more heavily on replication. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, we find that cultural reference recognition correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our findings show that the behavior of diffusion models in culturally iconic settings cannot be reduced to simple reproduction, but depends on how references are recognized and realized, advancing evaluation beyond simple text-image matching toward richer contextual understanding.
Ethan Griffiths, Maryam Haghighat, Simon Denman, Clinton Fookes, Milad Ramezani
Comments 8 pages, 2 figures, Accepted for publication in IEEE RA-L (2026)
This article presents HOTFLoc++, an end-to-end hierarchical framework for LiDAR place recognition, re-ranking, and 6-DoF metric localisation in forests. Leveraging an octree-based transformer, our approach extracts features at multiple granularities to increase robustness to clutter, self-similarity, and viewpoint changes in challenging scenarios, including ground-to-ground and ground-to-aerial in forest and urban environments. We propose learnable multi-scale geometric verification to reduce re-ranking failures due to degraded single-scale correspondences. Our joint training protocol enforces multi-scale geometric consistency of the octree hierarchy via joint optimisation of place recognition with re-ranking and localisation, improving place recognition convergence. Our system achieves comparable or lower localisation errors to baselines, with runtime improvements of almost two orders of magnitude over RANSAC-based registration for dense point clouds. Experimental results on public datasets show the superiority of our approach compared to state-of-the-art methods, achieving an average Recall@1 of 90.7% on CS-Wild-Places: an improvement of 29.6 percentage points over baselines, while maintaining high performance on single-source benchmarks with an average Recall@1 of 91.7% and 97.9% on Wild-Places and MulRan, respectively. Our method achieves under 2m and 5$^{\circ}$ error for 97.2% of 6-DoF registration attempts, with our multi-scale re-ranking module reducing localisation errors by ~2x on average. The code is available at https://github.com/csiro-robotics/HOTFLoc.
Benjamin Davis, Hannah Stuart
Vision-based tactile sensors (VBTSs) are a promising technology for robots, providing them with dense signals that can be translated into a multi-faceted understanding of contact. However, existing VBTS tactile surfaces make use of silicone gels, which provide high sensitivity but easily deteriorate from loading and surface wear. We propose that polyurethane rubber, a typically harder material used for high-load applications like shoe soles, rubber wheels, and industrial gaskets, may provide improved physical gel resilience, potentially at the cost of sensitivity. To compare the resilience and sensitivity of two polyurethane gel formulations against a common silicone baseline, we propose a series of repeatable characterization protocols. Our resilience tests assess sensor durability across normal loading, shear loading, and abrasion. For sensitivity, we introduce learning-free assessments of force and spatial sensitivity to directly measure the physical capabilities of each gel without effects introduced from data and model quality. We also include a bottle cap loosening and tightening demonstration to validate the results of our controlled tests with a real-world example. Our results show that polyurethane yields a more robust sensor. While it sacrifices sensitivity at low forces, the effective force range is largely increased, revealing the utility of polyurethane VBTSs over silicone versions in more rugged, high-load applications.
Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Häggström, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Håkan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar
The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support have resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
Comments 12 pages
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
Julia Wilkins, Jaehun Kim, Matthew E. P. Davies, Juan Pablo Bello, Matthew C. McCallum
Comments Preprint; under review
Music representations are the backbone of modern recommendation systems, powering playlist generation, similarity search, and personalized discovery. Yet most embeddings offer little control for adjusting a single musical attribute, e.g., changing only the mood of a track while preserving its genre or instrumentation. In this work, we address the problem of controllable music retrieval through embedding-based transformation, where the objective is to retrieve songs that remain similar to a seed track but are modified along one chosen dimension. We propose a novel framework for mood-guided music embedding transformation, which learns a mapping from a seed audio embedding to a target embedding guided by mood labels, while preserving other musical attributes. Because mood cannot be directly altered in the seed audio, we introduce a sampling mechanism that retrieves proxy targets to balance diversity with similarity to the seed. We train a lightweight translation model using this sampling strategy and introduce a novel joint objective that encourages transformation and information preservation. Extensive experiments on two datasets show strong mood transformation performance while retaining genre and instrumentation far better than training-free baselines, establishing controllable embedding transformation as a promising paradigm for personalized music retrieval.
Guanjie Cheng, Siyang Liu, Xinkui Zhao, Yishan Chen, Junqin Huang, Linghe Kong, Shiguang Deng
Comments 12 pages, 12 figures, 6 tables. Submitted to The International Conference on Web Services (ICWS)
Mobile edge crowdsensing (MECS) enables large-scale real-time sensing services, but its continuous data collection and transmission pipeline exposes terminal devices to dynamic privacy risks. Existing privacy protection schemes in MECS typically rely on static configurations or coarse-grained adaptation, making them difficult to balance privacy, data utility, and device overhead under changing channel conditions, data sensitivity, and resource availability. To address this problem, we propose ALPINE, a lightweight closed-loop framework for adaptive privacy budget allocation in MECS. ALPINE performs multi-dimensional risk perception on terminal devices by jointly modeling channel, semantic, contextual, and resource risks, and maps the resulting risk state to a privacy budget through an offline-trained TD3 policy. The selected budget is then used to drive local differential privacy perturbation before data transmission, while edge-side privacy-utility evaluation provides feedback for policy switching and periodic refinement. In this way, ALPINE forms a terminal-edge collaborative control loop that enables real-time, risk-adaptive privacy protection with low online overhead. Extensive experiments on multiple real-world datasets show that ALPINE achieves a better privacy-utility trade-off than representative baselines, reduces the effectiveness of membership inference, property inference, and reconstruction attacks, and preserves robust downstream task performance under dynamic risk conditions. Prototype deployment further demonstrates that ALPINE introduces only modest runtime overhead on resource-constrained devices.
Gaoxiang Huang, Songning Lai, Yutao Yue
Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restricts their practical utility and undermines the reliability of concept-based strategies. To address these challenges, we propose a Lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without the need for region annotations. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. Complexity analysis reveals that the parameter count and FLOPs of LDCBM are less than 5% higher than those of Vanilla CBM. Furthermore, background mask intervention experiments validate the model's strong capability to suppress irrelevant image regions, further corroborating the high precision of the visual-concept mapping under LDCBM's lightweight design paradigm. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.
扫码添加微信好友,提出您的宝贵建议 👇
💡 备注请填写:网站反馈