arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
All subject categories: 1107
2604.18637 2026-05-05 q-bio.NC cs.AI cs.CY

NeuroAI and Beyond: Bridging Between Advances in Neuroscience and Artificial Intelligence

Anthony Zador, Jean-Marc Fellous, Terrence Sejnowski, Gina Adam, James B Aimone, Akwasi Akwaboah, Yiannis Aloimonos, Carmen Amo Alonso, Chiara Bartolozzi, Michael J. Bennington, Michael Berry, Bing W. Brunton, Gert Cauwenberghs, Hillel J. Chiel, Tobi Delbruck, John Doyle, Jason Eshraghian, Ralph Etienne-Cummings, Cornelia Fermuller, Matthew Jacobsen, Ali A. Minai, Barbara Oakley, Alexander G. Ororbia, Joe Paton, Blake Richards, Yulia Sandamirskaya, Abhronil Sengupta, Shihab Shamma, Michael P. Stryker, Seong Jong Yoo, Steven W. Zucker

Abstract

Neuroscience and Artificial Intelligence (AI) have made impressive progress in recent years but remain only loosely interconnected. Based on a workshop convened by the National Science Foundation in August 2025, we identify three fundamental capability gaps in current AI: the inability to interact with the physical world, inadequate learning that produces brittle systems, and unsustainable energy and data inefficiency. We describe the neuroscience principles that address each: co-design of body and controller, prediction through interaction, multi-scale learning with neuromodulatory control, hierarchical distributed architectures, and sparse event-driven computation. We present a research roadmap organized around these principles at near, mid, and long-term horizons. We argue that realizing this program requires a new generation of researchers trained across the boundary between neuroscience and engineering, and describe the institutional conditions needed to support them: interdisciplinary training, hardware access, community standards, and ethics. We conclude that NeuroAI (neuroscience-informed artificial intelligence) has the potential to overcome limitations of current AI while deepening our understanding of biological neural computation.

2604.17815 2026-05-05 cs.HC cs.CL cs.CY

Navigating the Conceptual Multiverse

Andre Ye, Jenny Y. Huang, Alicia Guo, Rose Novick, Tamara Broderick, Mitchell L. Gordon

Abstract

When language models answer open-ended problems, they implicitly make hidden decisions that shape their outputs, leaving users with uncontextualized answers rather than a working map of the problem; drawing on multiverse analysis from statistics, we build and evaluate the conceptual multiverse, an interactive system that represents conceptual decisions such as how to frame a question or what to value as a space users can transparently inspect, intervenably change, and check against principled domain reasoning; for this structure to be worth navigating rather than misleading, it must be rigorous and checkable against domain reasoning norms, so we develop a general verification framework that enforces properties of good decision structures like unambiguity and completeness calibrated by expert-level reasoning; across three domains, the conceptual multiverse helped participants develop a working map of the problem, with philosophy students rewriting essays with sharper framings and reversed theses, alignment annotators moving from surface preferences to reasoning about user intent and harm, and poets identifying compositional patterns that clarified their taste.

2604.16323 2026-05-05 cs.SE cs.AI

Beyond the 'Diff': Addressing Agentic Entropy in Agentic Software Development

Matteo Casserini, Alessandro Facchini, Andrea Ferrario

Comments Camera-ready version of the position paper accepted to the Human-Centered Explainable AI (HCXAI) Workshop at CHI 2026

Abstract

As autonomous coding agents become deeply embedded in software development workflows, their high operational velocity introduces a critical oversight challenge: the accumulating divergence between agentic actions and architectural intent. We term this process agentic entropy: a systemic drift that traditional code diff-based and HCXAI methods fail to capture, as they address local outputs rather than global agentic behaviour. To close this gap, we propose a process-oriented explainability framework that exposes how agentic decisions unfold across time, tool calls, and architectural boundaries. Built around three pillars (conformity seeding, reasoning monitoring, and a causal graph interface), our approach provides intent-level telemetry that complements, rather than replaces, existing review practices. We demonstrate its relevance across two user profiles: lay users engaged in vibe coding, who gain structural visibility otherwise masked by functional success; and professional developers, who gain richer contextual grounding for code review without increased overhead. By treating cognitive drift as a first-class concern alongside code quality, our framework supports the minimum level of human comprehension required for agentic oversight to remain substantive.

2604.09111 2026-05-05 eess.AS cs.AI

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Ku, Do Hyun Lee, Hong Kook Kim

Comments Accepted to ICPR 2026

Abstract

Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. We then extend this approach to PS-Comet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that PS-Comet achieves more accurate lip-sync with semantic preservation than PS alone.
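
The DTW step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the vowel coordinates in `TOY_VOWEL_SPACE`, the Euclidean `vowel_distance`, and the helper `pick_closest` are all hypothetical stand-ins for the vowel distances the paper measures from training data.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical 2-D vowel positions (height, backness); placeholder values only.
TOY_VOWEL_SPACE = {
    "i": (0.0, 0.0), "e": (0.5, 0.0), "a": (1.0, 0.5),
    "o": (0.5, 1.0), "u": (0.0, 1.0),
}

def vowel_distance(a: str, b: str) -> float:
    """Assumed local cost: Euclidean distance in the toy vowel space."""
    (a0, a1), (b0, b1) = TOY_VOWEL_SPACE[a], TOY_VOWEL_SPACE[b]
    return math.hypot(a0 - b0, a1 - b1)

def dtw_cost(source: Sequence[str], target: Sequence[str],
             local_cost: Callable[[str, str], float]) -> float:
    """Classic DTW: minimum cumulative cost of aligning two sequences."""
    n, m = len(source), len(target)
    inf = float("inf")
    # acc[i][j] = best cost of aligning source[:i] with target[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = local_cost(source[i - 1], target[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

def pick_closest(source: Sequence[str],
                 candidates: List[List[str]]) -> List[str]:
    """Choose the candidate vowel sequence with the lowest DTW cost."""
    return min(candidates, key=lambda c: dtw_cost(source, c, vowel_distance))
```

In this sketch, each paraphrase candidate would be reduced to its vowel sequence before calling `pick_closest`; how the paper extracts vowels and combines this score with the isochrony constraint is not specified in the abstract.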

2604.03401 2026-05-05 cs.HC cs.AI cs.CV

Can LLMs Reason About Attention? Towards Zero-Shot Analysis of Multimodal Classroom Behavior

Nolan Platt, Sehrish Nizamani, Alp Tural, Elif Tural, Saad Nizamani, Andrew Katz, Yoonje Lee, Nada Basit

Comments 8 pages, 2 figures. Preprint

Abstract

Understanding student engagement usually requires time-consuming manual observation or invasive recording that raises privacy concerns. We present a privacy-preserving pipeline that analyzes classroom videos to extract insights about student attention, without storing any identifiable footage. Our system runs on a single GPU, using OpenPose for skeletal extraction and Gaze-LLE for visual attention estimation. Original video frames are deleted immediately after pose extraction, thus only geometric coordinates (stored as JSON) are retained, ensuring compliance with FERPA. The extracted pose and gaze data is processed by QwQ-32B-Reasoning, which performs zero-shot analysis of student behavior across lecture segments. Instructors access results through a web dashboard featuring attention heatmaps and behavioral summaries. Our preliminary findings suggest that LLMs may show promise for multimodal behavior understanding, although they still struggle with spatial reasoning about classroom layouts. We discuss these limitations and outline directions for improving LLM spatial comprehension in educational analytics contexts.

2604.00187 2026-05-05 cs.HC cs.AI cs.ET

Explainable AI for Blind and Low-Vision Users: Navigating Trust, Modality, and Interpretability in the Agentic Era

Abu Noman Md Sakib, Protik Dey, Zijie Zhang, Taslima Akter

Comments Proceedings of the CHI 2026 Workshop on Human-Centered Explainable AI (HCXAI), April 13-17, 2026, Barcelona, Spain

Journal ref Proceedings of the CHI 2026 Workshop on Human-Centered Explainable AI (HCXAI)
Abstract

Explainable Artificial Intelligence (XAI) is critical for ensuring trust and accountability, yet its development remains predominantly visual. For blind and low-vision (BLV) users, the lack of accessible explanations creates a fundamental barrier to the independent use of AI-driven assistive technologies. This problem intensifies as AI systems shift from single-query tools into autonomous agents that take multi-step actions and make consequential decisions across extended task horizons, where a single undetected error can propagate irreversibly before any feedback is available. This paper investigates the unique XAI requirements of the BLV community through a comprehensive analysis of user interviews and contemporary research. By examining usage patterns across environmental perception and decision support, we identify a significant modality gap. Empirical evidence suggests that while BLV users highly value conversational explanations, they frequently experience "self-blame" for AI failures. The paper concludes with a research agenda for accessible Explainable AI in agentic systems, advocating for multimodal interfaces, blame-aware explanation design, and participatory development.

2603.27833 2026-05-05 math.OC cs.IT cs.MA cs.RO cs.SY eess.SY math.IT

Separation is Optimal for LQR under Intermittent Feedback

Abdullah Y. Etcibasi, C. Emre Koksal, Eylem Ekici

Abstract

In this work, we first prove that the separation principle holds for communication-constrained LQR problems under i.i.d. zero-mean disturbances with a symmetric distribution. We then solve the dynamic programming problem and show that the optimal scheduling policy is a symmetric threshold rule on the accumulated disturbance since the most recent update, while the optimal controller is a discounted linear feedback law independent of the scheduling policy.
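
A symmetric threshold rule of the kind named in this abstract might look like the sketch below for a scalar plant. The function name `schedule`, the scalar dynamics parameter `a`, and the `threshold` value are assumptions made for illustration; the abstract does not give the paper's system model or the optimal threshold.

```python
from typing import List, Sequence

def schedule(disturbances: Sequence[float], a: float,
             threshold: float) -> List[bool]:
    """Per-step transmit decisions for an assumed scalar plant.

    acc tracks the disturbance accumulated (and propagated through the
    dynamics) since the most recent update; an update resets it to zero.
    """
    acc = 0.0
    decisions = []
    for w in disturbances:
        acc = a * acc + w
        send = abs(acc) > threshold  # symmetric: depends only on |acc|
        decisions.append(send)
        if send:
            acc = 0.0
    return decisions
```

Because the rule depends only on the magnitude of the accumulated disturbance, negating every disturbance leaves the transmit decisions unchanged.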

2603.20999 2026-05-05 cs.NI cs.CV cs.MM cs.RO eess.IV

Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

Aizierjiang Aiersilan, Zhangfei Yang

Comments Accepted by the 35th International Conference on Computer Communications and Networks (ICCCN 2026)

Abstract

Adaptive 360° video streaming for teleoperation faces two coupled challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over fluctuating wireless channels. While Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their lack of interpretability and dependence on offline training limit deployment in safety-critical systems. We propose OrbitStream, a training-free framework that formulates viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract operator gaze, and employs a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines (~98.5%). Across 3,600 Monte Carlo simulations, it ranks second among 12 algorithms (QoE 2.71 vs. BOLA-E's 2.80), outperforming FastMPC (1.84), with 1.01 ms decision latency and minimal rebuffering.
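
A proportional-derivative controller with output saturation, as a generic control building block, can be sketched as below. This is not OrbitStream's controller: the class name `SaturatedPD`, the gains, and the output limits are hypothetical, and the abstract does not say how the controller output is mapped to a bitrate decision.

```python
from dataclasses import dataclass

@dataclass
class SaturatedPD:
    kp: float              # proportional gain (assumed value chosen by the caller)
    kd: float              # derivative gain (assumed value chosen by the caller)
    out_min: float         # lower saturation limit
    out_max: float         # upper saturation limit
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float) -> float:
        """Return the clamped PD output for one control interval of length dt."""
        error = target - measured
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = self.kp * error + self.kd * derivative
        # Saturation keeps the command inside the allowed range.
        return max(self.out_min, min(self.out_max, raw))
```

For buffer regulation, `target` would be a desired buffer level and `measured` the current one; a positive output then signals that the buffer is below target.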

2603.18066 2026-05-05 cs.NE cs.AI cs.AR cs.LG

A Synthesizable RTL Implementation of Predictive Coding Networks

Timothy Oh

Abstract

Backpropagation has enabled modern deep learning but is difficult to realize as an online, fully distributed hardware learning system due to global error propagation, phase separation, and heavy reliance on centralized memory. Predictive coding offers an alternative in which inference and learning arise from local prediction-error dynamics between adjacent layers. This paper presents a digital architecture that implements a discrete-time predictive coding update directly in hardware. Each neural core maintains its own activity, prediction error, and synaptic weights, and communicates only with adjacent layers through hardwired connections. Supervised learning and inference are supported via a uniform per-neuron clamping primitive that enforces boundary conditions while leaving the internal update schedule unchanged. The design is a deterministic, synthesizable RTL substrate built around a sequential MAC datapath and a fixed finite-state schedule. Rather than executing a task-specific instruction sequence inside the learning substrate, the system evolves under fixed local update rules, with task structure imposed through connectivity, parameters, and boundary conditions. The contribution of this work is not a new learning rule, but a complete synthesizable digital substrate that executes predictive-coding learning dynamics directly in hardware.
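
To make the local update rule concrete, here is a minimal software sketch of a textbook discrete-time predictive coding step on a chain of scalar units with linear predictions. It is an assumption-laden illustration, not the paper's RTL or its exact equations: the function names, the step sizes `gamma` and `eta`, and the scalar linear model are all hypothetical.

```python
from typing import List, Sequence

def errors(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """e[l] is the prediction error at layer l+1: x[l+1] - w[l] * x[l]."""
    return [x[l + 1] - w[l] * x[l] for l in range(len(w))]

def energy(x: Sequence[float], w: Sequence[float]) -> float:
    """Sum of squared prediction errors (halved)."""
    return 0.5 * sum(e * e for e in errors(x, w))

def relax(x: Sequence[float], w: Sequence[float], clamped: Sequence[bool],
          steps: int = 200, gamma: float = 0.1) -> List[float]:
    """Update unclamped activities using only adjacent-layer errors."""
    x = list(x)
    for _ in range(steps):
        e = errors(x, w)
        new_x = list(x)
        for l in range(len(x)):
            if clamped[l]:
                continue  # clamping enforces a boundary condition
            grad = 0.0
            if l > 0:
                grad += e[l - 1]       # this unit's own prediction error
            if l < len(w):
                grad -= w[l] * e[l]    # error fed back from the layer above
            new_x[l] = x[l] - gamma * grad
        x = new_x
    return x

def learn(x: Sequence[float], w: Sequence[float], eta: float = 0.05) -> List[float]:
    """Local weight update: error at the target unit times source activity."""
    e = errors(x, w)
    return [w[l] + eta * e[l] * x[l] for l in range(len(w))]
```

Clamping both ends of the chain corresponds to supervised learning, while clamping only the input and letting the rest relax corresponds to inference; in both cases the same update schedule runs unchanged.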

2602.14012 2026-05-05 cs.CR cs.AI cs.SE

From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

Youpeng Li, Fuxun Yu, Xinda Wang

Abstract

The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, demonstrating that on-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization methods, and specialized VD LLMs. Our study further reveals VD-specific post-training guidelines and insights beyond common practices: (1) For data curation, contrary to the widespread use of rationalization-based supervision in prior VD work, SFT based on rejection sampling proves more effective, as rationalization can introduce hallucinations; in RL training, the inherently skewed difficulty distribution of vulnerabilities leads difficulty-aware data filtering to drastically reduce data coverage, causing non-negligible performance loss, and undermines curriculum learning, while pair-based data scheduling can partially mitigate this. (2) For stage interactions, unlike preference optimization typically applied to lightly trained SFT models, increasing SFT epochs consistently benefits off-policy preference optimization in VD tasks; however, excessive SFT suppresses self-exploration in on-policy RL, limiting its gains. (3) For reward mechanisms, naively treating vulnerability classification correctness as reward signals leads to reward hacking, whereas fine-grained root-cause judgments provide more reliable credit assignment; specification-based rewards further improve efficiency at the cost of additional design and generation effort. (4) For evaluation protocols, LLM-as-a-Judge based on root-cause analysis offers a more robust alternative, albeit with variability across judge models.

2601.07885 2026-05-05 cs.CR cs.AI cs.SE

False Friends in the Shell: Unveiling the Emoticon Semantic Confusion in Large Language Models

Weipeng Jiang, Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Yang Liu

Abstract

Emoticons are widely used in digital communication to convey affective intent, yet their safety implications for Large Language Models (LLMs) remain largely unexplored. In this paper, we identify emoticon semantic confusion, a vulnerability where LLMs misinterpret ASCII-based emoticons to perform unintended and even destructive actions. To systematically study this phenomenon, we develop an automated data generation pipeline and construct a dataset containing 3,757 code-oriented test cases spanning 21 meta-scenarios, four programming languages, and varying contextual complexities. Our study on six LLMs reveals that emoticon semantic confusion is pervasive, with an average confusion ratio exceeding 38%. More critically, over 90% of confused responses yield 'silent failures', which are syntactically valid outputs but deviate from user intent, potentially leading to destructive security consequences. Furthermore, we observe that this vulnerability readily transfers to popular agent frameworks, while existing prompt-based mitigations remain largely ineffective. We call on the community to recognize this emerging vulnerability and develop effective mitigation methods to uphold the safety and reliability of the LLM system.

2601.06035 2026-05-05 cs.GR cs.CV

Investigating Anthropometric Fidelity in SAM 3D Body

Aizierjiang Aiersilan, Ruting Cheng, James Hahn

Abstract

The release of SAM 3D Body is a recent development in human mesh recovery, demonstrating improved performance in producing clean, topologically coherent meshes from single images. By leveraging the Momentum Human Rig (MHR), it achieves robustness to occlusion and diverse poses. However, our evaluation reveals a specific and consistent limitation: the model struggles to reconstruct detailed anthropometric deviations, particularly in populations exhibiting distinctive morphological alterations such as geriatric muscle atrophy, scoliosis, or pregnancy, even when these features are prominent in the input image. In this paper, we investigate this phenomenon not as a failure of the model's capacity, but as a byproduct of the "perception-distortion trade-off". We posit that the architectural reliance on the low-dimensional parametric MHR representation, combined with semantic-invariant conditioning (DINOv3) and annotation-based alignment, creates a pervasive "regression to the mean" effect. We analyze these mechanisms to understand why individual biological details are smoothed out. We further propose specific, constructive pathways for future work, such as implicit-explicit hybrid representations and Medical-in-the-Loop alignment, to extend the baseline performance of SAM 3D Body into the high-precision medical domain.

2601.05254 2026-05-05 cs.IR cs.CL

TagRAG: Tag-guided Hierarchical Knowledge Graph Retrieval-Augmented Generation

Wenbiao Tao, Xinyuan Li, Yunshi Lan, Weining Qian

Comments Accepted by ACL 2026 Findings

Abstract

Retrieval-Augmented Generation enhances language models by retrieving external knowledge to support informed and grounded responses. However, traditional RAG methods rely on fragment-level retrieval, limiting their ability to address query-focused summarization tasks. GraphRAG introduces a graph-based paradigm for global knowledge reasoning, yet suffers from inefficiencies in information extraction, costly resource consumption, and poor adaptability to incremental updates. To overcome these limitations, we propose TagRAG, a tag-guided hierarchical knowledge graph RAG framework designed for efficient global reasoning and scalable graph maintenance. TagRAG introduces two key components: (1) Tag Knowledge Graph Construction, which extracts object tags and their relationships from documents and organizes them into hierarchical domain tag chains for structured knowledge representation, and (2) Tag-Guided Retrieval-Augmented Generation, which retrieves domain-centric tag chains to localize and synthesize relevant knowledge during inference. This design adapts well to smaller language models, improves retrieval granularity, and supports efficient incremental knowledge updates. Extensive experiments on UltraDomain datasets spanning Agriculture, Computer Science, Law, and cross-domain settings demonstrate that TagRAG achieves an average winning rate of 78.36% against baselines while achieving about 14.6x the construction efficiency and 1.9x the retrieval efficiency of GraphRAG.

2512.12109 2026-05-05 cs.CY cs.AI cs.LO

A Neuro-Symbolic Framework for Accountability in Public-Sector AI

Allen Daniel Sunny, Ido Sivan-Sevilla

Comments Accepted at FAccT 2026 (The 2026 ACM Conference on Fairness, Accountability, and Transparency), June 25-28, Montreal, Canada

Abstract

Automated eligibility systems increasingly determine access to essential public benefits, but the explanations they generate often fail to reflect the legal rules that authorize those decisions. This thesis develops a legally grounded explainability framework that links system-generated decision justifications to the statutory constraints of CalFresh, California's Supplemental Nutrition Assistance Program. The framework combines a structured ontology of eligibility requirements derived from the state's Manual of Policies and Procedures (MPP), a rule extraction pipeline that expresses statutory logic in a verifiable formal representation, and a solver-based reasoning layer to evaluate whether the explanation aligns with governing law. Case evaluations demonstrate the framework's ability to detect legally inconsistent explanations, highlight violated eligibility rules, and support procedural accountability by making the basis of automated determinations traceable and contestable.

2512.11415 2026-05-05 cond-mat.stat-mech cs.LG

Emergence of Nonequilibrium Latent Cycles in Unsupervised Generative Modeling

Marco Baiesi, Alberto Rosso

Comments v2: 11 pages, 7 figures. Accepted in PRE

Abstract

We show that nonequilibrium dynamics can play a constructive role in unsupervised machine learning by inducing the spontaneous emergence of latent-state cycles. We introduce a model in which visible and hidden variables interact through two independently parametrized transition matrices, defining a Markov chain whose steady state is intrinsically out of equilibrium. Likelihood maximization drives this system toward nonequilibrium steady states with finite entropy production, reduced self-transition probabilities, and persistent probability currents in the latent space. These cycles are not imposed by the architecture but arise from training, and models that develop them avoid the low-log-likelihood regime associated with nearly reversible dynamics while more faithfully reproducing the empirical distribution of data classes. Compared with equilibrium approaches such as restricted Boltzmann machines, our model breaks the detailed balance between the forward and backward conditional transitions and relies on a log-likelihood gradient that depends explicitly on the last two steps of the Markov chain. Hence, this exploration of the interface between nonequilibrium statistical physics and modern machine learning suggests that introducing irreversibility into latent-variable models can enhance generative performance.
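
The notions of entropy production and broken detailed balance in this abstract can be illustrated with the standard steady-state entropy production rate of a discrete-time Markov chain, sigma = (1/2) * sum_ij (pi_i P_ij - pi_j P_ji) * ln(pi_i P_ij / (pi_j P_ji)). The sketch below is generic and is not the paper's visible/hidden model; the function names and the example matrices are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]

def stationary(P: Matrix, iters: int = 1000) -> List[float]:
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_production(P: Matrix) -> float:
    """Steady-state entropy production rate; zero under detailed balance."""
    pi = stationary(P)
    n = len(P)
    sigma = 0.0
    for i in range(n):
        for j in range(n):
            fwd, bwd = pi[i] * P[i][j], pi[j] * P[j][i]
            # Pairs with no flux in either direction contribute nothing here.
            if fwd > 0 and bwd > 0:
                sigma += 0.5 * (fwd - bwd) * math.log(fwd / bwd)
    return sigma

# A biased 3-state cycle (persistent current) and a reversible chain.
BIASED = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]
REVERSIBLE = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
```

The biased cycle has zero self-transition probability and a net probability current around the loop, so its entropy production is strictly positive, while the symmetric chain satisfies detailed balance and produces none.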

2511.20657 2026-05-05 cs.HC cs.AI

Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects

Raziyeh Zall, Alireza Kheyrkhah, Erik Cambria, Zahra Naseri, M. Reza Kangavari

Comments Enhanced the quality of figures, incorporated additional and recent references, and improved the manuscript for better clarity and writing quality

Abstract

The development of agents with emotional intelligence is becoming increasingly vital due to their significant role in human-computer interaction and the growing integration of computer systems across various sectors of society. Affective computing aims to design intelligent systems that can recognize, evoke, and express human emotions, thereby emulating human emotional intelligence. While previous reviews have focused on specific aspects of this field, there has been limited comprehensive research that encompasses emotion understanding, elicitation, and expression, along with the related challenges. This survey addresses this gap by providing a holistic overview of core components of artificial emotion intelligence. It covers emotion understanding through multimodal data processing, as well as affective cognition, which includes cognitive appraisal, emotion mapping, and adaptive modulation in decision-making, learning, and reasoning. Additionally, it addresses the synthesis of emotional expression across text, speech, and facial modalities to enhance human-agent interaction. This paper identifies and analyzes the key challenges and issues encountered in the development of affective systems, covering state-of-the-art methodologies designed to address them. Finally, we highlight promising future directions, with particular emphasis on the potential of generative technologies to advance affective computing.

2511.06838 2026-05-05 cs.AR cs.LG

P3-LLM: An Integrated NPU-PIM Accelerator for Edge LLM Inference Using Hybrid Numerical Formats

Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah

Comments Accepted to the 53rd IEEE/ACM International Symposium on Computer Architecture (ISCA), 2026

Abstract

The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing units (NPUs) with DRAM-based processing-in-memory (PIM) for LLM acceleration. However, the high-precision PIM compute units incur significant area and power overhead in DRAM technology, limiting the effective computation throughput. In this paper, we introduce P3-LLM, a novel NPU-PIM integrated accelerator for edge LLM inference. Our approach is threefold: First, we propose a flexible mixed-precision quantization scheme, which leverages hybrid numerical formats to quantize different LLM operands with high compression efficiency and minimal accuracy loss. Second, we architect an efficient PIM accelerator for P3-LLM, featuring enhanced compute units to support hybrid numerical formats. Our careful choice of numerical formats allows us to co-design low-precision PIM compute units that significantly boost the computation throughput under iso-area constraints. Third, we optimize the low-precision dataflow of different LLM modules by applying operator fusion to minimize the overhead of runtime dequantization. Evaluations on diverse LLMs and tasks demonstrate that P3-LLM achieves higher accuracy than state-of-the-art KV-cache quantization and weight-activation quantization algorithms. Combining the proposed quantization scheme with low-precision PIM architecture co-design, P3-LLM yields an average of $4.9\times$, $2.0\times$, and $3.4\times$ speedups over state-of-the-art LLM accelerators HBM-PIM, Ecco, and Pimba, respectively. Code is available at https://github.com/yc2367/P3-LLM.

2510.20103 2026-05-05 physics.chem-ph cs.LG

Extending machine learning model for implicit solvation to free energy calculations

Rishabh Dey, Michael Brocidiacono, Kushal Koirala, Alexander Tropsha, Konstantin I. Popov

Abstract

The implicit solvent approach offers a computationally efficient framework to model solvation effects in molecular simulations. However, its accuracy often falls short compared to explicit solvent models, limiting its use in precise thermodynamic calculations. Recent advancements in machine learning (ML) present an opportunity to overcome these limitations by leveraging neural networks to develop more precise implicit solvent potentials for diverse applications. A major drawback of current ML-based methods is their reliance on force-matching alone, which can lead to energy predictions that differ by an arbitrary constant and are therefore unsuitable for absolute free energy comparisons. Here, we introduce a novel methodology with a graph neural network (GNN)-based implicit solvent model, dubbed Lambda Solvation Neural Network (LSNN). In addition to force-matching, this network was trained to match the derivatives of alchemical variables, ensuring that solvation free energies can be meaningfully compared across chemical species. Trained on a dataset of approximately 300,000 small molecules, LSNN achieves free energy predictions with accuracy comparable to explicit-solvent alchemical simulations, while offering a computational speedup and establishing a foundational framework for future applications in drug discovery.

2510.08599 2026-05-05 eess.AS cs.AI cs.CL cs.SD

BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

Yaya Sy, Christophe Cerisara, Irina Illina

Abstract

Pruning large pre-trained transformers in a data-scarce scenario is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.

2510.05109 2026-05-05 cs.DC cs.AI cs.CL eess.SP

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee

Abstract

Large Multimodal Models (LMMs) are inherently modular, comprising vision and audio encoders, a projector, and a language backbone. Yet existing systems execute them monolithically, underutilizing the heterogeneous accelerators (NPUs, GPUs, DSPs) on modern SoCs and inflating end-to-end latency. We present Nanomind, a hardware-software co-design inference framework that decomposes each LMM into modular "bricks"--vision, projector, language, and audio--and maps each brick to its best-suited compute units. A Token-Aware Buffer Manager (TABM) enables zero-copy embedding transfer across accelerators on unified-memory SoCs, bypassing CPU bottlenecks. Combined with customized hardware, a battery-aware scheduler, and fused low-bit GEMM kernels, Nanomind runs entirely on a compact, battery-powered prototype that operates fully offline. Nanomind reduces end-to-end energy by 42.3% against mainstream edge frameworks and devkits; in its on-demand low-power mode, the prototype runs LLaVA-OneVision-Qwen2-0.5B with a camera for nearly 18.8 hours on a single 2,000 mAh battery.

2509.17661 2026-05-05 eess.AS cs.SD

Comparator Loss: An Ordinal Contrastive Loss to Derive a Severity Score for Speech-based Health Monitoring

Jacob J Webber, Oliver Watts, Lovisa Wihlborg, Johnny Tam, Christine Weaver, Suvankar Pal, Siddharthan Chandran, Cassia Valentini-Botinhao

Comments Accepted to Odyssey 2026

Abstract

Monitoring the progression of neurodegenerative disease (NDD) has important applications in planning treatment and evaluating new medications. Whereas much work has focused on discriminating patients from healthy controls, or predicting real-world health metrics, we propose a novel measure of disease progression: the severity score, derived from a model trained to minimize what we call the comparator loss. This loss ensures scores obey an ordering relation, based on diagnosis, clinical scores, or simply chronological order of recordings. The proposed comparator loss-based system has the potential to incorporate information from disparate health metrics, critical for making full use of small health-related datasets. We show that a model trained on lightly annotated data is capable of distinguishing between subjects with NDDs and healthy controls. Our score also correlates with annotations not observed in training, such as ALSFRS-R and those of speech and language therapists.

2509.10577 2026-05-05 cs.CR cs.AI

The Coding Limits of Robust Watermarking for Generative Models

Danilo Francati, Yevin Nikhel Goonatilake, Shubham Pawar, Daniele Venturi, Giuseppe Ateniese

Comments Accepted at IEEE EuroS&P 2026

Details
Abstract

We study a basic question about cryptographic watermarking for generative models: how reliable can a watermark remain when an adversary is allowed to corrupt the encoded signal? To address this question, we introduce a minimal coding abstraction that we call a zero-bit tamper-detection code. This is a secret-key procedure that samples a pseudorandom codeword and, given a candidate word, decides whether it should be treated as unmarked content or as the result of tampering with a valid codeword. It captures the two core requirements of robust watermarking: soundness and tamper detection. Within this abstraction we prove a sharp unconditional limit on robustness to independent symbol corruption. For an alphabet of size $q$, there is a critical corruption rate of $1-1/q$ such that no scheme with soundness, even relaxed to allow a fixed constant false positive probability on random content, can reliably detect tampering once an adversary can change more than this fraction of symbols. In particular, in the binary case no cryptographic watermark can remain robust if more than half of the encoded bits are modified. We also show that this threshold is tight by giving simple information-theoretic constructions that achieve soundness and tamper detection for all strictly smaller corruption rates. We then test experimentally whether this limit appears in practice by looking at the recent watermarking for images of Gunn, Zhao, and Song (ICLR 2025). We show that a simple crop and resize operation reliably flipped about half of the latent signs and consistently prevented belief-propagation decoding from recovering the codeword, erasing the watermark while leaving the image visually intact.
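One way to see why $1-1/q$ is a natural threshold, offered as intuition rather than as the paper's proof: an adversary who discards the codeword and outputs fresh uniform symbols produces a word that is independent of the codeword, yet it differs from the codeword in only about a $1-1/q$ fraction of positions. The sketch below checks that fraction empirically; the function names and word length are illustrative assumptions.

```python
import random

def fraction_changed(codeword, word):
    """Fraction of positions where the two words differ."""
    return sum(a != b for a, b in zip(codeword, word)) / len(codeword)

def resample_attack(codeword, q, rng):
    """Ignore the codeword and emit uniform symbols from an alphabet of size q."""
    return [rng.randrange(q) for _ in codeword]

rng = random.Random(0)
n = 20000  # word length, chosen only to make the empirical fraction stable
for q in (2, 4, 16):
    codeword = [rng.randrange(q) for _ in range(n)]
    attacked = resample_attack(codeword, q, rng)
    # Empirical fraction of changed symbols next to the threshold 1 - 1/q
    print(q, round(fraction_changed(codeword, attacked), 3), 1 - 1 / q)
```

Because the attacked word carries no information about the codeword, a detector cannot tell it apart from random content. This matches the binary case in the abstract, where modifying about half of the bits is enough.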

2509.10468 2026-05-05 cs.IR cs.AI cs.CL

Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation

Yifan Liu, Yaokun Liu, Zelin Li, Zhenrui Yue, Gyuseok Lee, Ruichen Yao, Yang Zhang, Dong Wang

Comments Accepted to SIGIR 2026. Full author version

Details
Abstract

Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and then large language models (LLMs) are trained to generate the next item via sequence-to-sequence modeling. However, these two stages are optimized for different objectives: semantic reconstruction during tokenizer pretraining versus user interaction modeling during recommender training. This objective misalignment leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to reflect diverse usage contexts; and (ii) discarded pretrained semantics, where pretrained knowledge - typically from language model embeddings - is overwritten during recommender training on user interactions. To address these limitations, we propose to learn $\underline{DE}$composed $\underline{CO}$ntextual Token $\underline{R}$epresentations (DECOR), a unified framework that preserves pretrained semantics while enhancing the adaptability of token embeddings. DECOR introduces contextualized token composition to refine token embeddings based on user interaction context, and decomposed embedding fusion that integrates pretrained codebook embeddings with newly learned collaborative embeddings. Experiments on three real-world datasets demonstrate that DECOR consistently outperforms state-of-the-art baselines in recommendation performance.

2508.12674 2026-05-05 stat.ML cs.LG cs.SI

Unfolded Laplacian Spectral Embedding: A Theoretically Grounded Approach to Dynamic Network Representation

Haruka Ezoe, Hiroki Matsumoto, Ryohei Hisano

Details
Journal ref
43rd International Conference on Machine Learning (ICML 2026)
Abstract

Dynamic relational data arise in many machine learning applications, yet their evolving structure poses challenges for learning representations that remain consistent and interpretable over time. A common approach is to learn time-varying node embeddings, whose usefulness depends on well-defined stability properties across nodes and across time. We introduce Unfolded Laplacian Spectral Embedding (ULSE), a principled extension of unfolded adjacency spectral embedding to normalized Laplacian operators, a setting where stability guarantees have remained out of reach. We prove that ULSE satisfies both cross-sectional and longitudinal stability under a dynamic stochastic block model. Moreover, the Laplacian formulation yields a dynamic Cheeger-type inequality linking the spectrum of the unfolded normalized Laplacian to worst-case conductance over time, providing structural insight into the embeddings. Empirical results on synthetic and real-world dynamic networks validate the theory.

2508.09179 2026-05-05 eess.IV cs.CV

HiFi-Mamba: Dual-Stream W-Laplacian Enhanced Mamba for High-Fidelity MRI Reconstruction

Hongli Chen, Pengcheng Fang, Yuxia Chen, Yingxuan Ren, Jing Hao, Fangfang Tang, Xiaohao Cai, Shanshan Shan, Feng Liu

Details
Abstract

Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce High-Fidelity Mamba (HiFi-Mamba), a novel dual-stream Mamba-based architecture comprising stacked W-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.

2506.24056 2026-05-05 cs.CR cs.CL cs.LG

Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

Tung-Ling Li, Hongliang Liu

Details
Abstract

RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference between the top refusal-token logit and the top affirmative-token logit at the first decoding step. This single scalar quantifies the per-prompt safety margin that alignment provides. Empirically, alignment widens the gap on 97.5-99.8% of toxic prompts across three model families, and median gap closure co-varies with True-ASR ranking across suffix strategies (an internal consistency check, since our method optimises gap closure). To validate the metric's practical significance, we present logit-gap steering, a gradient-free, forward-pass-only method that discovers short in-distribution suffixes ($<$10 tokens per component) whose cumulative effect closes the gap. The method requires ${\approx}26{,}000$ forward-pass equivalents per family (${\approx}2$ min on one A100), ${\approx}125\times$ less than a single GCG search. Suffixes discovered on 0.5B to 2B models transfer without modification to 72B within family. An 8-suffix ensemble reaches 38-96% True ASR across 13 models on AdvBench and HarmBench, with most suffixes having $10^{3}$-$10^{4}\times$ lower perplexity than GCG, meaning that published perplexity-filter defenses that collapse GCG (64.7%$\to$1.0%) leave our suffixes nearly intact (76.9%$\to$76.0%). These results demonstrate that current alignment margins, while consistently present, can be thin and efficiently measurable, and that defense strategies must account for in-distribution suffixes.
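The gap defined in the abstract is a single subtraction over first-step logits. A minimal sketch follows, assuming the refusal and affirmative token sets are already known; the token ids, vocabulary size, and logit values are made up for illustration and do not come from the paper.

```python
def refusal_affirmation_gap(logits, refusal_ids, affirmative_ids):
    """Top refusal-token logit minus top affirmative-token logit."""
    top_refusal = max(logits[i] for i in refusal_ids)
    top_affirmative = max(logits[i] for i in affirmative_ids)
    return top_refusal - top_affirmative

# Hypothetical 6-token vocabulary: logits at the first decoding step
first_step_logits = [2.0, -1.0, 5.5, 0.3, 3.1, -0.7]
REFUSAL_IDS = [2, 3]       # assumed ids of refusal-style tokens
AFFIRMATIVE_IDS = [0, 4]   # assumed ids of affirmative-style tokens

gap = refusal_affirmation_gap(first_step_logits, REFUSAL_IDS, AFFIRMATIVE_IDS)
print(gap)  # positive: the best refusal token outscores the best affirmative token
```

Under this reading, a suffix "closes the gap" when it drives the value toward zero or below, so that an affirmative token becomes the most likely first token.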

2503.09559 2026-05-05 eess.IV cs.CV cs.LG eess.SP

Interlaced R2D2 DNN Series for Scalable Non-Cartesian MRI with Sensitivity Self-calibration

Shijie Chen, Yiwei Chen, Amir Aghabiglou, Motahare Torki, Chao Tang, Ruud B. van Heeswijk, Yves Wiaux

Comments 13 pages, 8 figures

Details
Abstract

We introduce interlaced R2D2 (iR2D2), a DNN series paradigm for scalable image reconstruction from accelerated non-Cartesian k-space acquisitions in MRI with sensitivity map self-calibration. While unrolled DNN architectures provide robust image formation, embedding non-uniform fast Fourier transform operators within the backpropagation graph becomes impractical to train at large scale, e.g., in 2D MRI with a large number of coils, or for higher-dimensional imaging. To address this scalability challenge, we leverage the R2D2 paradigm as a learned version of the Matching Pursuit algorithm that was recently introduced in radio astronomy for fast large-scale Fourier imaging. R2D2's reconstruction is formed as a series of residual images iteratively estimated as outputs of DNN modules taking the previous iteration's data residual as input. Specific to MRI, precomputed sensitivity maps derived from undersampled data can yield an inaccurate measurement operator, which may adversely affect the performance of iterative algorithms such as R2D2. Thus, we extend the R2D2 framework to iR2D2 by introducing a bespoke interlaced architecture that alternates between two R2D2 DNN series to jointly self-calibrate sensitivity maps and form the MR image. We further enhance iR2D2 to operate as an adaptive solver governed by an error-controlled update condition that enforces a sufficient residual energy descent, a dynamic capability fundamentally incompatible with the predefined forward passes of unrolled architectures. Extensive experiments in simulation and on real data, targeting undersampled radial k-space sampling, demonstrate that iR2D2 significantly improves upon R2D2 and outperforms state-of-the-art benchmarks, delivering scalable, high-fidelity imaging with corrected sensitivity profiles.
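The series structure described above (each term estimated from the previous data residual, with an update accepted only if residual energy drops enough) can be sketched on a toy linear problem. This is a heavily simplified illustration: the trained DNN modules are replaced by a hand-written scaled back-projection, there is no NUFFT or coil sensitivity model, and the step size and descent threshold are hypothetical values.

```python
def matvec(A, x):
    """Apply the measurement operator A to x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Apply the adjoint (transpose) of A to a data-domain vector r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def energy(r):
    return sum(v * v for v in r)

def stand_in_module(A, residual, step=0.2):
    """Hypothetical placeholder for a trained DNN module: scaled back-projection."""
    return [step * v for v in rmatvec(A, residual)]

def residual_series(A, y, n_terms=50, descent=1e-3):
    """Build the estimate as a sum of residual-driven updates."""
    x = [0.0] * len(A[0])
    r = list(y)
    history = [energy(r)]
    for _ in range(n_terms):
        update = stand_in_module(A, r)
        x_new = [xi + ui for xi, ui in zip(x, update)]
        r_new = [yi - vi for yi, vi in zip(y, matvec(A, x_new))]
        # Error-controlled acceptance: stop unless residual energy drops enough
        if energy(r_new) > (1.0 - descent) * energy(r):
            break
        x, r = x_new, r_new
        history.append(energy(r))
    return x, history

A = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy measurement operator
x_true = [1.0, -2.0]
y = matvec(A, x_true)
x_hat, history = residual_series(A, y)
print(x_hat, history[-1])
```

Because the number of accepted terms depends on the data, the loop acts as an adaptive solver, in contrast to an architecture with a fixed number of forward passes.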

2501.10859 2026-05-05 eess.SY cs.LG cs.SY math.OC

What price to pay? Auto-tuning a building MPC controller for optimal economic cost

Jiarui Yu, Jicheng Shi, Wenjie Xu, Colin N. Jones

Comments 11 pages, 5 figures

Details
Journal ref
Control Engineering Practice 174 107023 (2026)
Abstract

Demand-side management (DSM) programs introduce complex pricing, requiring advanced control for cost minimization. Model Predictive Control (MPC) offers a solution but its performance hinges on appropriate hyperparameter tuning. We propose using Constrained Bayesian Optimization (CONFIG) to automate this process. In a case study, our optimized MPC reduced electricity costs by 26.90% compared to a rule-based controller and by 17.46% versus a manually tuned MPC. Analysis of real contracts further showed that optimal DSM program selection can lower monthly bills by up to 20.18%, demonstrating a data-driven path to significant consumer savings.

2412.02408 2026-05-05 cs.SI cs.LG q-fin.GN

Leveraging Ensemble-Based Semi-Supervised Learning for Illicit Account Detection in Ethereum DeFi Transactions

Shabnam Fazliani, Mohammad Mowlavi Sorond, Arsalan Masoudifard

Comments 23 pages, 12 figures

Details
Abstract

The advent of smart contracts has enabled the rapid rise of Decentralized Finance (DeFi) on the Ethereum blockchain, offering substantial rewards in financial innovation and inclusivity. This growth, however, is accompanied by significant security risks such as illicit accounts engaged in fraud. Effective detection is further limited by the scarcity of labeled data and the evolving tactics of malicious accounts. To address these challenges with a robust solution for safeguarding the DeFi ecosystem, we propose $\textbf{SLEID}$, a $\textbf{S}$elf-$\textbf{L}$earning $\textbf{E}$nsemble-based $\textbf{I}$llicit account $\textbf{D}$etection framework. SLEID uses an Isolation Forest model for initial outlier detection and a self-training mechanism to iteratively generate pseudo-labels for unlabeled accounts, enhancing detection accuracy. Experiments on 6,903,860 Ethereum transactions with extensive DeFi interaction coverage demonstrate that SLEID significantly outperforms supervised and semi-supervised baselines with $\textbf{+2.56}$ percentage-point precision, comparable recall, and $\textbf{+0.90}$ percentage-point F1 -- particularly for the minority illicit class -- alongside $\textbf{+3.74}$ percentage-points higher accuracy and improvements in PR-AUC, while substantially reducing reliance on labeled data.

2408.01914 2026-05-05 math.NA cs.AI cs.NA

Partial-differential-algebraic equations of nonlinear dynamics by Physics-Informed Neural-Network: (I) Operator splitting and framework assessment

Loc Vu-Quoc, Alexander Humer

Comments 70 pages, 52 figures

Details
Journal ref
International Journal for Numerical Methods in Engineering, 2024;e7586
Abstract

Several forms for constructing novel physics-informed neural-networks (PINN) for the solution of partial-differential-algebraic equations based on derivative operator splitting are proposed, using the nonlinear Kirchhoff rod as a prototype for demonstration. The open-source DeepXDE (DDE) is likely the best-documented framework, with many examples. Yet, we encountered some pathological problems and proposed novel methods to resolve them. Among these novel methods are the PDE forms, which evolve from the lower-level form with fewer unknown dependent variables to higher-level form with more dependent variables, in addition to those from lower-level forms. Traditionally, the highest-level form, the balance-of-momenta form, is the starting point for (hand) deriving the lowest-level form through a tedious (and error-prone) process of successive substitutions. The next step in a finite element method is to discretize the lowest-level form upon forming a weak form and linearization with appropriate interpolation functions, followed by their implementation in a code and testing. The time-consuming tedium in all of these steps could be bypassed by applying the proposed novel PINN directly to the highest-level form. We developed a script based on JAX. While our JAX script did not show the pathological problems of DDE-T (DDE with TensorFlow backend), it is slower than DDE-T. That DDE-T itself is more efficient in higher-level form than in lower-level form makes working directly with higher-level form even more attractive, in addition to the advantages mentioned above. Since coming up with an appropriate learning-rate schedule for a good solution is more art than science, we systematically codified in detail our experience running optimization through a normalization/standardization of the network-training process so readers can reproduce our results.