arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1618
专题追踪
2603.02345 2026-03-04 cs.SE cs.AI cs.MA

RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection

Sami Abuzakuk, Lucas Crijns, Anne-Marie Kermarrec, Rafael Pires, Martijn de Vos

详情
英文摘要

Infrastructure as code (IaC) tools automate cloud provisioning but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model (LLM)-based agentic AI systems can automate the analysis of large volumes of telemetry data, making them suitable for the detection of configuration drift. However, existing agentic systems implicitly assume that the tools they invoke always return correct outputs, making them vulnerable to erroneous tool responses. Since agents cannot distinguish whether an anomalous tool output reflects a real infrastructure problem or a broken tool, such errors may cause missed drift or false alarms, reducing reliability precisely when it is most needed. We introduce RIVA (Robust Infrastructure by Verification Agents), a novel multi-agent system that performs robust IaC verification even when tools produce incorrect or misleading outputs. RIVA employs two specialized agents, a verifier agent and a tool generation agent, that collaborate through iterative cross-validation, multi-perspective verification, and tool call history tracking. Evaluation on the AIOpsLab benchmark demonstrates that RIVA, in the presence of erroneous tool responses, recovers task accuracy from 27.3% when using a baseline ReAct agent to 50.0% on average. RIVA also improves task accuracy 28% to 43.8% without erroneous tool responses. Our results show that cross-validation of diverse tool calls enables more reliable autonomous infrastructure verification in production cloud environments.

2603.02331 2026-03-04 econ.GN cs.LG q-fin.EC

Neural Demand Estimation with Habit Formation and Rationality Constraints

Marta Grzeskiewicz

详情
英文摘要

We develop a flexible neural demand system for continuous budget allocation that estimates budget shares on the simplex by minimizing KL divergence. Shares are produced via a softmax of a state-dependent preference scorer and disciplined with regularity penalties (monotonicity, Slutsky symmetry) to support coherent comparative statics and welfare without imposing a parametric utility form. State dependence enters through a habit stock defined as an exponentially weighted moving average of past consumption. Simulations recover elasticities and welfare accurately and show sizable gains when habit formation is present. In our empirical application using Dominick's analgesics data, adding habit reduces out-of-sample error by c.33%, reshapes substitution patterns, and increases CV losses from a 10% ibuprofen price rise by about 15-16% relative to a static model. The code is available at https://github.com/martagrz/neural_demand_habit .

2603.02297 2026-03-04 cs.CR cs.AI

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Nancy Lau, Louis Sloot, Jyoutir Raj, Giuseppe Marco Boscardin, Evan Harris, Dylan Bowman, Mario Brajkovski, Jaideep Chawla, Dan Zhao

Comments Accepted to ICLR 2026 Workshop "Agents in the Wild"

详情
英文摘要

Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.

2603.02294 2026-03-04 eess.IV cs.CV

Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification

Nikhileswara Rao Sulake

Comments This paper would be a part of the CXR Long Tail Challenge in ISBI 2026. This is my team report of it's work during the challenge

详情
英文摘要

Long-tailed class distributions pose a significant challenge for multi-label chest X-ray (CXR) classification, where rare but clinically important findings are severely underrepresented. In this work, we present a systematic empirical evaluation of loss functions, CNN backbone architectures and post-training strategies on the CXR-LT 2026 benchmark, comprising approximately 143K images with 30 disease labels from PadChest. Our experiments demonstrate that LDAM with deferred re-weighting (LDAM-DRW) consistently outperforms standard BCE and asymmetric losses for rare class recognition. Amongst the architectures evaluated, ConvNeXt-Large achieves the best single-model performance with 0.5220 mAP and 0.3765 F1 on our development set, whilst classifier re-training and test-time augmentation further improve ranking metrics. On the official test leaderboard, our submission achieved 0.3950 mAP, ranking 5th amongst all 68 participating teams with total of 1528 submissions. We provide a candid analysis of the development-to-test performance gap and discuss practical insights for handling class imbalance in clinical imaging settings. Code is available at https://github.com/Nikhil-Rao20/Long_Tail.

2603.02289 2026-03-04 stat.ME cs.LG stat.ML

Topological Causal Effects

Kwangho Kim, Hajin Lee

Journal ref Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
英文摘要

Estimating causal effects is particularly challenging when outcomes arise in complex, non-Euclidean spaces, where conventional methods often fail to capture meaningful structural variation. We develop a framework for topological causal inference that defines treatment effects through differences in the topological structure of potential outcomes, summarized by power-weighted silhouette functions of persistence diagrams. We develop an efficient, doubly robust estimator in a fully nonparametric model, establish functional weak convergence, and construct a formal test of the null hypothesis of no topological effect. Empirical studies illustrate that the proposed method reliably quantifies topological treatment effects across diverse complex outcome types.

2603.02277 2026-03-04 cs.CR cs.AI

Quantifying Frontier LLM Capabilities for Container Sandbox Escape

Rahul Marchand, Art O Cathain, Jerome Wynne, Philippos Maximos Giavridis, Sam Deverett, John Wilkinson, Jason Gwartz, Harry Coppock

详情
英文摘要

Large language models (LLMs) increasingly act as autonomous agents, using tools to execute code, read and write files, and access networks, creating novel security risks. To mitigate these risks, agents are commonly deployed and evaluated in isolated "sandbox" environments, often implemented using Docker/OCI containers. We introduce SANDBOXESCAPEBENCH, an open benchmark that safely measures an LLM's capacity to break out of these sandboxes. The benchmark is implemented as an Inspect AI Capture the Flag (CTF) evaluation utilising a nested sandbox architecture with the outer layer containing the flag and no known vulnerabilities. Following a threat model of a motivated adversarial agent with shell access inside a container, SANDBOXESCAPEBENCH covers a spectrum of sandboxescape mechanisms spanning misconfiguration, privilege allocation mistakes, kernel flaws, and runtime/orchestration weaknesses. We find that, when vulnerabilities are added, LLMs are able to identify and exploit them, showing that use of evaluation like SANDBOXESCAPEBENCH is needed to ensure sandboxing continues to provide the encapsulation needed for highly-capable models.

2603.02271 2026-03-04 cs.PF cs.AI cs.AR cs.RO

Characterizing VLA Models: Identifying the Action Generation Bottleneck for Edge AI Architectures

Manoj Vishwanathan, Suvinay Subramanian, Anand Raghunathan

Comments 3 Pages 4 Figures for Workshop paper

详情
英文摘要

Vision-Language-Action (VLA) models are an emerging class of workloads critical for robotics and embodied AI at the edge. As these models scale, they demonstrate significant capability gains, yet they must be deployed locally to meet the strict latency requirements of real-time applications. This paper characterizes VLA performance on two generations of edge hardware, viz. the Nvidia Jetson Orin and Thor platforms. Using MolmoAct-7B, a state-of-the-art VLA model, we identify a primary execution bottleneck: up to 75% of end-to-end latency is consumed by the memory-bound action-generation phase. Through analytical modeling and simulations, we project the hardware requirements for scaling to 100B parameter models. We also explore the impact of high-bandwidth memory technologies and processing-in-memory (PIM) as promising future pathways in edge systems for embodied AI.

2603.02262 2026-03-04 cs.CR cs.AI cs.LG

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Jingyuan Xie, Wenjie Wang, Ji Wu, Jiandong Gao

详情
英文摘要

Supervised fine-tuning (SFT) is essential for the development of medical large language models (LLMs), yet prior poisoning studies have mainly focused on the detectable backdoor attacks. We propose a novel poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model performance on targeted medical topics. Results showed that knowledge overwriting was ineffective, while rationale poisoning caused significant decline on the accuracy of the target subject, as long as no correct samples of the same subject appear in the dataset. A minimum number and ratio of poisoned samples was needed to carry out an effective and stealthy attack, which was more efficient and accurate than catastrophic forgetting. We demonstrate though this study the risk of SFT-stage poisoning, hoping to spur more studies of defense in the sensitive medical domain.

2603.02261 2026-03-04 quant-ph cs.LG

Quantum AS-DeepOnet: Quantum Attentive Stacked DeepONet for Solving 2D Evolution Equations

Hongquan Wang, Hanshu Chen, Ilia Marchevsky, Zhuojia Fu

详情
英文摘要

DeepONet enables retraining-free inference across varying initial conditions or source terms at the cost of high computational requirements. This paper proposes a hybrid quantum operator network (Quantum AS-DeepOnet) suitable for solving 2D evolution equations. By combining Parameterized Quantum Circuits and cross-subnet attention methods, we can solve 2D evolution equations using only 60% of the trainable parameters while maintaining accuracy and convergence comparable to the classical DeepONet method.

2603.02248 2026-03-04 cs.DB cs.CL cs.IR cs.LG

HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval

Sungho Park, Joohyung Yun, Jongwuk Lee, Wook-Shin Han

Comments 9 pages, 6 figures. Accepted at ACL 2025 main. Project page: https://helios-projectpage.github.io/

Journal ref Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32424-32444, July 2025

详情
英文摘要

Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming "stars," which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose HELIOS, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star graph level rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that HELIOS outperforms state-of-the-art models with a significant improvement up to 42.6\% and 39.9\% in recall and nDCG, respectively, on the OTT-QA benchmark.

2603.02247 2026-03-04 eess.AS cs.LG cs.SD

OnDA: On-device Channel Pruning for Efficient Personalized Keyword Spotting

Matteo Risso, Alessio Burrello, Daniele Jahier Pagliari

Comments Submitted for review at Interspeech2026

详情
英文摘要

Always-on keyword spotting (KWS) demands on-device adaptation to cope with user- and environment-specific distribution shifts under tight latency and energy budgets. This paper proposes, for the first time, coupling weight adaptation (i.e., on-device training) with architectural adaptation, in the form of online structured channel pruning, for personalized on-device KWS. Starting from a state-of-the-art self-learning personalized KWS pipeline, we compare data-agnostic and data-aware pruning criteria applied on in-field pseudo-labelled user data. On the HeySnips and HeySnapdragon datasets, we achieve up to 9.63x model-size compression with respect to unpruned baselines at iso-task performance, measured as the accuracy at 0.5 false alarms per hour. When deploying our adaptation pipeline on a Jetson Orin Nano embedded GPU, we achieve up to 1.52x/1.57x and 1.64x/1.77x latency and energy-consumption improvements during online training/inference compared to weights-only adaptation.

2603.02246 2026-03-04 eess.AS cs.SD

Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs

Marcin Pietroń, Szymon Piórkowski, Kamil Faber, Dominik Żurek, Michał Karwatowski, Jerzy Duda, Hubert Zieliński, Piotr Lipnicki, Mikołaj Leszczuk

详情
英文摘要

This article concerns comparative studies on the Automatic Speech Recognition (ASR) model incorporated with the Large Language Model (LLM) used for medical interviews. The proposed solution is tested on polish language benchmarks and dataset with medical interviews. The latest ASR technologies are based on convolutional neural networks (CNNs), recurrent neural networks (RNNs) and Transformers. Most of them work as end-to-end solutions. The presented approach in the case of the Whisper model shows a two-stage solution with End-To-End ASR and LLM working together in a pipeline. The ASR output is an input for LLM. The LLM is a component by which the output from ASR is corrected and improved. Comparative studies for automatic recognition of the Polish language between modern End-To-End deep learning architectures and the ASR hybrid model were performed. The medical interview tests were performed with two state-of-the-art ASR models: OpenAI Whisper incorporated with LLM and Scribe ElevenLabs. Additionally, the results were compared with five more end-to-end models (QuartzNet, FastConformer, Wav2Vec 2.0 XLSR and ESPnet Model Zoo) on Mozilla Common Voice and VoxPopuli databases. Tests were conducted for clean audio signal, signal with bandwidth limitation, and degraded. The tested models were evaluated on the basis of Word Error Rate (WER) and Character Error Rate (CER). The results show that the Whisper model performs by far the best among the open-source models. ElevenLabs Scribe model, on the other hand, performs best for Polish on both general benchmark and medical data.

2603.02241 2026-03-04 q-bio.NC cs.AI

A Benchmark Analysis of Graph and Non-Graph Methods for Caenorhabditis Elegans Neuron Classification

Jingqi Lu, Keqi Han, Yun Wang, Lu Mi, Carl Yang

详情
英文摘要

This study establishes a benchmark for Caenorhabditis elegans neuron classification, comparing four graph methods (GCN, GraphSAGE, GAT, GraphTransformer) against four non-graph methods (Logistic Regression, MLP, LOLCAT, NeuPRINT). Using the functional connectome, we classified Sensory, Interneuron, and Motor neurons based on Spatial, Connection, and Neuronal Activity features. Results show that attention-based GNNs significantly outperform baselines on the Spatial and Connection features. The Neuronal Activity features yielded poor performance, likely due to the low temporal resolution of the underlying neuronal activity data. Our benchmark validates the use of GNNs and highlights that Spatial and Connection features are key predictors for Caenorhabditis elegans neuron classes. Code is available at: https://github.com/JingqiLuu/neuronclf-gnn-benchmark.

2603.02212 2026-03-04 cs.DB cs.AI

GLEAN: Grounded Lightweight Evaluation Anchors for Contamination-Aware Tabular Reasoning

Qizhi Wang

Comments 8 pages, 6 figures for the main paper

详情
英文摘要

Tabular reasoning benchmarks mix semantic inference, numerical computation, and brittle table formatting, yet evaluations for small models remain vulnerable to contamination, dataset artifacts, and retrieval failures. We propose GLEAN, a lightweight evaluation protocol that integrates contamination-aware probes, weak-supervision governance, retrieval-reasoning diagnostics, and structured error attribution under tight hardware constraints. We evaluate across TabFact, WTQ via Squall, TableBench, RobuT, and SciTab under a 16GB GPU budget. Using Squall gold SQL as an executable anchor (95.2% execution), GLEAN assigns a deterministic error taxonomy (L0-L4 plus L0.5 context miss) and reveals a stable error-mode separation: TAPEX errors skew toward grounding (L3) while TAPAS errors skew toward hallucination/abstention (L2/L0). We validate evidence-row heuristics against SQL-derived rows on simple queries (0.62 precision / 0.71 recall; hybrid recall 0.81) and show that retrieval Recall@K can saturate even when end-to-end EM/F1 remains limited, motivating attribution beyond raw recall. We release a modular framework with audits and sensitivity checks to make small-model tabular evaluation more contamination-aware and diagnostic.

2603.00218 2026-03-04 eess.IV cs.CV

GLIDE-Reg: Global-to-Local Deformable Registration Using Co-Optimized Foundation and Handcrafted Features

Yunzheng Zhu, Aichi Chien, Kimaya kulkarni, Luoting Zhuang, Stephen Park, Ricky Savjani, Daniel Low, William Hsu

详情
英文摘要

Deformable registration is crucial in medical imaging. Several existing applications include lesion tracking, probabilistic atlas generation, and treatment response evaluation. However, current methods often lack robustness and generalizability across two key factors: spatial resolution and differences in anatomical coverage. We jointly optimize a registration field and a learnable dimensionality reduction module so that compressed VFM embeddings remain registration-relevant, and fuse these global semantic cues with MIND local descriptors. GLIDE-Reg achieves average dice similarity coefficients (DSC) across 6 anatomical structures of 0.859, 0.862, and 0.901 in two public cohorts (Lung250M and NLST) and one institution cohort (UCLA5DCT), and outperforms the state-of-the-art DEEDS (0.834, 0.858, 0.900) with relative improvements of 3.0%, 0.5%, and 0.1%. For target registration errors, GLIDE-Reg achieves 1.58 mm on Lung250M landmarks (compared to 1.25 mm on corrField and 1.91 mm on DEEDS) and 1.11 mm on NLST nodule centers (compared to 1.11 mm on DEEDS). The substantiated performance on the nodule centers also demonstrates its robustness across challenging downstream tasks, such as nodule tracking, which is an essential prior step for early-stage lung cancer diagnosis.

2603.00169 2026-03-04 eess.SY cs.RO cs.SY

Design Framework and Manufacturing of an Active Magnetic Bearing Spindle for Micro-Milling Applications

Kazi Sher Ahmed, Bekir Bediz

详情
英文摘要

Micro-milling spindles require high rotational speeds where conventional rolling element bearings face limitations such as friction and thermal expansion. Active magnetic bearings (AMBs) address these challenges by providing non-contact and lubrication-free operation at ultra-high speeds with the ability to actively regulate spindle dynamics. The existing literature on AMB spindles has mainly reported specific prototype realizations or control system implementations for specific spindle dynamics. Consequently, design knowledge remains fragmented across isolated successful studies. This paper addresses this gap by presenting a systematic and iterative framework to design and manufacture a micro-milling AMB spindle. The process involves a multidisciplinary design flow with a focus on critical practical aspects of manufacturing. The realized spindle is reported as a case study.

2603.00084 2026-03-04 cs.DL cs.AI cs.CL cs.IR

DeepXiv-SDK: An Agentic Data Interface for Scientific Literature

Hongjin Qian, Ziyi Xia, Ze Liu, Jianlyu Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Junwei Lan, Sen Wang, Zhengyang Liang, Yingxia Shao, Defu Lian, Zheng Liu

Comments Project at https://github.com/DeepXiv/deepxiv_sdk

详情
英文摘要

LLM-agents are increasingly used to accelerate the progress of scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstrcutured, human-centric data on the Internet, such as HTML web-pages and PDF files, leading to excessive token consumption, limit working efficiency, and brittle evidence look-up. This gap motivates the development of \textit{an agentic data interface}, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost-aware manner. In this paper, we introduce DeepXiv-SDK, which offers a three-layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human-centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad-hoc retrieval. It also enables a rich form of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which creates a built-in agent, packaging basic tools from the service layer to support complex data access demands. DeepXiv-SDK currently supports the complete ArXiv corpus, and is synchronized daily to incorporate new releases. It is designed to extend to all common open-access corpora, such as PubMed Central, bioRxiv, medRxiv, and chemRxiv. We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv-SDK is free to use with registration.

2603.00047 2026-03-04 econ.EM cs.AI cs.LG math.OC

What Is the Alignment Tax?

Robin Young

详情
英文摘要

The alignment tax is widely discussed but has not been formally characterized. We provide a geometric theory of the alignment tax in representation space. Under linear representation assumptions, we define the alignment tax rate as the squared projection of the safety direction onto the capability subspace and derive the Pareto frontier governing safety-capability tradeoffs, parameterized by a single quantity of the principal angle between the safety and capability subspaces. We prove this frontier is tight and show it has a recursive structure. safety-safety tradeoffs under capability constraints are governed by the same equation, with the angle replaced by the partial correlation between safety objectives given capability directions. We derive a scaling law decomposing the alignment tax into an irreducible component determined by data structure and a packing residual that vanishes as $O(m'/d)$ with model dimension $d$, and establish conditions under which capability preservation mediates or resolves conflicts between safety objectives.

2602.22903 2026-03-04 cs.IR cs.LG

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

Yunpeng Hong, Chenyang Bu, Jie Zhang, Yi He, Di Wu, Xindong Wu

Comments 2026 SIGKDD Accept

详情
英文摘要

Multimodal Entity Alignment (MMEA) aims to identify equivalent entities across different data modalities, enabling structural data integration that in turn improves the performance of various large language model applications. To lift the requirement of labeled seed pairs that are difficult to obtain, recent methods shifted to an unsupervised paradigm using pseudo-alignment seeds. However, unsupervised entity alignment in multimodal settings remains underexplored, mainly because the incorporation of multimodal information often results in imbalanced coverage of pseudo-seeds within the knowledge graph. To overcome this, we propose PSQE (Pseudo-Seed Quality Enhancement) to improve the precision and graph coverage balance of pseudo seeds via multimodal information and clustering-resampling. Theoretical analysis reveals the impact of pseudo seeds on existing contrastive learning-based MMEA models. In particular, pseudo seeds can influence the attraction and the repulsion terms in contrastive learning at once, whereas imbalanced graph coverage causes models to prioritize high-density regions, thereby weakening their learning capability for entities in sparse regions. Experimental results validate our theoretical findings and show that PSQE as a plug-and-play module can improve the performance of baselines by considerable margins.

2602.18962 2026-03-04 cs.HC cs.AI cs.CY cs.IR cs.MA

NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners

Albert Tang, Yifan Mo, Jie Li, Yue Su, Mengyuan Zhang, Sander L. Koole, Koen Hindriks, Jiahuan Pei

Comments Accepted to ACM CHI 2026

详情
英文摘要

The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between-subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition-time effect on deficit-based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI-based interpretation can support attributional change by helping users recognize communication challenges as mutual.

2602.12410 2026-03-04 eess.IV cs.CV q-bio.NC

Proceedings for the Inaugural Meeting of the International Society for Tractography -- IST 2025 Bordeaux

Flavio Dell Acqua, Maxime Descoteaux, Graham Little, Laurent Petit, Dogu Baran Aydogan, Stephanie Forkel, Alexander Leemans, Simona Schiavi, Michel Thiebaut de Schotten

Comments Proceedings of the Inaugural Conference of the International Society for Tractography (IST Conference 2025). Held at the Institut des Maladies Neurodégénératives in Bordeaux, France, October 13-16, 2025. Society website: https://www.tractography.io

详情
英文摘要

This collection comprises the abstracts presented during poster, power pitch and oral sessions at the Inaugural Conference of the International Society for Tractography (IST Conference 2025), held in Bordeaux, France, from October 13-16, 2025. The conference was designed to foster meaningful exchange and collaboration between disparate fields. The overall focus was on advancing research, innovation, and community in the common fields of interest: neuroanatomy, tractography methods and scientific/clinical applications of tractography. The included abstracts cover the latest advancements in tractography, Diffusion MRI, and related fields including new work on; neurological and psychiatric disorders, deep brain stimulation targeting, and brain development. This landmark event brought together world-leading experts to discuss critical challenges and chart the future direction of the field.

2601.20346 2026-03-04 cs.CR cs.AI cs.LG

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

Asifullah Khan, Aimen Wadood, Mubashar Iqbal, Umme Zahoora

Comments 46 pages, 11 figures and 10 tables

详情
英文摘要

Ransomware has become one of the most serious cybersecurity threats causing major financial losses and operational disruptions worldwide.Traditional detection methods such as static analysis, heuristic scanning and behavioral analysis often fall short when used alone. To address these limitations, this paper presents multimodal multi agent ransomware analysis framework designed for ransomware classification. Proposed multimodal multiagent architecture combines information from static, dynamic and network sources. Each data type is handled by specialized agent that uses auto encoder based feature extraction. These representations are then integrated through a fusion agent. After that fused representation are used by transformer based classifier. It identifies the specific ransomware family. The agents interact through an interagent feedback mechanism that iteratively refines feature representations by suppressing low confidence information. The framework was evaluated on large scale datasets containing thousands of ransomware and benign samples. Multiple experiments were conducted on ransomware dataset. It outperforms single modality and nonadaptive fusion baseline achieving improvement of up to 0.936 in Macro-F1 for family classification and reducing calibration error. Over 100 epochs, the agentic feedback loop displays a stable monotonic convergence leading to over +0.75 absolute improvement in terms of agent quality and a final composite score of around 0.88 without fine tuning of the language models. Zeroday ransomware detection remains family dependent on polymorphism and modality disruptions. Confidence aware abstention enables reliable real world deployment by favoring conservativeand trustworthy decisions over forced classification. The findings indicate that proposed approach provides a practical andeffective path toward improving real world ransomware defense systems.

2601.14337 2026-03-04 eess.IV cs.CV

Unsupervised Deformable Image Registration with Local-Global Attention and Image Decomposition

Zhengyong Huang, Xingwen Sun, Xuting Chang, Ning Jiang, Yao Wang, Jianfei Sun, Hongbin Han, Yao Sui

详情
英文摘要

Deformable image registration is a critical technology in medical image analysis, with broad applications in clinical practice such as disease diagnosis, multi-modal fusion, and surgical navigation. Traditional methods often rely on iterative optimization, which is computationally intensive and lacks generalizability. Recent advances in deep learning have introduced attention-based mechanisms that improve feature alignment, yet accurately registering regions with high anatomical variability remains challenging. In this study, we proposed a novel unsupervised deformable image registration framework, LGANet++, which employs a novel local-global attention mechanism integrated with a unique technique for feature interaction and fusion to enhance registration accuracy, robustness, and generalizability. We evaluated our approach using five publicly available datasets, representing three distinct registration scenarios: cross-patient, cross-time, and cross-modal CT-MR registration. The results demonstrated that our approach consistently outperforms several state-of-the-art registration methods, improving registration accuracy by 1.39% in cross-patient registration, 0.71% in cross-time registration, and 6.12% in cross-modal CT-MR registration tasks. These results underscore the potential of LGANet++ to support clinical workflows requiring reliable and efficient image registration. The source code is available at https://github.com/huangzyong/LGANet-Registration.

2512.02474 2026-03-04 cs.IR cs.AI

Q-BERT4Rec: Quantized Semantic-ID Representation Learning for Multimodal Recommendation

Haofeng Huang, Ling Gai

Comments Submitted to KDD2026

详情
英文摘要

Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users' next interactions is essential for personalization. Recent Transformer-based methods like BERT4Rec have shown strong modeling capability, yet they still rely on discrete item IDs that lack semantic meaning and ignore rich multimodal information (e.g., text and image). This leads to weak generalization and limited interpretability. To address these challenges, we propose Q-Bert4Rec, a multimodal sequential recommendation framework that unifies semantic representation and quantized modeling. Specifically, Q-Bert4Rec consists of three stages: (1) cross-modal semantic injection, which enriches randomly initialized ID embeddings through a dynamic transformer that fuses textual, visual, and structural features; (2) semantic quantization, which discretizes fused representations into meaningful tokens via residual vector quantization; and (3) multi-mask pretraining and fine-tuning, which leverage diverse masking strategies -- span, tail, and multi-region -- to improve sequential understanding. We validate our model on public Amazon benchmarks and demonstrate that Q-Bert4Rec significantly outperforms many strong existing methods, confirming the effectiveness of semantic tokenization for multimodal sequential recommendation. Our source code will be publicly available on GitHub after publishing.

2511.19476 2026-03-04 stat.ML cs.AI cs.LG

FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

Jin Cui, Boran Zhao, Jiajun Xu, Jiaqi Guo, Shuo Guan, Pengju Ren

Comments 20 pages, 17 figures

详情
英文摘要

Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are tied to model-specific parameters and introduce architectural bias; or (ii) DNN-free, which rely on heuristics lacking theoretical guarantees. Neither approach explicitly constrains distributional equivalence, largely because continuous distribution matching is considered inapplicable to discrete sampling. Moreover, prevalent metrics (e.g., MSE, KL, CE, MMD) cannot accurately capture higher-order moment discrepancies, leading to suboptimal coresets. In this work, we propose FAST, the first DNN-free distribution-matching coreset selection framework that formulates the coreset selection task as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information in the frequency domain. We further discover that naive CFD suffers from a "vanishing phase gradient" issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high, preserving global structure before refining local details and enabling accurate matching with fewer frequencies while avoiding overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2x average speedup, underscoring its high performance and energy efficiency.

2510.18560 2026-03-04 cs.SE cs.AI

WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Lihui Chen, Yangqiu Song, Han Hu

详情
英文摘要

The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios. Code and data are available at https://github.com/lcy2723/WebDevJudge.

2510.14686 2026-03-04 cs.DC cs.AI

xLLM Technical Report

Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Yitao Hu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang

Comments 39 pages

详情
英文摘要

We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address these challenges, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture to provide global KV Cache management and robust fault-tolerant capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode and an xTensor memory management. xLLM-Engine also further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, collectively serving to substantially boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical TPOT constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.

2510.14086 2026-03-04 cs.CR cs.AI

Every Language Model Has a Forgery-Resistant Signature

Matthew Finlayson, Xiang Ren, Swabha Swayamdipta

详情
英文摘要

The ubiquity of closed-weight language models with public-facing APIs has generated interest in forensic methods, both for extracting hidden model details (e.g., parameters) and for identifying models by their outputs. One successful approach to these goals has been to exploit the geometric constraints imposed by the language model architecture and parameters. In this work, we show that a lesser-known geometric constraint -- namely, that language model outputs lie on the surface of a high-dimensional ellipse -- functions as a signature for the model and can be used to identify the source model of a given output. This ellipse signature has unique properties that distinguish it from existing model-output association methods like language model fingerprints. In particular, the signature is hard to forge: without direct access to model parameters, it is practically infeasible to produce log-probabilities (logprobs) on the ellipse using currently known methods. Secondly, the signature is naturally occurring, since all language models have these elliptical constraints. Thirdly, the signature is self-contained, in that it is detectable without access to the model inputs or the full weights. Finally, the signature is compact and redundant, as it is independently detectable in each logprob output from the model. We evaluate a novel technique for extracting the ellipse from small models and discuss the practical hurdles that make it infeasible for production-scale models. Finally, we use ellipse signatures to propose a protocol for language model output verification, analogous to cryptographic symmetric-key message authentication systems.

2509.20508 2026-03-04 stat.ML cs.LG

Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances

Khai Nguyen, Hai Nguyen, Nhat Ho

Comments Accepted to ICLR 2026, 34 pages, 30 figures, 6 tables

详情
英文摘要

We address the problem of efficiently computing Wasserstein distances for multiple pairs of distributions drawn from a meta-distribution. To this end, we propose a fast estimation method based on regressing Wasserstein distance on sliced Wasserstein (SW) distances. Specifically, we leverage both standard SW distances, which provide lower bounds, and lifted SW distances, which provide upper bounds, as predictors of the true Wasserstein distance. To ensure parsimony, we introduce two linear models: an unconstrained model with a closed-form least-squares solution, and a constrained model that uses only half as many parameters. We show that accurate models can be learned from a small number of distribution pairs. Once estimated, the model can predict the Wasserstein distance for any pair of distributions via a linear combination of SW distances, making it highly efficient. Empirically, we validate our approach on diverse tasks, including Gaussian mixtures, point-cloud classification, and Wasserstein-space visualizations for 3D point clouds. Across various datasets such as MNIST point clouds, ShapeNetV2, MERFISH Cell Niches, and scRNA-seq, our method consistently provides a better approximation of Wasserstein distance than the state-of-the-art Wasserstein embedding model, Wasserstein Wormhole, particularly in low-data regimes. Finally, we demonstrate that our estimator can also accelerate Wormhole training, yielding \textit{RG-Wormhole}.

2507.08150 2026-03-04 stat.ML cs.LG stat.ME

CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk

Ilia Azizi, Juraj Bodik, Jakob Heiss, Bin Yu

Comments Project page: https://unco3892.github.io/clear/

详情
英文摘要

Accurate uncertainty quantification is critical for reliable predictive modeling. Existing methods typically address either aleatoric uncertainty due to measurement noise or epistemic uncertainty resulting from limited data, but not both in a balanced manner. We propose CLEAR, a calibration method with two distinct parameters, $γ_1$ and $γ_2$, to combine the two uncertainty components and improve the conditional coverage of predictive intervals for regression tasks. CLEAR is compatible with any pair of aleatoric and epistemic estimators; we show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability (PCS) framework for epistemic uncertainty. Across 17 diverse real-world datasets, CLEAR achieves an average improvement of 28.3\% and 17.5\% in the interval width compared to the two individually calibrated baselines while maintaining nominal coverage. Similar improvements are observed when applying CLEAR to Deep Ensembles (epistemic) and Simultaneous Quantile Regression (aleatoric). The benefits are especially evident in scenarios dominated by high aleatoric or epistemic uncertainty. Project page: https://unco3892.github.io/clear/