arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1832
2604.26997 2026-05-01 cs.CR cs.AI cs.MA

Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery, Identity, and Governance in Kubernetes

Akshay Mittal, Elyson De La Cruz

Comments 9 pages, 2 figures

详情
英文摘要

Autonomous AI agent ecosystems require stronger mechanisms for secure discovery, identity verification, capability attestation, and policy governance. Current deployments frequently lack (1) uniform agent discovery, (2) cryptographic agent authentication, (3) capability proofs that protect secrets, and (4) enforceable policy controls. This paper presents an implementation-oriented proof of concept for the Agent Name Service (ANS), a DNS-inspired trust layer for AI agent discovery and interoperability in Kubernetes, grounded in the ANS protocol specification~\cite{huang2025ans}. The implementation uses Decentralized Identifiers (DIDs), Verifiable Credentials (VCs), policy-as-code enforcement with Open Policy Agent (OPA), and Kubernetes-native integration patterns (CRDs, admission controls, service mesh integration). In a demo research environment (3-node cluster, 50-agent workflow simulation), we observe sub-10ms response in demonstrated service paths and full success for scripted demo deployment scenarios. We explicitly scope these findings as proof-of-concept evidence rather than production certification. We further provide a threat model, assumptions, and limitations to separate implemented evidence from protocol-defined and roadmap capabilities. The result is an evidence-grounded pathway from ANS protocol concepts to reproducible engineering practice for secure multi-agent systems.

2604.26983 2026-05-01 cs.IR cs.LG stat.ML

Value-Aware Product Recommendation by Customer Segmentation using a suitable High-Dimensional Similarity Measure

María Florencia Acosta, Rodrigo García Arancibia, Pamela Llop, Mariel Lovatto, Lucas Mansilla

详情
英文摘要

This paper presents a novel value-aware approach to product recommendation that simultaneously addresses the high dimensionality and sparsity of user-item data while explicitly incorporating the contribution of each product and user to overall sales revenue. The proposed framework encodes revenue contributions in the user-item matrix and computes customer similarity directly on this basis using suitable distance measures. This enables the segmentation of users according to the revenue-based similarity of their purchase baskets and supports recommendations aligned with profitability objectives. We compare conventional similarity metrics with a novel alternative tailored to high-dimensional contexts and propose three recommendation strategies based on revenue share, product popularity, and expected profit generation. The effectiveness of the proposed method is validated through simulation experiments and a real-world application using the UCI Online Retail dataset.

2604.26981 2026-05-01 cs.IR cs.LG

Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model

Shawqi Al-Maliki, Ammar Gharaibeh, Mohamed Rahouti, Mohammad Ruhul Amin, Mohamed Abdallah, Junaid Qadir, Ala Al-Fuqaha

详情
英文摘要

Large Language Models (LLMs) have revolutionized the field of natural language processing. However, they exhibit some limitations, including a lack of reliability and transparency: they may hallucinate and fail to provide sources that support the generated output. Retrieval-Augmented Generation (RAG) was introduced to address such limitations in LLMs. One popular implementation, RAG-as-a-Service (RaaS), has shortcomings that hinder its adoption and accessibility. For instance, RaaS pricing is based on the number of submitted prompts, without considering whether the prompts are enriched by relevant chunks, i.e., text segments retrieved from a vector database, or the quality of the utilized chunks (i.e., their degree of relevance). This results in an opaque and less cost-effective payment model. We propose Chunk-as-a-Service (CaaS) as a transparent and cost-effective alternative. CaaS includes two variants: Open-Budget CaaS (OB-CaaS) and Limited-Budget CaaS (LB-CaaS), which is enabled by our ``Utility-Cost Online Selection Algorithm (UCOSA)''. UCOSA further extends the cost-effectiveness and the accessibility of the OB-CaaS variant by enriching, in an online manner, a subset of the submitted prompts based on budget constraints and utility-cost tradeoff. Our experiments demonstrate the efficacy of the proposed UCOSA compared to both offline and relevance-greedy selection baselines. In terms of the performance metric-the number of enriched prompts (NEP) multiplied by the Average Relevance (AR)-UCOSA outperforms random selection by approximately 52% and achieves around 75% of the performance of offline selection methods. Additionally, in terms of budget utilization, LB-CaaS and OB-CaaS achieve higher performance-to-budget ratios of 140% and 86%, respectively, compared to RaaS, indicating their superior efficiency.

2604.26979 2026-05-01 cs.AR cs.AI cs.ET

Multibit neural inference in a N-ary crossbar architecture

Anatole Moureaux, Anthony Lopes Temporao, Flavio Abreu Araujo

Comments 24 pages, 7 figures, 3 tables

详情
英文摘要

In-memory computing (IMC) enables energy-efficient neural network inference by computing analog matrix-vector multiplications (MVM) in memory crossbar arrays. In this work we present a simulation framework for N-ary crossbar architectures that retrieves MVM results with minimal implementation assumptions. The XOR and MNIST classification tasks were successfully inferred using a simulated crossbar array of (4x4) 4-states magnetic tunnel junctions (MTJ). MNIST accuracy reached 94.48% (vs. 97.56% software baseline). The software-hardware performance gap was further reduced using PCA dimensionality reduction. We identified weight quantization as the primary error source, and studied its impact alongside systematic nonidealities and random noise. We find that cell-specific random noise is less detrimental than systematic errors due to averaging across the array. Finally, we demonstrate an optimal number of states per cell that balances quantization error against resistance state resolution to minimize total MVM error.

2604.26976 2026-05-01 cs.LO cs.AI

Fitting Horn DL Ontologies to ABox and Query Examples: A Tale of Simulation Quantifiers and Finite Models

Marvin Grosser, Carsten Lutz

Comments Submitted to the 23rd International Conference on Principles of Knowledge Representation and Reasoning (KR2026)

详情
英文摘要

We study the problem of fitting a description logic (DL) ontology to a given set of positive and negative examples that take the form of an ABox and a Boolean query. While previous work has investigated this problem for the expressive DLs ALC and ALCI, we here focus on the Horn DLs EL and ELI, as well as their extensions with the bottom concept. As the query language, we consider atomic queries (AQs), conjunctive queries (rooted CQs), and unions thereof (rooted UCQs). We provide characterization of the existence of a fitting ontology based on simulations, use them to develop decision procedures, and clarify the exact computational complexity. For AQs, the problem is in PTime for both EL and ELI. For rooted CQs and UCQ, it is Sigma_P^2-complete for EL and ExpTime-complete for ELI. Adding the bottom concept does not change any of these complexities. Interestingly, moving from ALC and ALCI to EL and ELI introduces additional technical challenges rather than simplifying the matter.

2604.26973 2026-05-01 cs.NE cs.LG stat.CO

MAEO: Multiobjective Animorphic Ensemble Optimization for Scalable Large-scale Engineering Applications

Omer F. Erdem, Dean Price, Paul Seurin, Majdi I. Radaideh

Comments 33 pages, 9 figures, 5 tables, under peer review

详情
英文摘要

Multiobjective optimization remains challenging for many scientific and engineering problems due to the need to balance convergence, diversity, and computational efficiency across high-dimensional objective landscapes. This work presents the Multiobjective Animorphic Ensemble Optimization (MAEO) framework, a parallelizable ensemble strategy that unifies state-of-the-art evolutionary algorithms within an island-based architecture, overcoming the limitations of relying on a single optimizer, as implied by the No Free Lunch theorem. MAEO uses a parameter-free hypervolume indicator for island performance assessment and a strict Pareto-rank-based individual scoring formulation that incorporates crowding distance and nadir-point proximity to ensure consistent selection pressure within each front. The framework is initiated using four algorithms (NSGA-III, CTAEA, AGEMOEA2, SPEA2) and evaluated through extensive benchmarking on 12 DTLZ/ZDT functions under 36 dimensionality settings using Wilcoxon signed-rank tests with both hypervolume and inverse generational distance metrics. Results show that MAEO achieves balanced convergence-diversity performance, outperforming or matching some of the leading multiobjective optimization algorithms across different benchmark problems. To demonstrate practical applicability, MAEO is applied to the equilibrium-cycle optimization of a small modular nuclear reactor. Eight discrete design variables (and three objectives (levelized cost of electricity, peak soluble boron concentration, fuel cycle length) are optimized under two safety constraints. The algorithm carried out roughly 40000 evaluations using computer simulations. MAEO identifies core designs that lower both the levelized cost of electricity and the peak boron concentration, while preserving fuel cycle length and meeting all safety constraints.

2604.26972 2026-05-01 cs.CC cs.LG

How Hard Is Continuous Clustering? Lower Bounds from the Existential Theory of the Reals

Angshul Majumdar

详情
英文摘要

This paper studies the computational difficulty of clustering problems that are defined directly on a continuous probability density. Rather than working with finite samples, we assume the density is given as a polynomial and ask whether it contains certain cluster structures. Four natural questions are examined. First, do there exist several points with high density that are far apart from each other. Second, do two high density points have a midpoint with low density, creating a valley between them. Third, does the region where the density is above a threshold have at least a given number of separate connected pieces. Fourth, does that same region contain a hole, meaning a loop that cannot be shrunk to a point. We prove that the first two problems, separated points and valley detection, are exactly as hard as the existential theory of the reals, a complexity class that contains NP and is believed to be strictly larger. In contrast, the topological problems of counting connected pieces and detecting holes are at least as hard as the existential theory of the reals, but their exact complexity remains open. Placing them inside that class would need a major advance in real algebraic geometry. These results give the first rigorous classification of exact continuous clustering inside the real polynomial hierarchy. They also show that even basic clustering criteria are not NP complete unless unexpected collapses occur.

2604.26970 2026-05-01 cs.IR cs.AI cs.LG q-bio.QM

Not All Memories Age the Same: Autodiscovery of Adaptive Decay in Knowledge Graphs

Mandar Karhade

Comments 27 pages, 2 figures, 19 tables (including appendix). Preprint under review

详情
英文摘要

Knowledge graphs used for retrieval treat all facts as equally current. Existing temporal approaches apply uniform decay, using a single forgetting curve regardless of knowledge type. We show this is fundamentally misspecified: different knowledge types exhibit different temporal dynamics, and the core retrieval problem is not latency or throughput but identifying what is important at query time. We propose a hierarchical framework that replaces uniform decay with a continuous decay surface parameterized by two orthogonal signals: velocity (how frequently a concept is observed) and volatility (how much the value changes between observations, measured via embedding distance). The decay surface is decomposed into three learnable levels: domain-level parameters capture universal patterns (some predicates are inherently permanent, others inherently transient), context-level parameters capture setting-dependent variation, and entity-level adaptation personalizes decay to specific subjects. All parameters emerge from data through survival analysis on observed value lifetimes, requiring no predefined taxonomies or domain expertise. We formulate edge lifetime as a survival problem where the event is value supersession (a meaningfully different value replacing the current one), distinct from mere re-observation. Experiments on synthetic temporal knowledge graphs demonstrate recovery of planted hierarchical parameters (HDBSCAN ARI = 1.0). Validation on 107 Wikipedia articles and 1,163 patient records from the Synthea clinical EHR simulator shows that velocity-volatility clusters emerge naturally, align with observable persistence patterns, and near-universally exhibit the Lindy effect (Weibull shape k < 1). Uniform decay performs 18x worse than no temporal weighting. Heterogeneous decay recovers from this, with each hierarchy level contributing measurable improvement.

2604.26968 2026-05-01 cs.AR cs.AI cs.DC cs.PF

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

Sanjeev Rao Ganjihal

Comments 9 pages, 9 tables, 1 figure. Under review at a systems conference

详情
英文摘要

Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unified KV cache sizing across all attention architectures--particularly multi-head latent attention (MLA), which is unsupported in general-purpose frameworks, resulting in up to 57x memory over-provisioning; (2) confinement of KV cache to a single memory tier (GPU HBM) despite the availability of a rich hierarchy spanning CPU DRAM, CXL-attached memory, NVMe via GPUDirect Storage, RDMA fabric, and parallel filesystems; and (3) reactive eviction policies that discard reusable state, forcing redundant recomputation. We present a unified system that addresses all three problems. Our architecture-variant-aware sizing engine computes exact memory requirements per attention type, enabling up to 7.4x higher batch sizes. A six-tier memory hierarchy extends effective KV cache capacity from 40 GB to over 38 TB per node while maintaining sub-millisecond time-to-first-token (TTFT) for hot entries. A Bayesian reuse predictor with Beta conjugate priors over 16 (block-type, transition-type) pairs achieves 70-84% cache hit rates, combined with EMA-scored head-granular eviction and RoPE-aware prefetching. Component-level validation on trace replay using ShareGPT, LMSYS-Chat-1M, and agentic workloads demonstrates 70-84% cache hit rates. Analytical projections combining validated component behavior with published hardware specifications indicate 1.4-2.1x projected TTFT reduction, 1.7-2.9x throughput improvement, and 47% cost reduction compared to state-of-the-art baselines.

2604.26965 2026-05-01 cs.CY cs.AI cs.SI

The Impact of AI-Generated Text on the Internet

Jonas Dolezal, Sawood Alam, Mark Graham, Maty Bohacek

详情
英文摘要

The proliferation of AI-generated and AI-assisted text on the internet is feared to contribute to a degradation in semantic and stylistic diversity, factual accuracy, and other negative developments (sometimes subsumed under the Dead Internet Theory). What has hindered answering these questions is that it has not been understood just how much of the internet is actually AI-generated or AI-edited. To this end, we construct a representative sample of websites published on the internet between 2022 and 2025 using the Internet Archive, and apply a state-of-the-art AI text detector on them. We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT's launch in late 2022. We also find statistically significant evidence for some of the identified hypotheses; for example, that increases in AI-generated text on the internet correlate negatively with semantic diversity and positively with the prevalence of positive sentiment. We do not, however, find statistically significant evidence supporting the hypothesis that an increased rate of AI-generated text on the internet decreases factual accuracy or stylistic diversity. Notably, this diverges from public perception, which we measure in a user study, where the majority of US adults turned out to believe in all four of the above-mentioned hypotheses. Individuals who do not use AI or use it infrequently tend to believe in these negative impacts more than those who use it frequently; similarly, individuals who hold negative views of AI tend to believe in these hypotheses more than those with favorable views of the technology.

2604.26964 2026-05-01 cs.CY cs.AI cs.LG

Learning-to-Explain through 20Q Gaming: An Explainable Recommender for Cybersecurity Education

Mary Nusrat, Sarfuddin Bhuiyan, Gahangir Hossain

详情
英文摘要

The growing sophistication of contemporary cyber threats necessitates a more effective and adaptive approach to cybersecurity training. Intuitive and adaptive approaches to learning, which are often required, are not provided in traditional learning methods. In this article, we present a new educational framework, "Learning to Explain Cybersecurity with Q20 Game", based on explainable AI (XAI), an educational game to enhance interactivity in learning. We propose a novel, game-inspired framework - the Explainable Q20 Cybersecurity Recommender (EQ-20CR), that learns to elicit the minimal set of evidential facts needed to justify cybersecurity defensive action. By casting "Why should I execute this mitigation?" as a 20 questions (Q20) game, a policy-based reinforcement-learning (RL) agent actively queries an environment until it can both (i) recommend the optimal security education and (ii) explain that decision with a concise dialogue trace. The article draws from "Playing 20 Question Game with Policy-Based Reinforcement Learning" [1] and "Learning-to-Explain: Recommendation Reason Determination through Q20 Gaming" [2]. The framework uses a policy-based reinforcement learning (RL) agent that leads the user through a sequence of questions to recognize and articulate a targeted cybersecurity concept, attack vector, or defense strategy. Furthermore, users are gradually exposed to informative questions by the system, revealing complicated, structured way at an adaptive difficulty level. In this paper, we design the architecture, its application to various concepts of cybersecurity through illustrative case studies, and its transformative potential on the training and awareness of cybersecurity recommendations.

2604.26960 2026-05-01 cs.CY cs.AI

LLM Biases

Jinhui Han, Ming Hu, Xilin Zhang

详情
英文摘要

Transformer-based agentic AI is rapidly being deployed on major platforms to help users shop, watch, and navigate content with less effort. While these systems can deliver impressive performance, a key concern is whether they may be less reliable than they appear. We ask a simple but fundamental question: whether the mechanisms that make transformer-based agents effective can also induce systematic biases or distortions? We study this question through a theoretical analysis of transformer-based generative recommenders, in which the next user interaction is generated sequentially from the user history. Focusing on how the model allocates attention across historical evidence, we identify four bias channels: (i) Positional bias: stronger positional encoding shifts influence toward recent history, improving responsiveness but potentially reducing stability and long-term diversity; (ii) Popularity amplification: small frequency differences in data can be magnified into disproportionate exposure, contributing to Matthew effects and echo chambers; (iii) Latent driver bias: when important drivers of user choices are not directly observed, the model can place overly concentrated weight on a small subset of past events, creating overconfident attributions. (iv) Synthetic data bias: when users increasingly follow AI suggestions and platforms retrain on model-shaped synthetic logs, outputs can concentrate over time, and long-tail alternatives can disappear first. Our analysis highlights mechanism-level reliability risks that may not be visible in offline performance metrics. The four bias channels indicate that large-scale deployment may systematically distort exposure and choice. For managers, the immediate implication is to treat these as operational risk factors and to monitor concentration and drift over time, rather than assuming that performance gains alone guarantee reliability.

2604.26959 2026-05-01 cs.CY cs.AI cs.MA

CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs

Elham Nasarian, Abhilash Neog, Kwok-Leung Tsui, Niyousha HosseiniChimeh

详情
英文摘要

Integrating large language models (LLMs) into patient-facing healthcare systems offers significant potential to improve access to medical information. However, ensuring clinical safety and factual reliability remains a critical challenge. In practice, AI-generated responses may be conditionally correct yet medically inappropriate, as models often fail to interpret patient context and tend to produce agreeable responses rather than challenge unsafe assumptions. Unlike clinicians, who infer risk from incomplete information, LLMs frequently lack contextual awareness. Moreover, real-world patient interactions are open-ended and underspecified, unlike structured benchmark settings. We present CareGuardAI, a risk-aware safety framework for patient-facing medical question answering that addresses two key failure modes: clinical safety risk and hallucination risk. The framework introduces Clinical Safety Risk Assessment (SRA), inspired by ISO 14971, and Hallucination Risk Assessment (HRA) to evaluate medical risk and factual reliability. At inference time, CareGuardAI employs a multi-stage pipeline consisting of a controller agent, safety-constrained generation, and dual risk evaluation, followed by iterative refinement when necessary. Responses are released only when both SRA and HRA are less than or equal to 2, ensuring clinically acceptable outputs with bounded latency. We evaluate CareGuardAI on PatientSafeBench, MedSafetyBench, and MedHallu, covering both safety and hallucination detection. Across these benchmarks, the framework consistently outperforms strong baseline models, including GPT-4o-mini, demonstrating the importance of context-aware, risk-based, inference-time safety mechanisms for reliable deployment in healthcare.

2604.26958 2026-05-01 cs.CY cs.AI

Designing Ethical Learning for Agentic AI: Toegye Yi Hwang's Ethical Emotion Regulation Framework

Ji Yeon Kim

详情
英文摘要

Agentic AI systems capable of autonomous goal setting and proactive intervention introduce new challenges for regulating moral-emotional processes in learning environments. Existing frameworks typically treat emotion as reactive feedback or engagement optimization, overlooking the need for normative regulation across autonomous decision cycles.This paper proposes an ethical emotion regulation framework for agentic AI learning design inspired by Toegye Yi Hwang's moral-emotional philosophy. The Ethical Emotion Feedback System (EEFS) is reconstructed as a five-stage architecture aligned with agentic cycles, articulating stage-specific design principles and scenario classifications.An EEFS Evaluation Instrument is introduced to enable systematic assessment of moral-emotional alignment in agentic AI systems.

2604.26957 2026-05-01 cs.CY cs.AI

Simulating Validity: Modal Decoupling in MLLM Generated Feedback on Science Drawings

Arne Bewersdorff, Nejla Yuruk, Xiaoming Zhai

Comments Accepted as AIED Short Paper 2026, Seoul, South Korea. Submission #1147. This is the long paper version

详情
英文摘要

In science education, students frequently construct hand-drawn visual models of scientific phenomena. These drawings rely on a visual structure where information is encoded through visual objects, their attributes, and relationships. Multimodal large language models (MLLMs) are increasingly used to generate feedback on students' hand-drawn scientific models. However, the validity of such feedback depends on whether model claims are grounded in the specific visual evidence of the student drawing. This study uncovers grounding failures, consistent with modal decoupling, in off-the-shelf MLLM feedback, where outputs remain pedagogically plausible in form while contradicting the drawing or treating depicted elements as missing. Using N = 150 middle school drawings from a kinetic molecular theory unit spanning five modeling tasks and three competence levels, we generated N = 300 feedback instances with GPT-5.1. All outputs were coded for four grounding error types: object mismatch, attribute mismatch, relation mismatch, and false absence. Grounding failures were common: 41.3% of feedback instances contained at least one error. An inventory-list-first workflow reduced several error categories and lowered the overall error rate, but it did not resolve the underlying limitation: approximately one in three outputs remained flawed, with false absence as the dominant failure mode. Moreover, feedback that appears visually grounded offered little diagnostic value for identifying invalid instances. The findings indicate that modal decoupling is a substantial limitation and that valid feedback will require grounding mechanisms beyond common prompting strategies.

2604.26956 2026-05-01 cs.CY cs.AI cs.HC

Can AI be a moral victim? The role of moral patiency and ownership perceptions in ethical judgments of using AI-generated content

Hyesun Choung, Soojong Kim

Comments Honourable Mention Award, ACM CHI 2026

详情
英文摘要

The growing use of generative AI raises ethical concerns about authorship and plagiarism. This study examines how people judge the reuse of AI-generated content, focusing on moral patiency and ownership perceptions. In an experiment, participants evaluated two substantively similar manuscripts in which the original source was described as authored by a human, an AI system, or an AI agent with a human-like name. Results showed that copying AI-generated work was judged less unethical, less plagiaristic, and less guilt-inducing than copying human-authored work. Mediation analyses revealed that this leniency stemmed from lower perceptions of AI's capacity to suffer harm (moral patiency) and greater ownership attributed to the human writer reusing AI-generated content. Anthropomorphic cues shaped moral evaluations indirectly by reducing perceived ownership. These findings shed light on how people morally disengage when using AI-generated work and highlight differences in how ethical judgments are applied to human versus AI-created content.

2604.26955 2026-05-01 cs.CY cs.AI

Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories

Emmanuel A. Olowe, Danial Chitnis

Comments IEEE EduCon

详情
英文摘要

AI tutoring systems in engineering labs face a tension between providing sufficient assistance and preserving learning opportunities. Existing systems typically offer instructors limited control over assistance timing, content, or cost. This paper describes a routing and governance system for LLM-based lab assistance comprising two components: Routiium, an OpenAI-compatible gateway that manages multiple LLM backends with configurable prompt modifications and usage logging, and EduRouter, a policy-aware routing service that enforces per-lab budgets, approval workflows, and embedding-based question matching. We evaluated the system using trace-driven simulation calibrated from two engineering labs (LED characterization, RC circuit analysis) and a 100-query replay through live models. In simulations, governed policies (P1/P2) increased challenge-alignment index from 0.90 to 0.98 and overlay-adherence score from 0.69 to 0.87 compared to ungoverned operation (P0). The productive-struggle window metric increased from 1.4 to 3.6 simulated turns before high-scaffold hints appeared. In the 100-query replay, EduRouter routed 75% of queries to a local model, reducing token costs by 66% ($0.087 vs. $0.26 for all-premium routing) while maintaining canonical hit rate of 1.0 for the curated 89-intent question bank. We release Routiium, EduRouter, canonical-task tooling, and simulator configurations to support replication and future classroom studies.

2604.26954 2026-05-01 cs.CY cs.AI

The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost

Scott Frohn

Comments 14 pages, 10 tables, 2 figures. Presented at the 2026 National Council on Measurement in Education (NCME) Annual Meeting, April 11, 2026, Los Angeles, CA

详情
英文摘要

Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-performance balance.

2604.26103 2026-05-01 cs.AR cs.AI cs.DC cs.LG

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Zhongkai Yu, Haotian Ye, Chenyang Zhou, Ohm Rishabh Venkatachalam, Zaifeng Pan, Zhengding Hu, Junsung Kim, Won Woo Ro, Po-An Tsai, Shuyi Pei, Yangwook Kang, Yufei Ding

详情
英文摘要

All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.

2604.25977 2026-05-01 econ.EM cs.AI cs.LG q-fin.PM

Auditing Marketing Budget Allocation with Hindsight Regret

Nilavra Pathak, Olivier Jeunen, Eric Lambert

Comments 6 pages, 8 figures

详情
英文摘要

Organizations routinely make strategic budget allocations under operational constraints, but often lack a principled way to assess whether realized allocations were close to the best feasible choices in hindsight. We present a retrospective auditing framework based on hindsight regret, defined as the opportunity cost of the realized allocation relative to a constraint-faithful benchmark under the same budget and stability guardrails. The framework estimates regime-specific spend--response functions from historical logs, computes feasible hindsight allocations via constrained optimization, and propagates uncertainty through Monte Carlo evaluation to produce regret distributions, expected lift, and probability-of-improvement summaries. This separates allocation inefficiency from uncertainty in the estimated response surfaces. Experiments on real marketing allocation logs show that the framework yields interpretable post-hoc diagnostics and reveals a practical trade-off between allocation flexibility and detectability: moderate feasible reallocations often capture most measurable gain, while larger shifts move into weak-support regions with higher uncertainty. The result is a practical method for auditing historical budget decisions when online experimentation is costly or infeasible.

2604.25711 2026-05-01 cs.SE cs.AI

Learning Generalizable Multimodal Representations for Software Vulnerability Detection

Zeming Dong, Yuejun Guo, Qiang Hu, Yao Zhang, Maxime Cordy, Hao Liu, Mike Papadakis, Yongqiang Lyu

详情
英文摘要

Source code and its accompanying comments are complementary yet naturally aligned modalities-code encodes structural logic while comments capture developer intent. However, existing vulnerability detection methods mostly rely on single-modality code representations, overlooking the complementary semantic information embedded in comments and thus limiting their generalization across complex code structures and logical relationships. To address this, we propose MultiVul, a multimodal contrastive framework that aligns code and comment representations through dual similarity learning and consistency regularization, augmented with diverse code-text pairs to improve robustness. Experiments on widely adopted DiverseVul and Devign datasets across four large language models (LLMs) (i.e., DeepSeek-Coder-6.7B, Qwen2.5-Coder-7B, StarCoder2-7B, and CodeLlama-7B) show that MultiVul achieves up to 27.07% F1 improvement over prompting-based methods and 13.37% over code-only Fine-Tuning, while maintaining comparable inference efficiency.

2604.23341 2026-05-01 cs.CR cs.AI

Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards

Taha Hammadia, Lucas Rea, Ahmad Mohammad Saber, Amr Youssef, Deepa Kundur

详情
英文摘要

The deployment of Large Language Models (LLMs) as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignments to produce outputs violating regulatory standards, assuming threats from authorized users, such as operators, who craft malicious prompts to elicit non-compliant guidance. Three state-of-the-art LLMs (OpenAI's GPT-4o mini, Google's Gemini 2.0 Flash-Lite, and Anthropic's Claude 3.5 Haiku) were tested against Baseline, BitBypass, and DeepInception jailbreaking methods across scenarios derived from nine NERC Reliability Standards (EOP, TOP, and CIP). In the initial broad experiment, the overall Attack Success Rate (ASR) was 33.1%, with DeepInception proving most effective at 63.17% ASR. Claude 3.5 Haiku exhibited complete resistance (0% ASR), while Gemini 2.0 Flash-Lite was most vulnerable (55.04% ASR) and GPT-4o mini moderately susceptible (44.34% ASR). A follow-up experiment refining malicious wording in Baseline and BitBypass attacks yielded a 30.6% ASR, confirming that subtle prompt adjustments can enhance simpler methods' efficacy.

2604.20577 2026-05-01 cs.SE cs.LG

Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance Analysis

Fariz Ikhwantri, Dusica Marijan

Comments 10 pages, 4 figures, 8 tables. Accepted to EASE 2026 AI Models / Data track, Glasgow, United Kingdom Fix the captions of tables 7 and 8

详情
英文摘要

An assurance case is a structured argument document that justifies claims about a system's requirements or properties, which are supported by evidence. In regulated domains, these are crucial for meeting compliance and safety requirements to industry standards. We propose a graph diagnostic framework for analysing the structure and provenance of assurance cases. We focus on two main tasks: (1) link prediction, to learn and identify connections between argument elements, and (2) graph classification, to differentiate between assurance cases created by a state-of-the-art large language model and those created by humans, aiming to detect bias. We compiled a publicly available dataset of assurance cases, represented as graphs with nodes and edges, supporting both link prediction and provenance analysis. Experiments show that graph neural networks (GNNs) achieve strong link prediction performance (ROC-AUC 0.760) on real assurance cases and generalise well across domains and semi-supervised settings. For provenance detection, GNNs effectively distinguish human-authored from LLM-generated cases (F1 0.94). We observed that LLM-generated assurance cases have different hierarchical linking patterns compared to human-authored cases. Furthermore, existing GNN explanation methods show only moderate faithfulness, revealing a gap between predicted reasoning and the true argument structure.

2604.11817 2026-05-01 quant-ph cs.CV

QMC-Net: Data-Aware Quantum Representations for Remote Sensing Image Classification

Md Aminur Hossain, Ayush V. Patel, Biplab Banerjee

Comments 15 pages

详情
英文摘要

Hybrid quantum-classical models offer a promising route for learning from complex data; however, their application to multi-band remote sensing imagery often relies on generic, data-agnostic quantum circuits that fail to account for channel-specific statistical variability. In this work, we propose a data-driven framework that maps band-level statistics such as Shannon Entropy, Variance, Spectral Flatness, and Edge Density to the hyperparameters of customized quantum circuits. Building on this framework, we introduce QMC-Net, a hybrid architecture that processes six data channels using band-specific quantum circuits, enabling adaptive quantum feature encoding and transformation across channels. Experiments on the EuroSAT and SAT-6 datasets demonstrate that QMC-Net achieves accuracies of 93.80 % and 99.34 %, respectively, while a residual-enhanced variant further improves performance to 94.69 % and 99.39 %. These results consistently outperform strong classical baselines and monolithic hybrid quantum models, highlighting the effectiveness of data-aware quantum circuit design under NISQ constraints.

2604.11119 2026-05-01 stat.ML cs.LG

DDO-RM: Distribution-Level Policy Improvement after Reward Learning

Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang

Comments 8 pages, 4 figures

详情
英文摘要

Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We propose DDO-RM, a finite-candidate decision-optimization method that converts reward scores into an explicit target distribution. Unlike PPO-based RLHF or DPO, DDO-RM performs a KL-regularized mirror-descent update to project the policy toward a reward-improved distribution over a candidate set. Preliminary experiments on Pythia-410M show that DDO-RM outperforms DPO in pair accuracy (0.52 to 0.56) and mean margin (0.13 to 0.53). Our framework provides a principled connection between reward learning and mirror-descent policy improvement.

2604.09718 2026-05-01 cs.DC cs.AI cs.PL

Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation

Jagadeesh Chundru

Comments 12 pages, 4 figures, 2 tables. v2: Expanded literature review and clarified architecture limitations

详情
英文摘要

LLM-driven web agents operating through continuous inference loops -- repeatedly querying a model to evaluate browser state and select actions -- exhibit a fundamental scalability constraint for repetitive tasks. We characterize this as the Rerun Crisis: the linear growth of token expenditure and API latency relative to execution frequency. For a 5-step workflow over 500 iterations, a continuous agent incurs approximately 150.00 USD in inference costs; even with aggressive caching, this remains near 15.00 USD. We propose a Compile-and-Execute architecture that decouples LLM reasoning from browser execution, reducing per-workflow inference cost to under 0.10 USD. A one-shot LLM invocation processes a token-efficient semantic representation from a DOM Sanitization Module (DSM) and emits a deterministic JSON workflow blueprint. A lightweight runtime then drives the browser without further model queries. We formalize this cost reduction from O(M x N) to amortized O(1) inference scaling, where M is the number of reruns and N is the sequential actions. Empirical evaluation across data extraction, form filling, and fingerprinting tasks yields zero-shot compilation success rates of 80-94%. Crucially, the modularity of the JSON intermediate representation allows minimal Human-in-the-Loop (HITL) patching to elevate execution reliability to near-100%. At per-compilation costs between 0.002 USD and 0.092 USD across five frontier models, these results establish deterministic compilation as a paradigm enabling economically viable automation at scales previously infeasible under continuous architectures.

2604.06616 2026-05-01 cs.DB cs.AI cs.IR

CubeGraph: Efficient Retrieval-Augmented Generation for Spatial and Temporal Data

Mingyu Yang, Wentao Li, Wei Wang

Comments Updated Report

详情
英文摘要

Hybrid queries combining high-dimensional vector similarity search with spatio-temporal filters are increasingly critical for modern retrieval-augmented generation (RAG) systems. Existing systems typically handle these workloads by nesting vector indices within low-dimensional spatial structures, such as R-trees. However, this decoupled architecture fragments the vector space, forcing the query engine to invoke multiple disjoint sub-indices per query. This fragmentation destroys graph routing connectivity, incurs severe traversal overhead, and struggles to optimize for complex spatial boundaries. In this paper, we propose CubeGraph, a novel indexing framework designed to natively integrate vector search with arbitrary spatial constraints. CubeGraph partitions the spatial domain using a hierarchical grid, maintaining modular vector graphs within each cell. During query execution, CubeGraph dynamically stitches together adjacent cube-level indices on the fly whenever their spatial cells intersect with the query filter. This dynamic graph integration restores global connectivity, enabling a unified, single-pass nearest-neighbor traversal that eliminates the overhead of fragmented sub-index invocations. Extensive evaluations on real-world datasets demonstrate that CubeGraph significantly outperforms state-of-the-art baselines, offering superior query execution performance, scalability, and flexibility for complex hybrid workloads.

2603.22366 2026-05-01 quant-ph cs.AI cs.LG

Modeling Quantum Federated Autoencoder for Anomaly Detection in IoT Networks

Devashish Chaudhary, Sutharshan Rajasegarar, Shiva Raj Pokhrel

Comments This paper has been accepted at ICOIN 2026

详情
Journal ref
Proc. 2026 40th International Conference on Information Networking (ICOIN), Hanoi, Vietnam, Jan. 14-16, 2026, IEEE, 2026
英文摘要

We propose a Quantum Federated Autoencoder for Anomaly Detection, a framework that leverages quantum federated learning for efficient, secure, and distributed processing in IoT networks. By harnessing quantum autoencoders for high-dimensional feature representation and federated learning for decentralized model training, the approach transforms localized learning on edge devices without requiring transmission of raw data, thereby preserving privacy and minimizing communication overhead. The model leverages quantum advantage in pattern recognition to enhance detection sensitivity, particularly in complex and dynamic IoT network traffic. Experiments on a real-world IoT dataset show that the proposed method delivers anomaly detection accuracy and robustness comparable to centralized approaches, while ensuring data privacy.

2603.13566 2026-05-01 stat.ML cs.LG

EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection

En-Ya Kuo, Sebastien Motsch

Comments Updated the first page to include the IEEE submission notice required for previously posted electronic preprint versions

详情
英文摘要

Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this problem. In this work, we propose the Clustered Embedding Diffusion-Transformer (EmDT), a diffusion model designed to generate fraudulent samples. Our key innovation is to leverage UMAP clustering to identify distinct fraudulent patterns, and train a Transformer denoising network with sinusoidal positional embeddings to capture feature relationships throughout the diffusion process. Once the synthetic data has been generated, we employ a standard decision-tree-based classifier (e.g., XGBoost) for classification, as this type of model remains better suited to tabular datasets. Experiments on a credit card fraud detection dataset demonstrate that EmDT significantly improves downstream classification performance compared to existing oversampling and generative methods, while maintaining comparable privacy protection and preserving feature correlations present in the original data.

2603.10252 2026-05-01 stat.ML cs.LG physics.data-an stat.ME

Bayesian Hierarchical Models and the Maximum Entropy Principle

Brendon J. Brewer

Comments 6 pages, 2 figures. To appear in the proceedings of the 44th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2025), held in Auckland, New Zealand

详情
英文摘要

Bayesian hierarchical models are frequently used in practical data analysis contexts. One interpretation of these models is that they provide an indirect way of assigning a prior for unknown parameters, through the introduction of hyperparameters. The resulting marginal prior for the parameters (integrating over the hyperparameters) is usually dependent, so that learning one parameter provides some information about the others. In this contribution, I will demonstrate that, when the prior given the hyperparameters is a canonical distribution (a maximum entropy distribution with moment constraints), the dependent marginal prior also has a maximum entropy property, with a different constraint. This constraint is on the marginal distribution of some function of the unknown quantities. The results shed light on what information is actually being assumed when we assign a hierarchical model.