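arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems science, and other fields.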

2604.05278 2026-04-08 cs.SE cs.AI cs.MA

Spec Kit Agents: Context-Grounded Agentic Workflows

Pardis Taghavi, Santosh Bhavani

Abstract

Spec-driven development (SDD) with AI coding agents provides a structured workflow, but agents often remain "context blind" in large, evolving repositories, leading to hallucinated APIs and architectural violations. We present Spec Kit Agents, a multi-agent SDD pipeline (with PM and developer roles) that adds phase-level, context-grounding hooks. Read-only probing hooks ground each stage (Specify, Plan, Tasks, Implement) in repository evidence, while validation hooks check intermediate artifacts against the environment. We evaluate 128 runs covering 32 features across five repositories. Context-grounding hooks improve judged quality by +0.15 on a 1-5 composite LLM-as-judge score (+3.0 percent of the full score; Wilcoxon signed-rank, p < 0.05) while maintaining 99.7-100 percent repository-level test compatibility. We further evaluate the framework on SWE-bench Lite, where augmentation hooks improve baseline by 1.7 percent, achieving 58.2 percent Pass@1.
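
The "+3.0 percent of the full score" figure in the abstract above can be reproduced with a short worked calculation. This is a reading of the abstract, not a statement from the paper: it assumes the gain is normalized by the maximum score of 5 rather than by the 4-point width of the 1-5 range, and it assumes the SWE-bench gain of 1.7 is in percentage points.

```latex
% Judged-quality gain as a share of the full score (assumed denominator: 5)
\[
\frac{0.15}{5} = 0.03 = 3.0\%
\]
% For comparison, normalizing by the range width 5 - 1 = 4 would give
\[
\frac{0.15}{4} = 0.0375 = 3.75\%
\]
% SWE-bench Lite baseline implied under the percentage-point assumption
\[
58.2\% - 1.7\% = 56.5\%
\]
```

Only the first normalization matches the reported 3.0 percent, which suggests the percentage is taken relative to the top of the scale.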

2604.05253 2026-04-08 cs.IR cs.LG

Spike Hijacking in Late-Interaction Retrieval

Karthik Suresh, Tushar Vatsa, Tracy King, Asim Kadav, Michael Friedrich

Comments: Accepted at the 1st Late Interaction Retrieval Workshop (LIR 2026) at ECIR 2026. Published in CEUR Workshop Proceedings

2604.05175 2026-04-08 eess.SP cs.IT cs.LG math.IT

Graph Signal Diffusion Models for Wireless Resource Allocation

Yigit Berkay Uslu, Samar Hadou, Shirin Saeedi Bidokhti, Alejandro Ribeiro

Comments: Under review for SPAWC'26

2604.05166 2026-04-08 cs.HC cs.AI

From Use to Oversight: How Mental Models Influence User Behavior and Output in AI Writing Assistants

Shalaleh Rismani, Su Lin Blodgett, Q. Vera Liao, Alexandra Olteanu, AJung Moon

2604.05159 2026-04-08 cs.SE cs.AI cs.CL

Planning to Explore: Curiosity-Driven Planning for LLM Test Generation

Alfonso Amayuelas, Firas Laakom, Piotr Piękos, Wenyi Wang, Yifan Xu, Yuhui Wang, Jürgen Schmidhuber, William Wang

2604.05156 2026-04-08 eess.SY cs.RO cs.SY

Synchronous Observer Design for Landmark-Inertial SLAM with Magnetometer and Intermittent GNSS Measurements

Arkadeep Saha, Pieter van Goor, Ravi Banavar

Comments: 8 pages, 2 figures. This work has been submitted to CDC 2026

2604.05150 2026-04-08 cs.SE cs.AI

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

Geert Trooskens, Aaron Karlsberg, Anmol Sharma, Lamara De Brouwer, Max Van Puyvelde, Matthew Young, John Thickstun, Gil Alterovitz, Walter A. De Brouwer

Comments: 14 pages, 2 figures, 3 tables

2604.05137 2026-04-08 cs.PL cs.AI cs.CL cs.LG cs.SE

EffiPair: Improving the Efficiency of LLM-generated Code with Relative Contrastive Feedback

Samira Hajizadeh, Suman Jana

2604.05125 2026-04-08 cs.IR cs.AI cs.CL cs.LG

Offline RL for Adaptive Policy Retrieval in Prior Authorization

Ruslan Sharifullin, Maxim Gorshkov, Hannah Clay

Comments: 9 pages, 7 figures, 6 tables

2604.05119 2026-04-08 cs.MA cs.LG

Governance-Aware Agent Telemetry for Closed-Loop Enforcement in Multi-Agent AI Systems

Anshul Pathak, Nishant Jain

2604.05115 2026-04-08 cs.ET cs.LG

Probabilistic Tree Inference Enabled by FDSOI Ferroelectric FETs

Pengyu Ren, Xingtian Wang, Boyang Cheng, Jiahui Duan, Giuk Kim, Xuezhong Niu, Halid Mulaosmanovic, Stefan Duenkel, Sven Beyer, X. Sharon Hu, Ningyuan Cao, Kai Ni

2604.05113 2026-04-08 cs.IR cs.AI

CRAB: Codebook Rebalancing for Bias Mitigation in Generative Recommendation

Zezhong Fan, Ziheng Chen, Luyi Ma, Jin Huang, Lalitesh Morishetti, Kaushiki Nag, Sushant Kumar, Kannan Achan

Comments: Generative Recommendation

2604.05108 2026-04-08 eess.SY cs.RO cs.SY

Differentiable Invariant Sets for Hybrid Limit Cycles with Application to Legged Robots

Varun Madabushi, Akash Harapanahalli, Samuel Coogan, Maegan Tucker

2604.05102 2026-04-08 eess.SY cs.RO cs.SY

Finite-Step Invariant Sets for Hybrid Systems with Probabilistic Guarantees

Varun Madabushi, Elizabeth Dietrich, Hanna Krasowski, Maegan Tucker

2604.05100 2026-04-08 cs.SE cs.AI

Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

Amir M. Ebrahimi, Gopi Krishnan Rajbahadur

Abstract

Instructed code editing, where an LLM modifies existing code based on a natural language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks directly evaluate this capability. From a survey of over 150 code-related benchmarks, we find that only two, CanItEdit and EDIT-Bench, target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems. Both benchmarks concentrate over 90% of evaluation on Python while TypeScript, GitHub's most-used language, is absent. Backend and frontend development, which together constitute 46% of real-world editing activity, are largely missing, and documentation, testing, and maintenance edits (31.4% of human PRs) have zero representation. Both benchmarks have modest test counts (CanItEdit median 13, EDIT-Bench median 4), though CanItEdit compensates with near-complete whole-file coverage and fail-before/pass-after validation. 59% of EDIT-Bench's low-coverage suites would not detect modifications outside the edit region. EDIT-Bench has 15 problems that are not solved by any of 40 LLMs and 11 of these problems trace failures to poor benchmark artifacts rather than model limitations. Further, 29% of EDIT-Bench problems and 6% of CanItEdit problems share a codebase with at least one other problem within the benchmark. In summary, these benchmarks measure a narrower construct than deployment decisions require. We therefore propose six empirically grounded desiderata and release all audit artifacts so the community can build instructed code-editing benchmarks whose scores reliably reflect real-world editing capability.

2604.05088 2026-04-08 eess.SY cs.LG cs.SY

Scalar Federated Learning for Linear Quadratic Regulator

Mohammadreza Rostami, Shahriar Talebi, Solmaz S. Kia

2604.05080 2026-04-08 cs.SE cs.AI cs.LO cs.MA

Nidus: Externalized Reasoning for AI-Assisted Engineering

Danil Gorinevski

Comments: 19 pages, 3 figures, 5 tables. Evaluated on self-hosting deployment. Patent pending (CH000371/2026)

2604.05076 2026-04-08 cs.MA cs.MM cs.SD

GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

Zihao Lin, Haibo Wang, Zhiyang Xu, Siyao Dai, Huanjie Dong, Xiaohan Wang, Yolo Y. Tang, Yixin Wang, Qifan Wang, Lifu Huang

Comments: 14 pages, 4 figures, under review

Abstract

Music-grounded mashup video creation is a challenging form of video non-linear editing, where a system must compose a coherent timeline from large collections of source videos while aligning with music rhythm, user intent, story completeness, and long-range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, limiting their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global-local coordination multi-agent framework for music-grounded nonlinear video editing. GLANCE adopts a bi-loop architecture for better editing practice: an outer loop performs long-horizon planning and task-graph construction, and an inner loop adopts the "Observe-Think-Act-Verify" flow for segment-wise editing tasks and their refinements. To address the cross-segment and global conflict emerging after subtimelines composition, we introduce a dedicated global-local coordination mechanism with both preventive and corrective components, which includes a novelly designed context controller, conflict region decomposition module, and a bottom-up dynamic negotiation mechanism. To support rigorous evaluation, we construct MVEBench, a new benchmark that factorizes editing difficulty along task type, prompt specificity, and music length, and propose an agent-as-a-judge evaluation framework for scalable multi-dimensional assessment. Experimental results show that GLANCE consistently outperforms prior research baselines and open-source product baselines under the same backbone models. With GPT-4o-mini as the backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively. Human evaluation further confirms the quality of the generated videos and validates the effectiveness of the proposed evaluation framework.

2604.05066 2026-04-08 cs.PL cs.AI cs.PF

AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels

Yifan Zhu, Yekai Pan, Yanghui Wu, Chen Ding

2604.05034 2026-04-08 hep-ph cs.LG hep-th

Learning to Unscramble Feynman Loop Integrals with SAILIR

David Shih

Comments: 16 pages, 3 figures, 5 tables, work done in collaboration with Claude Code

2604.05012 2026-04-08 cs.AR cs.AI

Comparative Characterization of KV Cache Management Strategies for LLM Inference

Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

2604.05008 2026-04-08 stat.ML cs.LG q-fin.MF q-fin.ST

Generative Path-Law Jump-Diffusion: Sequential MMD-Gradient Flows and Generalisation Bounds in Marcus-Signature RKHS

Daniel Bloch

2604.05000 2026-04-08 cs.SE cs.AI

Closed-Loop Autonomous Software Development via Jira-Integrated Backlog Orchestration: A Case Study in Deterministic Control and Safety-Constrained Automation

Elias Calboreanu

Comments: 27 pages, 7 figures, 5 tables. Submitted to Automated Software Engineering (Springer)

2604.04997 2026-04-08 cs.IR cs.AI cs.CL cs.CV cs.LG

Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges

Rong Lu, Hao Liu, Song Hou

Comments: Accepted at the IMAGE'25 Workshop (PCW-11), Society of Exploration Geophysicists (SEG). Published version available at https://doi.org/10.1190/image2025-w11-03.1

2604.04993 2026-04-08 stat.ML cs.CR cs.LG stat.ME

The Hiremath Early Detection (HED) Score: A Measure-Theoretic Evaluation Standard for Temporal Intelligence

Prakul Sunil Hiremath

Comments: 11 pages. Introduces a measure-theoretic framework for predictive velocity including the Hiremath Standard Table. Dedicated to the Hiremath lineage

Abstract

We introduce the Hiremath Early Detection (HED) Score, a principled, measure-theoretic evaluation criterion for quantifying the time-value of information in systems operating over non-stationary stochastic processes subject to abrupt regime transitions. Existing evaluation paradigms, chiefly the ROC/AUC framework and its downstream variants, are temporally agnostic: they assign identical credit to a detection at t + 1 and a detection at t + tau for arbitrarily large tau. This indifference to latency is a fundamental inadequacy in time-critical domains including cyber-physical security, algorithmic surveillance, and epidemiological monitoring. The HED Score resolves this by integrating a baseline-neutral, exponentially decaying kernel over the posterior probability stream of a target regime, beginning precisely at the onset of the regime shift. The resulting scalar simultaneously encodes detection acuity, temporal lead, and pre-transition calibration quality. We prove that the HED Score satisfies three axiomatic requirements: (A1) Temporal Monotonicity, (A2) Invariance to Pre-Attack Bias, and (A3) Sensitivity Decomposability. We further demonstrate that the HED Score admits a natural parametric family indexed by the Hiremath Decay Constant (lambda_H), whose domain-specific calibration constitutes the Hiremath Standard Table. As an empirical vehicle, we present PARD-SSM (Probabilistic Anomaly and Regime Detection via Switching State-Space Models), which couples fractional Stochastic Differential Equations (fSDEs) with a Switching Linear Dynamical System (S-LDS) inference backend. On the NSL-KDD benchmark, PARD-SSM achieves a HED Score of 0.0643, representing a 388.8 percent improvement over a Random Forest baseline (0.0132), with statistical significance confirmed via block-bootstrap resampling (p < 0.001). We propose the HED Score as the successor evaluation standard to ROC/AUC.
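
The abstract above describes the score only in words: a baseline-neutral, exponentially decaying kernel integrated over the posterior probability stream, starting at the regime onset. The following is a minimal sketch of one way such a latency-aware score could be computed. It is not the paper's definition: the function name `decayed_detection_score`, the discrete-time sum, the subtraction of a constant `baseline`, and the parameter names `decay` and `dt` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def decayed_detection_score(
    posteriors: Sequence[float],
    onset: int,
    decay: float,
    baseline: float,
    dt: float = 1.0,
) -> float:
    """Hypothetical latency-aware score (illustrative, not the paper's formula).

    Sums exp(-decay * elapsed) * (posterior - baseline) * dt over all samples
    from the onset index onward, so probability mass assigned to the target
    regime soon after onset counts more than the same mass assigned later.
    """
    if not 0 <= onset <= len(posteriors):
        raise ValueError("onset must be an index within the posterior stream")
    total = 0.0
    for k, p in enumerate(posteriors[onset:]):
        elapsed = k * dt                    # time since the regime shift
        weight = math.exp(-decay * elapsed)  # exponentially decaying kernel
        total += weight * (p - baseline) * dt
    return total


# Two detectors that reach the same posterior, one earlier than the other.
early = [0.1, 0.1, 0.9, 0.9, 0.9]
late = [0.1, 0.1, 0.1, 0.1, 0.9]
early_score = decayed_detection_score(early, onset=2, decay=0.5, baseline=0.1)
late_score = decayed_detection_score(late, onset=2, decay=0.5, baseline=0.1)
```

Under these assumptions the earlier detector receives the larger score, and a stream that never leaves the baseline scores zero, which is the qualitative behavior the abstract contrasts with ROC/AUC.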

2604.04992 2026-04-08 cs.CR cs.AI

FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

Daniel Kuznetsov, Ofir Cohen, Karin Shistik, Rami Puzis, Asaf Shabtai

2604.04990 2026-04-08 cs.SE cs.AI

Architecture Without Architects: How AI Coding Agents Shape Software Architecture

Phongsakon Mark Konrad, Tim Lukas Adam, Riccardo Terrenzi, Serkan Ayvaz

2604.04982 2026-04-08 cs.IR cs.AI cs.CL cs.LG

CURE: Circuit-Aware Unlearning for LLM-based Recommendation

Ziheng Chen, Jiali Cheng, Zezhong Fan, Hadi Amiri, Yunzhi Yao, Xiangguo Sun, Yang Zhang

2604.04979 2026-04-08 cs.SE cs.AI

Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents

Ádám Kovács

Comments: 7 pages

2604.04977 2026-04-08 cs.SE cs.CR cs.LG

Towards Predicting Multi-Vulnerability Attack Chains in Software Supply Chains from Software Bill of Materials Graphs

Laura Baird, Armin Moin

Comments: Accepted for the ACM International Conference on the Foundations of Software Engineering (FSE) 2026 Ideas, Visions and Reflections (IVR) Track