arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27494 2026-05-28 cs.CR cs.AI cs.CL cs.IR cs.LG

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Syed Huma Shah

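AI summary: Proposes GroundedCache, a routing method that validates whether a cached answer is safe to serve through four cheap gates (query similarity, retrieved-evidence overlap, source-version validity, and lexical support), substantially reducing the unsafe-served rate.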

Comments 19 pages, 9 figures, 10 tables. Code: https://github.com/syedhumarahim/grounded-cache-router

Abstract

Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when four cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR): the fraction of all queries that received a wrong cached answer. Across two datasets and 12,000 real-LLM generations (Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime (vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift (vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.
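The four-gate admission rule and the USR metric described in this abstract can be illustrated with a minimal Python sketch. This is not the released implementation: the `CacheEntry` fields, the token-overlap support score, the Jaccard overlap measure, and every threshold value are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    answer: str
    evidence_ids: frozenset  # ids of documents retrieved when the answer was cached
    source_versions: dict    # doc id -> version observed at caching time


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def lexical_support(answer, evidence_text):
    """Fraction of answer tokens that also occur in the fresh evidence (assumed scoring)."""
    tokens = answer.lower().split()
    evidence = set(evidence_text.lower().split())
    return sum(t in evidence for t in tokens) / len(tokens) if tokens else 0.0


def admit(entry, query_sim, fresh_ids, fresh_versions, fresh_text,
          sim_min=0.9, overlap_min=0.5, support_min=0.8):  # hypothetical thresholds
    """Serve the cached answer only if all four gates hold at once."""
    gates = (
        query_sim >= sim_min,                                              # query similarity
        jaccard(entry.evidence_ids, frozenset(fresh_ids)) >= overlap_min,  # evidence overlap
        # Source-version validity: every cached source must still be at the same version.
        all(fresh_versions.get(d) == v for d, v in entry.source_versions.items()),
        lexical_support(entry.answer, fresh_text) >= support_min,          # lexical support
    )
    return all(gates)


def unsafe_served_rate(served_from_cache, correct):
    """USR: fraction of all queries that received a wrong cached answer."""
    wrong_cached = sum(c and not ok for c, ok in zip(served_from_cache, correct))
    return wrong_cached / len(correct)
```

Because the gates are combined with a logical AND, a single failing gate (for example, a source document whose version changed) sends the query back to the normal RAG path rather than the cache.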

2605.27492 2026-05-28 cs.SE cs.AI

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Yipeng Ouyang, Xin Huang, Bingjie Liu, Zhongchun Zheng, Yuhao Gu, Xianwei Zhang

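AI summary: To address the inability of existing benchmarks to reflect the dynamic complexity of real production environments, proposes the RAMP framework, which combines a unified runtime assessment architecture, compiler-construction workloads, and multi-dimensional utility metrics to reveal performance degradation and resource-efficiency differences across models in long serial workflows.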

Comments 16 pages, 8 figures. Project homepage: http://ramp.yatcc-ai.com/

Abstract

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.

2605.27489 2026-05-28 cs.CR cs.AI cs.LG

HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

Md Hafizur Rahman, Zafaryab Haider, Tanzim Mahfuz, Prabuddha Chakraborty

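AI summary: Proposes HARP, a method that compares clean and perturbed execution traces to quantify how local perturbations in multi-agent LLM systems propagate into global harm, and evaluates different attacks and defenses on a finance-oriented seven-agent system.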

Comments 39 pages, 12 figures, 12 tables, and 1 algorithm

Abstract

Multi-agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves interpretability, but creates a propagation risk: a bounded perturbation to one component can be reused by other agents and amplified into system-level harm. We introduce HARP (Harm Amplification through Role Perturbation), a trace-first methodology for studying local-to-global harm amplification in multi-agent LLM systems. HARP compares paired clean and perturbed executions and records specialist outputs, tool calls, memory reads/writes, guard events, oracle logs, latency, token cost, and decisions. We define local harm as deviation from targeted agents or corrupted channels, global harm as deviation over the full trace, and harm amplification as $H_{\text{global}}/H_{\text{local}}$. This complements attack success rate with a measure of how strongly orchestration spreads harm beyond the attack point. We instantiate HARP in a finance-oriented seven-agent system with a deterministic decision gate and configurable attack harness for specialist compromise, collusion, shared-context corruption, and temporal or memory-persistent attacks. Across five defenses, prompt-only defenses preserve benign utility but leave high success and stealth; pre-tool and step-level guards reduce some failures with utility or latency costs; and IntegrityGuard, a trace-consistency defense, achieves the lowest attack success and global harm but introduces utility/cost trade-offs. Results show that single-specialist compromise produces the strongest amplification, shared-context corruption yields the highest attack success, and temporal persistence produces the largest malicious impact. HARP argues that secure multi-agent evaluation must measure not only bypass, but propagation.
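The harm-amplification ratio defined in this abstract can be illustrated with a small Python sketch over paired clean and perturbed traces. The abstract does not specify how deviation is measured, so the step-count deviation below, the trace layout (agent name mapped to a list of recorded outputs), and the agent names in the usage note are all assumptions.

```python
def count_deviations(clean, perturbed, agents):
    """Number of aligned trace steps whose recorded value differs (assumed deviation measure)."""
    return sum(c != p for a in agents for c, p in zip(clean[a], perturbed[a]))


def harm_amplification(clean, perturbed, targeted):
    """Ratio of global harm (whole trace) to local harm (targeted agents only)."""
    h_local = count_deviations(clean, perturbed, sorted(targeted))
    h_global = count_deviations(clean, perturbed, sorted(clean))
    if h_local == 0:
        return None  # ratio is undefined when the attack leaves no local footprint
    return h_global / h_local
```

As a usage example with hypothetical agents, if only `analyst` is attacked and one of its steps changes, but one `risk` step and the `gate` decision also change downstream, then local harm is 1, global harm is 3, and the ratio is 3.0.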

2605.27485 2026-05-28 cs.LO cs.LG cs.SE

Automating Formal Verification with Agent-Guided Tree Search

Leo Yao

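AI summary: Proposes agent-directed tree search with two orchestrators (state-based and context-based) for LLM-driven verified-code generation in Lean; an agentic-loop baseline with GPT-5.4 reaches 95.0% on the benchmark.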

Comments 78 pages, 8 figures

Abstract

Formal verification offers a path to provably correct software, but writing verified code remains expensive enough that the technique is rarely used in production. Recent large language models can accelerate this work, and recent benchmarks measure their ability to translate specifications into code and machine-checked proofs of correctness. This thesis evaluates the state of such LLM-driven verified-code generation ("vericoding") in Lean and develops search-based methods for improving verification performance. We first reproduce a subset of the vericoding-benchmark Lean leaderboard on a current cross-vendor model pool, finding that non-reasoning performance remains roughly steady on US closed-source models while open-weight models have slightly improved. We update the iterative methodology of vericoding-benchmark with an agentic loop equipped with mathlib search, finding that model performance greatly improves and scales with agent budget. GPT-5.4 nearly saturates the benchmark at 95.0% on 423 specs with $K=50$ LLM calls. We then design two agent-directed tree-search formulations: a state-based orchestrator that branches on partial-proof states, and a context-based orchestrator that branches on full subagent contexts. Compared against the agent baseline, the context-based design solves a wider range of intermediate-difficulty specs at lower token cost, while the agent baseline retains an advantage on the hardest specs, where uninterrupted iteration matters most. We conclude that search structure has selective advantages over a strong agent baseline, and that more challenging benchmarks drawn from modern code are important to measure and drive further progress in automated formal verification. Code available upon request by contacting the author at leoy@mit.edu.

2605.27477 2026-05-28 stat.ML cs.LG

Iterative Causal Discovery: Per-Edge Impossibility Certificates, Tier-Aware Oracle Queries, and the $1+K$ Lower Bound

Eichi Uehara

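AI summary: Proposes an iterative causal discovery protocol that attaches an impossibility certificate (a RESOLVED or IMPOSSIBLE code) to each candidate edge, adds five gated identifiability tiers (LSNM, IGCI, Stein, MDL, PEIT), and combines two oracle primitives (the meta-hub query and the node-children query) to establish an upper bound of $1+K$ expert interactions for recovering any DAG; under an ideal-oracle assumption the bound is met exactly on standard benchmarks.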

Comments Contains 10 figures and 5 tables

Abstract

Causal-discovery algorithms return a directed graph, yet provide no principled means of distinguishing edge directions identified by the data from those assigned without an identifying assumption. Under the standard Markov and faithfulness conditions, the observational distribution identifies only a Markov equivalence class; orientations within that class are not determined by the joint distribution and cannot be recovered from additional samples alone, but require either a functional restriction or an intervention. We introduce a protocol for observational causal discovery on continuous data that attaches to each candidate edge a discrete impossibility certificate: a RESOLVED code records the identifiability theorem under which the direction was committed, while an IMPOSSIBLE code records the failure mode together with the specific question a domain expert must answer to resolve it. The bivariate cascade is extended with five gated identifiability tiers (LSNM, IGCI, Stein, MDL, and PEIT) that abstain when their precondition test rejects. Two oracle primitives, the meta-hub query and the node-children query, jointly establish an upper bound of $1+K$ expert interactions sufficient to recover any DAG, where $K$ denotes the number of non-leaf vertices. Under an ideal-oracle assumption, the bound is met exactly on the asia, sachs, child, and alarm benchmarks.
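The $1+K$ count can be made concrete with a short Python sketch. The abstract does not spell out what each oracle primitive returns, so the simulation below rests on one plausible reading that is purely an assumption: a single meta-hub query names the non-leaf vertices, and one node-children query per non-leaf vertex lists its children. The DAG is a dictionary that maps every vertex to its list of children.

```python
def interaction_bound(dag):
    """1 + K, where K is the number of non-leaf vertices (vertices with at least one child)."""
    k = sum(1 for children in dag.values() if children)
    return 1 + k


def recover_with_ideal_oracle(dag):
    """Simulate recovery with an ideal oracle under the assumed query semantics."""
    queries = 1  # one meta-hub query, assumed to return every non-leaf vertex
    hubs = [v for v, children in dag.items() if children]
    recovered = {v: [] for v in dag}
    for v in hubs:  # one node-children query per non-leaf vertex
        recovered[v] = list(dag[v])
        queries += 1
    return recovered, queries
```

Under these assumptions the simulated query count equals `interaction_bound(dag)` for any input, which mirrors the statement that the bound is met exactly with an ideal oracle.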

2605.27473 2026-05-28 stat.ML cs.LG

Calibrated Inference for the Conditional Average Treatment Effect in the Few-Placebo Regime via Gaussian Processes

Eichi Uehara

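AI summary: For calibrated uncertainty in conditional average treatment effect estimation in the few-placebo regime, proposes GP-CATE, which models each arm's outcome surface directly with a Gaussian process and attains calibrated coverage.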

Comments 14 pages, 1 figure, 5 tables

Abstract

Estimating how much an intervention helps a given individual, the conditional average treatment effect (CATE), is increasingly central to decision-making in medicine, economics, and policy, where an estimate is most useful when accompanied by a calibrated uncertainty interval. We study the few-placebo regime, in which one treatment arm is much smaller than the other, as arises in unequal-allocation trials and small-holdout A/B tests. The standard estimator in this setting is the X-Learner, and a natural way to obtain credible intervals is to make its second stage Bayesian. We show that these intervals under-cover: they contain the true effect less often than their nominal level. We trace this to a structural cause: the X-Learner's regression target inherits the bias of a nuisance model fitted to the small arm, so the posterior is centered away from the true effect. We also find that the standard remedy, regressing an orthogonal doubly-robust score, is unreliable here, since the regime's limited overlap leaves the estimator either highly variable or, once stabilized, biased once more. Both consequences reflect a pattern that extends beyond causal inference: a separately estimated variance is attached to a point estimate of a hard-to-learn quantity, and the point estimate's bias is not captured by that variance. We propose GP-CATE, which models each arm's outcome surface with a Gaussian process, so the scarce arm's uncertainty enters the posterior directly rather than as an unmodelled bias. Across synthetic and semi-synthetic benchmarks, GP-CATE attains calibrated coverage where the estimators we compare against, including Causal Forest and BART, do not, at the cost of intervals that are appropriately wide when the data are uninformative.

2605.27472 2026-05-28 cs.AR cs.AI

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

Yuchao Wu, Wenji Fang, Jing Wang, Wenkai Li, Ziyan Guo, Zhiyao Xie

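AI summary: Proposes the AssertLLM2 benchmark of 83 real-world designs, which uses structured specifications, golden RTL, and mutated RTL to support two practical settings (bug-prevention and bug-hunting), with a rigorous evaluation framework covering syntactic validity, formal provability, coverage, and mutation-based bug detection.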

Abstract

Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. While Large Language Models (LLMs) show promise for automating this process, existing benchmarks remain limited by unrealistic task formulations, weak specification inputs, and oversimplified evaluation. To address these limitations, we introduce AssertLLM2, an open-source benchmark for realistic assertion generation in hardware verification. AssertLLM2 contains 83 real-world designs across 13 functional categories. For each design, the benchmark provides a structured design specification, a verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. These support two practical settings: bug-prevention, where assertions are generated from specifications to guard against design errors, and bug-hunting, where assertions are generated to expose discrepancies between intended behavior and faulty implementations. To the best of our knowledge, AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate bug-detection capability. AssertLLM2 further adopts a more rigorous evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection. Our benchmark enables a more realistic and extensive assessment of assertion generation and establishes rigorous baselines for state-of-the-art LLMs in practical hardware verification.

2605.27466 2026-05-28 cs.MA cs.AI cs.LG stat.ML

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Nicole Koenigstein

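AI summary: Proposes the AgensFlow framework, which treats multi-agent coordination as an online policy-learning problem and optimizes coordination through learned routing; on distributed-systems incident tasks and security-advisory tasks, learned routing outperforms a fixed pipeline baseline on coordination-heavy classes.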

Comments 7 pages, 4 figures, 4 tables. Code and reproducible evaluations available at: https://github.com/Nicolepcx/AgensFlow

Abstract

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one-off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open-source framework that treats multi-agent coordination as an online policy-learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed-systems incident tasks and security-advisory tasks. The evaluation shows three main results: learned routing reaches a higher-quality operating point than a fixed pipeline baseline on coordination-heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm-started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination-heavy multi-agent workflows over static wiring.

2605.27463 2026-05-28 stat.ME cs.AI stat.AP

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Hayden Helm, Carey Priebe

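AI summary: To address LLM sensitivity to prompt design in generative surveying, proposes a permutation test that remains valid under a statistical model with perturbation structure, and gives budget-allocation guidance.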

Abstract

Generative surveying -- where collections of LLM-based personas provide feedback on messages -- has emerged as a cheap and scalable alternative to traditional market research. However, LLMs are sensitive to small variations in prompt design, and conclusions drawn from generative surveys may depend on arbitrary phrasing choices. Controlling for this sensitivity requires including semantically equivalent perturbations in the analysis. In this paper, we show that standard hypothesis tests, including the sign test and Wilcoxon signed-rank test, are invalid under a statistical model for generative surveying that includes realistic perturbation structure. We propose a permutation test that is valid under this model and formally characterize the conditions under which standard tests fail. Applying our framework to a simple generative surveying problem, we estimate relevant parameters, characterize the power of the permutation test under realistic conditions, and provide practical guidance on budget allocation across personas, perturbations, and replicates. Finally, we show that both the magnitude and direction of the estimated effect are sensitive to the choice of model, even within the same model family.

2605.27450 2026-05-28 cs.IR cs.LG

Context Features Are Cheap: Rank-Aware Decomposition for Efficient Feature Interaction in Recommender Systems

Yevgeny Tkach

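AI summary: Proposes a rank-aware decomposition that moves context-only computation from once per candidate to once per request, substantially increasing the throughput of an industrial recommender system without changing model predictions.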

Abstract

Modern industrial recommender systems use a deep ranking model to score N candidates against the same user and context features. Standard implementations broadcast context features early in the forward pass, redundantly computing context-only operations N times per request. We present a rank-aware decomposition applicable to the dominant interaction mechanisms in modern recommender architectures: Factorization Machine (FM) pairwise products, Deep Cross Network (DCNv2) cross layers, self-attention, and fully connected (FC) projection layers. It is built on a single algebraic principle: any linear or bilinear operation over a rank-partitioned input admits an exact block decomposition that moves context-only computation from once-per-candidate to once-per-request, identity-equivalent to the original model. Closed-form analysis and controlled ablation verify that savings scale quadratically with the number of context features. Applied to a production DLRM-style ranker without any architectural change, the decomposition increases per-pod throughput by 87.5% (a 47% reduction in peak pod count) at identical model predictions. The identity-equivalent decomposition applies only at the first layer of cross networks and self-attention, since each layer mixes ranks in its output. To extend savings across depth, we further introduce rDCN, an architectural variant of DCNv2 that maintains rank discipline across depth and matches DCNv2 accuracy within training noise at 67% fewer total FLOPs, and sketch an analogous architectural variant for self-attention.
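The block-decomposition principle can be illustrated for the simplest case, a linear (FC) projection, with a minimal pure-Python sketch. It assumes the input to the layer is the concatenation of a context vector and a candidate vector, so the weight matrix splits column-wise into a context block and a candidate block. The function names are hypothetical, and the FM, DCNv2, and self-attention cases from the abstract are not covered here.

```python
def matvec(w, x):
    """Dense matrix-vector product over plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def score_naive(w, context, candidates):
    """Broadcast the context to every candidate, then project: N full products."""
    return [matvec(w, context + cand) for cand in candidates]


def score_decomposed(w, context, candidates):
    """Split W column-wise as [W_ctx | W_cand] and reuse W_ctx @ context across candidates."""
    n_ctx = len(context)
    w_ctx = [row[:n_ctx] for row in w]
    w_cand = [row[n_ctx:] for row in w]
    ctx_part = matvec(w_ctx, context)  # computed once per request
    return [
        [c + p for c, p in zip(ctx_part, matvec(w_cand, cand))]  # per-candidate work only
        for cand in candidates
    ]
```

Both functions compute the same sums of products, so the outputs agree (exactly for integer inputs, and up to floating-point summation order otherwise). The saving comes from paying for the context block once instead of once per candidate.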

2605.27449 2026-05-28 cs.IR cs.AI

Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

Zhongtian Hua, Yi Luo, Meijia Yu, Yingjie Han

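AI summary: Proposes DACLR, a dynamic adaptive contrastive learning method that improves the accuracy of multimodal evidence retrieval through event-level feature extraction, two-stage retrieval, and dynamic contrastive-loss optimization.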

Abstract

In multimodal fact checking, the accuracy of retrieving evidence from different modalities has a significant impact on the downstream claim verification process. Existing general multimodal retrieval methods are often built on semantic similarity, so the retrieved evidence can be similar to the claim without being relevant to it. This paper proposes a Dynamic Adaptive Contrastive Learning method for evidence Retrieval, called DACLR, to address these issues. DACLR first uses a Multimodal Large Language Model (MLLM) to uniformly convert multimodal evidence and claims into the text modality, and extracts features of this information at the event level. It then conducts evidence retrieval through a two-stage recall-rerank method. DACLR enhances the model's event perception in the retrieval stage by optimizing the contrastive loss and mining hard negative samples. Specifically, DACLR designs three loss functions at two levels (semantic and event) based on the InfoNCE loss, with three corresponding sets of hard negative sample candidates. The model dynamically adjusts the ratio based on the accuracy supervision signal of intra-batch samples, allowing it to learn the correlation between claims and positive samples at the event level without forgetting its semantic retrieval ability. Extensive comparison and ablation experiments demonstrate the effectiveness of DACLR and its internal optimization methods. Further research also proves the advantages of DACLR in the field of multimodal evidence retrieval.
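For readers unfamiliar with the InfoNCE loss that DACLR builds on, the following is a minimal pure-Python sketch of the standard single-positive form, plus a weighted combination of several loss terms. The abstract does not give DACLR's three specific losses or its rule for adjusting their ratio, so `combined_loss` simply takes the weights as an input; that function, the temperature default, and the example scores are illustrative assumptions.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.07):
    """Standard InfoNCE: -log( exp(s+/t) / (exp(s+/t) + sum_j exp(s_j/t)) )."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


def combined_loss(losses, weights):
    """Weighted average of loss terms; how the weights are chosen is left to the caller."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

A negative whose similarity score is close to the positive's (a hard negative) contributes more to the denominator and therefore raises the loss more than an easy negative does, which is why hard-negative mining sharpens the training signal.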

2605.27445 2026-05-28 cs.IR cs.AI

RAGe: A Retrieval-Augmented Generation Evaluation Framework

Larissa Guder, João Pedro de Moura, Arthur Accorsi, Gustavo Losch do Amaral, Maurício Cecílio Magnaguagno, Felipe Meneguzzi, Marcio Sorraglia Pinho, Dalvan Griebler

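AI summary: Proposes RAGe, a modular framework that uses resource telemetry and component recommendation to evaluate the trade-offs among accuracy, efficiency, and scalability in retrieval-augmented generation applications, supporting the selection of the best components for domain-specific datasets.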

Abstract

Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging due to high computational demands, outdated knowledge bases, and the need to manually select optimal pipeline components. In this work, we propose a modular framework for benchmarking and guiding the efficient development of RAG applications by focusing on resource telemetry and component recommendation, suggesting the best components for a domain-specific dataset. Our approach leverages core techniques in LLM applications, including document chunking, vector databases, embedding models, and retrievers, to evaluate trade-offs among accuracy, efficiency, and scalability. By directly correlating retrieval and generation quality with underlying hardware constraints, RAGe helps researchers identify the most effective domain-specific RAG setups for their operational needs, facilitating rapid prototyping even on consumer-grade hardware.

2605.27444 2026-05-28 cs.IR cs.AI

A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations

Ruben Belo, Marta Guimarães, Cláudia Soares

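AI summary: Systematically evaluates retrieval-augmented generation pipelines that combine large language models with information retrieval for extracting and synthesizing domain knowledge in space operations, comparing how retrieval strategies, embedding models, and LLM answers affect information accuracy, relevance, and reliability.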

Abstract

The rapid expansion of space activities has led to an unprecedented accumulation of technical documentation, operational guidelines, and scientific literature, creating challenges for timely decision-making in space operations. Effective management in space operations requires tools capable of efficiently processing vast and heterogeneous information sources. This paper systematically evaluates the performance of Retrieval Augmented Generation (RAG) pipelines, combining Large Language Models (LLMs) with information retrieval techniques for extracting and synthesizing actionable knowledge from domain-specific documents. We compare various retrieval strategies, embedding models, and LLM answers to assess their impact on information accuracy, relevance, and reliability. Our results demonstrate that RAG pipelines can significantly enhance knowledge access, reduce uncertainty, and support decision-making in complex space operations.

2605.27440 2026-05-28 cs.IR cs.AI

Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

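AI summary: Finds that small changes in how a buyer phrases a question (for example, "best CRM" vs "top CRM") lead AI assistants to recommend substantially different brands, with recommendation-set similarity (Jaccard) far below the same-prompt rerun baseline, challenging the validity of current AEO/GEO practice.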

Abstract

Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) -- both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand's "AI visibility" by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model's behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.
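The recommendation-set similarity reported in this abstract is the Jaccard index. A minimal Python sketch of it follows, together with a mean over all pairs of runs. The pairwise averaging helper and the placeholder brand names used in the example are assumptions for illustration and do not reproduce the paper's clustered estimation procedure.

```python
from itertools import combinations


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(recommendation_sets):
    """Average Jaccard similarity over all unordered pairs of recommendation sets."""
    pairs = list(combinations(recommendation_sets, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)
```

For example, two runs that recommend `{"BrandA", "BrandB", "BrandC"}` and `{"BrandB", "BrandC", "BrandD"}` share two of four distinct brands, giving a Jaccard similarity of 0.5, which is roughly the level the abstract reports for same-prompt reruns.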

2605.27439 2026-05-28 cs.IR cs.AI

Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit

检索增强的商业推荐中的显著性分层失败模式:一项37,000次运行的审计

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

AI总结 通过对37,000次生产运行进行审计,研究了AI助手在商业推荐中按品牌显著性层级分层的失败模式,发现不同层级品牌面临不同的挑战(如可见性、转化率、替代效应),并指出营销策略需根据品牌在显著性阶梯上的位置定制。

详情
AI中文摘要

像ChatGPT和Claude这样的AI助手是推荐引擎,而非搜索引擎:它们通过直接提名品牌来回答商业查询,而不是返回链接列表。因此,面向AI的营销是一个比“出现在搜索中”更广泛的问题——定位、内容和产品适配性与可发现性同样重要。我们对四种模型配置和215个商业框架提示(涵盖19个行业)进行了约37,000次生产运行审计,并对照一个包含533个品牌的参考目录(分为五个显著性层级:L1类别领导者到L5区域玩家)进行评估,该目录来自外部权威列表。这个阶梯代理了品牌在其行业内的认知度足迹,而非收入或市场份额。失败模式因层级而异。L1品牌几乎出现在所有相关检索中,但仅赢得25-41%的推荐位置——杠杆在于差异化,而非可见性。L2挑战者拥有所有层级中最高的转化率(37-52%),但在Anthropic模型上因角色中介的替代而失败。L3中端市场品牌是转折点:总覆盖率降至88%,转化率降至34-40%,角色效应达到峰值。L4专家和L5区域玩家面临灾难性的不可见性——48-52%从未在37,000次运行中出现。没有统一的优化方案能胜出;正确的营销投资取决于品牌在显著性阶梯上的位置。

英文摘要

AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating brands rather than returning a list of links. Marketing to AI is therefore a broader problem than "show up in search" -- positioning, content, and product fit matter as much as discoverability. We audit ~37,000 production runs across four model configurations and 215 commercially-framed prompts spanning 19 sectors, evaluated against a 533-brand reference catalog stratified into five prominence tiers (L1 category leaders to L5 regional players) sourced from external authority lists. The ladder proxies a brand's awareness footprint within its sector, not revenue or market share. The failure mode differs sharply by tier. L1 brands appear in nearly every relevant retrieval but win only 25-41% of the recommendation slots they reach -- the leverage is differentiation, not visibility. L2 challengers carry the highest conversion rates of any tier (37-52%) but lose to persona-mediated substitution on the Anthropic models. L3 mid-market brands are the inflection level: aggregate coverage drops to 88%, conversion to 34-40%, and persona effects peak. L4 specialists and L5 regional players face catastrophic invisibility -- 48-52% never surface in any of the 37,000 runs. No uniform optimization recipe wins; the right marketing investment depends on where the brand sits on the prominence ladder.

2605.27437 2026-05-28 cs.IR cs.AI

MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents

MGRetrieval: 面向长期对话代理的记忆引导反思检索

Tan Wang, Yunwei Dong

AI总结 提出MGRetrieval方法,通过记忆引导的反思检索构建精确检索路径,逐步积累关键记忆,提升长期对话代理中相关证据的检索效果。

详情
AI中文摘要

大型语言模型(LLMs)在对话方面取得了显著进展,但冗余的记忆上下文严重限制了它们在长期对话代理中的有效性。外部记忆系统已被提出以改善记忆维护。然而,这些系统主要依赖一次性检索,限制了它们检索充分且相关证据的能力。尽管最近的方法将反思引入检索,但其检索路径由LLM基于有限证据生成,导致检索不稳定和额外的延迟开销。为了解决这些限制,我们提出了MGRetrieval,一种将反思检索基于历史记忆语义结构的检索策略。具体来说,MGRetrieval包含两个步骤:(1)它参考历史记忆的结构来构建更精确的检索路径。(2)LLM保留关键记忆,并判断累积的记忆是否足以停止进一步的迭代检索。这使得检索过程能够遵循语义上有意义的路径。通过记忆引导检索和关键记忆传播,MGRetrieval逐步构建简洁且充分的记忆上下文。在LoCoMo上的大量实验表明,在Qwen2.5-14B和Qwen3-14B上,MGRetrieval在F1和BLEU-1上平均分别比最强基线高出8.91%和11.11%,同时保持实用的令牌和延迟成本。代码可在https://anonymous.4open.science/r/MGRetrieval找到。

英文摘要

Large Language Models (LLMs) have made significant progress in dialogue, yet redundant memory contexts severely limit their effectiveness in long-term dialogue agents. External memory systems have been proposed to improve memory maintenance. However, these systems mainly rely on one-shot retrieval, which limits their ability to retrieve sufficient and relevant evidence. Although recent methods introduce reflection into retrieval, their retrieval paths are generated by the LLM from limited evidence, leading to unstable retrieval and additional latency overhead. To address these limitations, we propose MGRetrieval, a retrieval strategy that grounds reflective retrieval in the semantic structure of historical memories. Specifically, MGRetrieval consists of two steps: (1) it references the structure of historical memories to construct a more precise retrieval path; (2) the LLM retains critical memories and determines whether accumulated memories are sufficient to stop further iterative retrieval. This allows the retrieval process to follow semantically meaningful paths. Through memory-guided retrieval and critical memory propagation, MGRetrieval gradually constructs concise and sufficient memory contexts. Extensive experiments on LoCoMo show that MGRetrieval outperforms the strongest baseline by 8.91% in F1 and 11.11% in BLEU-1 on average across Qwen2.5-14B and Qwen3-14B, while maintaining practical token and latency costs. The code can be found at https://anonymous.4open.science/r/MGRetrieval.

2605.27436 2026-05-28 cs.IR cs.AI cs.CV

RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?

RE-TRIANGLE:TRIANGLE 在检索中能否实现超越余弦相似度的多模态对齐?

Arijit Ghosh, Aritra Bandyopadhyay, Chiranjeev Bindra, Jingfen Qiao

AI总结 本文复现 TRIANGLE 框架,验证其通过最小化超球面上模态三元组面积实现多模态对齐的几何目标在检索任务中的鲁棒性,发现其在零样本设置下优于成对基线,但存在优化不稳定和领域依赖问题。

详情
AI中文摘要

多模态对齐对于弥合信息检索中的语义鸿沟至关重要。然而,传统的成对策略存在几何盲点:虽然它们将锚定模态(如文本)与其他模态对齐,但缺乏强制外围模态(如视频和音频)之间相互一致性的约束。TRIANGLE 框架通过最小化超球面上模态三元组的面积来实现整体对齐。在这项可重复性研究中,我们验证了该几何目标在检索任务中的鲁棒性。我们确认 TRIANGLE 在零样本设置下优于成对基线,Recall@1 提升高达 +8.7 个百分点,但收益依赖于领域。然而,我们未能复现报告中的从头学习结果。使用合成玩具数据集的分析表明,这是由于联合优化几何对齐与数据-文本匹配(DTM)损失时的不稳定性。此外,我们发现余弦正则化主要稳定文本到视频检索,而使用领域监督进行微调会放大几何收益但降低跨数据集泛化能力。我们的发现支持了几何对齐的有效性,同时突出了关键的优化敏感性。代码可在 https://github.com/ARIJIT00171/RE-TRIANGLE 获取。

英文摘要

Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero-shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain-dependent. However, we fail to reproduce the reported learning-from-scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data-Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text-to-video retrieval, and fine-tuning with domain supervision amplifies geometric benefits but reduces cross-dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at https://github.com/ARIJIT00171/RE-TRIANGLE.
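The geometric quantity described above can be illustrated with a small sketch. It assumes the area is that of the flat triangle whose vertices are three unit-normalized modality embeddings, computed from the Gram determinant of two edge vectors; the exact loss used by TRIANGLE may differ, and the function names are illustrative only.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def triangle_area(a, b, c):
    """Area of the triangle spanned by three unit-normalized embeddings.

    Uses area = 0.5 * sqrt(|u|^2 |v|^2 - (u.v)^2) with u = b - a, v = c - a,
    which holds in any embedding dimension.
    """
    a, b, c = normalize(a), normalize(b), normalize(c)
    u = [y - x for x, y in zip(a, b)]
    v = [y - x for x, y in zip(a, c)]
    gram = dot(u, u) * dot(v, v) - dot(u, v) ** 2
    return 0.5 * math.sqrt(max(gram, 0.0))  # clamp tiny negative rounding error

# Perfectly aligned text/video/audio embeddings give zero area.
print(triangle_area([1, 2, 3], [2, 4, 6], [3, 6, 9]))
# Mutually orthogonal embeddings give a large area.
print(triangle_area([1, 0, 0], [0, 1, 0], [0, 0, 1]))
```

Unlike a sum of anchor-to-modality cosine terms, this area shrinks only when all three embeddings agree, which is the mutual-consistency property the abstract attributes to the objective.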

2605.27435 2026-05-28 cs.AR cs.AI

When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference

当NPU并非总是更快:移动LLM推理的阶段级分析

Pu Li, Jiawen Qi, Qinyu Chen

AI总结 通过OPMASK控制管道分解方法,在CPU-NPU异构SoC上对移动LLM推理进行阶段感知的多级基准测试,发现Prefill阶段CPU比NPU快1.6倍,Decode阶段NPU仅提供1.05-1.2倍加速,且NPU卸载增加能耗高达51%。

详情
AI中文摘要

在移动设备上部署大型语言模型(LLM)越来越依赖于异构执行,然而,尚无先前研究在算子和管道级别系统性地描述NPU的有效性。我们首次在CPU-NPU异构SoC上对移动LLM推理进行了阶段感知的多级基准测试研究。我们引入了一种基于OPMASK的受控管道分解方法,该方法隔离了NPU执行路径中的通信、量化和计算开销。我们的结果揭示了反直觉的阶段级性能反转:在计算密集型的Prefill阶段,CPU性能优于NPU(高达1.6倍),而在内存受限的Decode阶段,NPU仅提供有限的加速(1.05-1.2倍)。我们进一步表明,调度开销和跨后端回退降低了NPU卸载的实际收益。在能耗趋势方面,增加NPU卸载会导致更高的能耗(高达51%)。基于这些发现,我们为面向设备上LLM推理的NPU架构师推导出了设计指南。

英文摘要

Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline level. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition methodology that isolates communication, quantization, and computation overheads within the NPU execution path. Our results reveal a counter-intuitive stage-level performance reversal: CPUs outperform NPUs in the compute-intensive Prefill stage (up to 1.6x), while NPUs provide only limited acceleration in the memory-bound Decode stage (1.05-1.2x). We further show that scheduling overhead and cross-backend fallback reduce the practical benefits of NPU offloading. For the energy trend, increasing NPU offloading leads to higher energy consumption (up to 51%). Based on these findings, we derive design guidelines for NPU architects targeting on-device LLM inference.

2605.27433 2026-05-28 cs.MA cs.AI

Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market

数据服务市场的异构多智能体建模与测量及网络分析

Deyu Zhou, Yuwei Guo, Xudong Lu, Linhao Zhang, Wei Guo, Lizhen Cui

AI总结 本文提出一种基于异构多智能体建模的数据服务市场测量与网络分析方法,通过引入服务生态系统理论明确参与者与外部因素,基于价值创造对三级实体进行效用测量,并设计分析框架评估异构网络对效用的影响。

详情
AI中文摘要

随着各种社会实体之间协作以及用户需求的日益复杂,影响数据服务市场稳定发展的因素也在增加。这些因素包括信息广泛传播增强主观意识、智能水平持续提升以及结构关系复杂化。为了实现数据服务市场的有效治理和监管,在做出监管决策之前进行仿真实验至关重要。然而,当前对数据服务市场的研究和分析主要集中在数据层面的性能,在涉及数据服务市场中多个异构实体的测量与分析以及各种社会要素的整合时显得不足。基于此,本文创新性地提出了一种基于异构多智能体建模的数据服务市场测量与网络分析方法。通过引入服务生态系统理论,我们明确了数据服务市场的参与者和外部因素,并基于价值创造对三级实体进行效用测量。此外,设计了一种分析方法来精确评估异构网络对效用的影响。最后,通过实验结果分析验证了所提方法的有效性。

英文摘要

With the increasing complexity of collaboration among various social entities and user demands, the factors affecting the stable development of the data service market are also growing. These factors include the widespread dissemination of information enhancing subjective consciousness, the continuous improvement in intelligence, and the complexification of structural relationships. To achieve effective governance and regulation of the data service market, it is crucial to conduct simulation experiments before making regulatory decisions. However, current research and analysis of the data service market primarily focus on data-level performance, proving inadequate when it comes to measurement and analysis of multiple heterogeneous entities and the integration of various social elements within the data service market. Based on this, this paper innovatively proposes a data service market measurement and network analysis method based on heterogeneous multi-agent modeling. By introducing the service ecosystem theory, we clarify the participants and external factors of the data service market and conduct utility measurements for three-level entities based on value creation. Furthermore, an analytical methodology is devised to precisely assess the influence of heterogeneous networks on utility. Finally, the paper verifies the effectiveness of the proposed method through the analysis of experimental results.

2605.27432 2026-05-28 cs.IR cs.AI

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

FD-RAG: 联邦双系统检索增强生成

Tianhao Gao, Kai Yang, Yiyang Li

AI总结 提出FD-RAG框架,通过解耦轻量级记忆访问与按需LLM推理,并利用语义感知自适应超图蒸馏为紧凑QA记忆,在联邦设置下实现高效、隐私保护的检索增强生成。

详情
AI中文摘要

检索增强生成(RAG)已成为将大型语言模型锚定于外部知识的范式,然而现有大多数RAG系统假设集中式知识访问和充足的计算资源。这些假设在边缘环境中失效,因为知识分散在设备上,原始数据无法共享,且重复调用LLM成本过高。我们提出FD-RAG,一种联邦双系统RAG框架,将轻量级记忆访问与按需LLM推理解耦,以实现去中心化部署。具体而言,FD-RAG在本地语料库上学习语义感知的自适应超图,并将其蒸馏为紧凑的QA记忆。在推理时,它通过直接记忆匹配回答覆盖良好的查询,仅在必要时调用基于LLM的推理,同时将检索到的记忆追溯至超图支撑的证据。为了缓解跨设备的知识碎片化,FD-RAG在不暴露原始文档的情况下聚合跨设备的匿名记忆。在QA基准上的实验表明,与强本地和联邦基线相比,FD-RAG将准确率提升高达7.8%,同时延迟降低8.4倍。我们还提供了理论分析,建立了所提出的超图学习的$\mathcal{O}(1/ε^{2})$收敛率,支持其在边缘环境中的可处理部署。

英文摘要

Retrieval-augmented generation (RAG) has emerged as a paradigm for grounding large language models in external knowledge, yet most existing RAG systems assume centralized knowledge access and ample computation. These assumptions break down in edge environments, where knowledge is fragmented across devices, raw data cannot be shared, and repeated LLM calls are prohibitively expensive. We propose FD-RAG, a federated dual-system RAG framework that decouples lightweight memory access from on-demand LLM reasoning for decentralized deployment. Specifically, FD-RAG learns semantic-aware adaptive hypergraphs over local corpora and distills them into compact QA memories. At inference time, it answers well-covered queries via direct memory matching and invokes LLM-based reasoning only when necessary, while tracing retrieved memories to hypergraph-grounded evidence. To mitigate cross-device knowledge fragmentation, FD-RAG aggregates anonymized memories across devices without exposing raw documents. Experiments on QA benchmarks show that FD-RAG improves accuracy by up to 7.8% while reducing latency by 8.4x compared with strong local and federated baselines. We also provide theoretical analysis establishing an $\mathcal{O}(1/ε^{2})$ convergence rate for the proposed hypergraph learning, supporting its tractable deployment in edge settings.
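The dual-system idea, answering from memory when a query is well covered and falling back to an LLM otherwise, can be sketched as below. Everything here is an assumption for illustration: token-set overlap stands in for the paper's memory matching, the `threshold` value is arbitrary, and `slow_path` is a hypothetical callable representing LLM-based reasoning.

```python
def token_overlap(q1, q2):
    """Token-set Jaccard overlap, a crude stand-in for semantic matching."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def answer(query, memory, slow_path, threshold=0.6):
    """Route a query: fast memory lookup if covered, otherwise the slow path.

    memory:    dict mapping stored questions to stored answers (QA memory).
    slow_path: callable invoked only when no stored question matches well.
    Returns (answer, route) where route is "memory" or "llm".
    """
    best_q, best_score = None, 0.0
    for stored_q in memory:
        score = token_overlap(query, stored_q)
        if score > best_score:
            best_q, best_score = stored_q, score
    if best_q is not None and best_score >= threshold:
        return memory[best_q], "memory"
    return slow_path(query), "llm"

qa_memory = {"what is the capital of france": "Paris"}
fallback = lambda q: "(LLM-generated answer)"

print(answer("What is the capital of France", qa_memory, fallback))
print(answer("who wrote hamlet", qa_memory, fallback))
```

The latency benefit in such a design comes from how often the first branch is taken, so the threshold trades speed against the risk of serving a mismatched stored answer.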

2605.27429 2026-05-28 cs.IR cs.AI

Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking

Ocean4Rec:离线LLM生成的OCEAN画像用于请求时VOD重排序

Wonkyun Kim, Sehyun Bae, Kwanki Ahn, Mungyu Bae, Saeun Choi, Soyeon You, Chandra Prabhakar, Sehyun Kim

AI总结 提出Ocean4Rec重排序层,利用LLM离线生成物品OCEAN画像,在请求时无需LLM调用,通过数值计算提升VOD推荐性能。

详情
AI中文摘要

工业视频点播(VOD)推荐系统需要更丰富的内容理解,但LLM作为重排序器的设计在每次请求中重复进行提示构建、令牌生成、模型调用、输出解析和回退处理。在高流量、延迟敏感的服务中,这些请求时操作使吞吐量规划、尾延迟控制、容量隔离和可预测运维复杂化。本文提出Ocean4Rec,一种重排序层,仅离线使用LLM从内容元数据中物化物品的OCEAN画像。物品被映射为开放性、尽责性、外向性、宜人性和神经质分数,而用户画像则通过最近点击和深度链接物品在同一五维空间中的时间衰减聚合构建。在请求时,Ocean4Rec连接预计算的物品画像、用户画像、基础推荐器分数和目录新鲜度,然后执行数值重排序,无需LLM调用。在匿名的三星智能电视VOD日志上,相同候选集的Top1000时间留出离线评估显示,对于NCF生成器,Ocean4Rec在NDCG@20上比更强的非OCEAN基础+新鲜度排序提升7.6%,对于LightGCN生成器提升61.5%。HR@20对于NCF不显著,对于LightGCN提升67.3%,反映了稀疏的精确物品回放标签和新鲜度作为工业基线的强度。该结果应被视为一种有界辅助内容品味特征的离线回放证据,该特征保留了无请求时LLM的服务路径的可部署性优势。

英文摘要

Industrial video-on-demand (VOD) recommenders need richer content understanding, but LLM-as-reranker designs repeat prompt construction, token generation, model invocation, output parsing, and fallback handling for each request. In high-volume latency-sensitive services, these request-time operations complicate throughput planning, tail-latency control, capacity isolation, and predictable operation. This paper presents Ocean4Rec, a reranking layer that uses an LLM only offline to materialize item OCEAN profiles from content metadata. Items are mapped into Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism scores, while user profiles are built by time-decayed aggregation of recently clicked and deep-linked items in the same five-dimensional space. At request time, Ocean4Rec joins precomputed item profiles, user profiles, base recommender scores, and catalog recency, then performs numeric reranking without an LLM call. On anonymized Samsung Smart TV VOD logs, same-candidate Top1000 temporal-holdout offline evaluations show that Ocean4Rec improves NDCG@20 over a stronger non-OCEAN Base+Recency ordering by 7.6% for an NCF generator and 61.5% for a LightGCN generator. HR@20 is inconclusive for NCF and improves by 67.3% for LightGCN, reflecting sparse exact-item replay labels and the strength of recency as an industrial baseline. The result should be read as offline replay evidence for a bounded auxiliary content-taste feature that preserves the deployability advantage of a request-time-LLM-free serving path.
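The request-time path described above, a time-decayed user profile joined with precomputed item profiles and blended numerically, might look like the following sketch. The exponential half-life decay, cosine affinity, linear blend, and all weights are assumptions chosen for illustration; the abstract does not specify these forms.

```python
import math

def user_profile(events, now, half_life_days=7.0):
    """Time-decayed mean of the OCEAN vectors of recently consumed items.

    events: list of (timestamp_in_days, five_dim_item_vector).
    """
    total_w, acc = 0.0, [0.0] * 5
    for t, vec in events:
        w = 0.5 ** ((now - t) / half_life_days)  # older events weigh less
        total_w += w
        for i, x in enumerate(vec):
            acc[i] += w * x
    return [x / total_w for x in acc] if total_w > 0 else acc

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def rerank(candidates, profile, item_profiles,
           w_base=1.0, w_taste=0.5, w_recency=0.2):
    """Numeric reranking with no LLM call.

    candidates: list of (item_id, base_score, recency_score).
    """
    scored = [
        (w_base * base + w_taste * cosine(profile, item_profiles[item])
         + w_recency * rec, item)
        for item, base, rec in candidates
    ]
    return [item for _, item in sorted(scored, key=lambda p: p[0], reverse=True)]

item_profiles = {"a": [1, 0, 0, 0, 0], "b": [0, 1, 0, 0, 0]}
events = [(0.0, item_profiles["a"]), (7.0, item_profiles["b"])]
profile = user_profile(events, now=7.0)
print(rerank([("a", 0.5, 0.0), ("b", 0.5, 0.0)], profile, item_profiles))
```

Because every input is a precomputed number, the serving cost is a few dot products per candidate, which is the deployability argument the abstract makes.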

2605.27418 2026-05-28 cs.MA cs.RO

Differentiable Model Predictive Safety for Heterogeneous Mobility at Urban Intersections

城市交叉口异构移动体的可微分模型预测安全

Wenzhe Song, Hao Zhang

AI总结 提出可微分模型预测安全(DMPS)框架,将模型预测控制的前瞻性嵌入数据驱动的端到端强化学习架构,通过可微分安全评价器实现精确在线安全校正,在高密度混合交通仿真中将碰撞率降至5.6%以下。

Comments 6 pages. Published in IEEE IARCE 2025

详情
Journal ref
2025 IEEE 5th International Conference on Industrial Automation, Robotics and Control Engineering (IARCE), Chongqing, China, 2025, pp. 1-6
AI中文摘要

自动驾驶车辆和移动机器人在城市环境中的即将集成对未来的智能交通系统提出了严峻的安全挑战。本文解决了在无信号交叉口协调具有不同动力学的异构智能体的复杂问题。我们引入了一种新颖的框架,称为可微分模型预测安全(DMPS),它将模型预测控制的前瞻性嵌入到数据驱动的端到端强化学习架构中。DMPS智能体学习一个潜在动力学模型,以预测依赖于其动作的未来轨迹。然后,一个学习到的可微分安全评价器评估这些轨迹的风险。关键的是,通过利用通过整个展开预测模型的反向传播,智能体可以高效地计算未来安全性相对于当前动作的梯度,从而实现最小且精确的在线安全校正。集成到多智能体训练方案中,DMPS在高密度混合车辆-机器人交通仿真中几乎消除了碰撞,碰撞率低于5.6%,在不牺牲能量和交通效率的情况下展示了最先进的安全性。

英文摘要

The imminent integration of autonomous vehicles and mobile robots in urban settings presents a critical safety challenge for future intelligent transportation systems. This paper addresses the complex problem of coordinating heterogeneous agents with disparate dynamics at unregulated intersections. We introduce a novel framework, differentiable model predictive safety (DMPS), which embeds the foresight of model-predictive control into a data-driven, end-to-end reinforcement learning architecture. DMPS agents learn a latent dynamics model to predict future trajectories contingent on their actions. A learned, differentiable safety critic then evaluates the risk of these trajectories. Crucially, by leveraging backpropagation through the entire unrolled predictive model, agents can efficiently compute the gradient of future safety with respect to their current action, enabling a minimal and precise online safety correction. Integrated into a multi-agent training scheme, DMPS reduces the collision rate to below 5.6% in high-density, mixed vehicle-robot traffic simulations, demonstrating state-of-the-art safety without compromising energy and traffic efficiency.

2605.27417 2026-05-28 quant-ph cs.AI cs.LG

Quantum Machine Learning-based 6G edge Network: Enabling Adaptive Communication and Model Aggregation

基于量子机器学习的6G边缘网络:实现自适应通信与模型聚合

Wenjing Xiao, Jiatai Yan, Chenglong Shi, Shixin Chen, Miaojiang Chen, Min Chen, Saif Al-Kuwari, Ahmed Farouk

AI总结 针对6G V2X通信中的高维状态空间、异构节点和动态环境挑战,提出量子增强框架,包含信道自适应语义通信、多模态融合、模型迁移和联邦聚合四个模块,利用量子卷积神经网络、量子注意力、量子强化学习和量子张量分解提升效率、泛化能力和隐私保护。

详情
AI中文摘要

随着第六代(6G)移动通信技术的到来,车联网(V2X)通信在通信效率、系统泛化能力和模型协作方面面临前所未有的挑战。传统机器学习在处理V2X系统中的高维状态空间、异构V2X节点下的慢收敛和弱泛化、快速变化的信道以及多模态感知数据时存在困难。为解决这些问题,我们提出了一种量子增强的V2X通信与模型聚合框架,旨在实现6G中高效、鲁棒和智能的交通,该框架包含四个模块:信道自适应语义通信模块、多模态融合模块、模型迁移模块和联邦聚合模块。具体而言,信道自适应语义通信模块利用量子卷积神经网络(CNN)和量子失真度量,实现高效传输和跨不同条件的强泛化能力。多模态融合模块利用量子注意力和纠缠来压缩特征并关联异构数据中的语义。模型迁移模块采用量子强化学习对决策过程进行建模,并提高动态环境中的适应性。联邦聚合模块将量子张量分解与基于反向传播的校正相结合,以低开销提供隐私保护并增强全局模型的鲁棒性。这项工作为未来6G智能交通中的通信与模型协作勾勒了一种新范式。

英文摘要

With the advent of sixth-generation (6G) mobile communication technology, vehicle-to-everything (V2X) communication faces unprecedented challenges in communication efficiency, system generalization capabilities, and model collaboration. Conventional machine learning struggles with high-dimensional state spaces, slow convergence, and poor generalization under heterogeneous V2X nodes, rapidly varying channels, and multimodal sensing data in V2X systems. To address these issues, we propose a quantum-enhanced framework for V2X communication and model aggregation that targets efficient, robust, and intelligent transportation in 6G, which includes four modules: the channel-adaptive semantic communication module, the multimodal fusion module, the model transfer module, and the federated aggregation module. Specifically, the channel-adaptive semantic communication module leverages quantum convolutional neural networks (CNN) and quantum distortion metrics to enable efficient transmission and strong generalization across diverse conditions. The multimodal fusion module exploits quantum attention and entanglement to compress features and associate semantics across heterogeneous data. The model transfer module employs quantum reinforcement learning to model decision-making and improve adaptability in dynamic environments. The federated aggregation module integrates quantum tensor decomposition with backpropagation-based corrections to provide privacy preservation with low overhead and to strengthen global model robustness. This work outlines a new paradigm for communication and model collaboration in future 6G intelligent transportation.

2605.27416 2026-05-28 quant-ph cs.AI cs.DC cs.LG

Can Quantum Federated Learning Withstand Circuit-Level Backdoors?

量子联邦学习能否抵御电路级后门攻击?

Aakar Mathur, Mohammed Ruknuddin, Ashish Gupta

AI总结 提出电路级后门威胁模型(CULT),通过量子感知机制(Grover、Pauli、Bit-flip、Sign-flip)实现四种隐蔽攻击,理论证明攻击的隐蔽性,实验表明单个恶意客户端即可导致FedAvg精度严重下降,现有防御无法消除最坏情况。

Comments Accepted to IJCAI-ECAI 2026

详情
AI中文摘要

量子联邦学习(QFL)继承了联邦优化对恶意客户端的核心脆弱性,同时也引入了来自变分电路训练和测量驱动梯度的攻击面。本文提出了一种新颖的电路级后门威胁(CULT)模型,该模型通过利用量子感知机制(包括Grover、Pauli、Bit-flip和Sign-flip)形式化了四种隐蔽攻击。通过使恶意客户端在训练中和训练后表面上均可发起攻击,这些攻击能够严重破坏学习过程。我们建立了严格的理论基础,以证明在标准平滑性假设下攻击的隐蔽性。在MNIST和CIFAR-10数据集上进行的实验,采用非独立同分布划分和不同比例的恶意客户端,结果表明,即使只有一个恶意客户端,在FedAvg聚合下也能导致严重的精度下降。虽然流行的防御方法(包括Krum、Multi-Krum、FoolsGold、FLGuardian和Mud-HoG)在许多情况下减少了精度下降,但它们未能消除最坏情况下的失败案例,其中精度下降高达50%。实验分析进一步揭示,在CULT模型下,恶意更新通过保持接近良性范数来有效掩盖其存在,从而帮助攻击者逃避检测。

英文摘要

Quantum Federated Learning (QFL) inherits the core vulnerability of federated optimization to malicious clients, while also introducing an attack surface from variational circuit training and measurement-driven gradients. This work proposes a novel CircUit-Level backdoor Threat (CULT) model that formalizes four stealthy attacks by exploiting quantum-aware mechanisms, including Grover, Pauli, Bit-flip, and Sign-flip. By letting malicious clients act on both in-training and post-training attack surfaces, these attacks can critically undermine the learning process. We establish a rigorous theoretical foundation to demonstrate attack stealthiness under standard smoothness assumptions. Experiments on the MNIST and CIFAR-10 datasets with non-IID splits and varying fractions of malicious clients show that even a single malicious client can induce severe accuracy degradation under FedAvg aggregation. While popular defenses, including Krum, Multi-Krum, FoolsGold, FLGuardian, and Mud-HoG, reduce degradation in many regimes, they fail to eliminate worst-case failures, where accuracy drops by up to 50%. The experimental analysis further reveals that under the CULT model, malicious updates effectively mask their presence by staying close to benign norms, thereby helping attackers evade detection.

2605.27413 2026-05-28 q-bio.BM cs.AI

Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design

配体条件离散扩散用于蛋白质序列-结构协同设计

Chen Wei, Fanding Xu, Minghao Sun, Zhiyuan Liu, Lin Wang, Tianrui Jia, Yihang Zhou, Yang Zhang

AI总结 提出配体条件离散扩散模型ProtLiD²,通过几何感知交叉注意力联合生成氨基酸序列和离散结构令牌,实现配体约束下的蛋白质序列-结构协同设计,显著提升全局折叠置信度和配体感知通过率。

Comments 19 pages, 6 figures

详情
AI中文摘要

蛋白质通过氨基酸序列编码的三维结构执行其生物学功能,而配体结合蛋白质的协同设计需要模型在明确的配体约束下生成序列-结构兼容的蛋白质。尽管连续扩散和基于流的模型支持在坐标或潜在空间中进行配体感知设计,但现有的离散扩散蛋白质语言模型主要操作于序列或结构令牌,缺乏直接的小分子条件。我们引入了ProtLiD$^2$,一个用于蛋白质序列-结构协同设计的蛋白质配体条件离散扩散模型。ProtLiD$^2$联合生成氨基酸序列和离散结构令牌,同时通过几何感知交叉注意力整合配体化学和几何信息。在超过一百万个配体-蛋白质复合物上训练后,ProtLiD$^2$将掩码离散扩散扩展到配体感知的功能性蛋白质设计。我们进一步提出了最大置信度边界引导的ReMask解码,这是一种推理时自校正策略,保留高置信度预测并重新掩码不确定的令牌。在整体蛋白质设计中,ProtLiD$^2$相比Complexa提高了全局折叠置信度,将TM-score从0.672提升至0.802,pLDDT从64.55提升至73.00。在口袋协同设计中,ProtLiD$^2$将活性位点BB-RMSD从FAIR/PocketGen的3.46/3.40Å降低至1.97Å,并在更严格的对接阈值下,将配体感知通过率从PocketGen的14.86%提升至59.73%,从6.08%提升至23.49%。这些结果支持配体条件离散扩散作为功能性蛋白质协同设计的有效令牌空间框架。代码将在https://github.com/auroua/ProtLiD提供。

英文摘要

Proteins perform their biological functions through three-dimensional structures encoded by amino acid sequences, and ligand-binding protein co-design requires models that generate sequence-structure compatible proteins under explicit ligand constraints. Although continuous diffusion and flow-based models support ligand-aware design in coordinate or latent spaces, existing discrete diffusion protein language models mainly operate over sequence or structure tokens without direct small-molecule conditioning. We introduce ProtLiD$^2$, a Protein Ligand-conditioned Discrete Diffusion model for protein sequence-structure co-design. ProtLiD$^2$ jointly generates amino-acid sequence and discrete structure tokens while incorporating ligand chemical and geometric information through geometry-aware cross-attention. Trained on over one million ligand-protein complexes, ProtLiD$^2$ extends masked discrete diffusion to ligand-aware functional protein design. We further propose maximum confidence-margin guided ReMask decoding, an inference-time self-correction strategy that retains confident predictions and remasks uncertain tokens. ProtLiD$^2$ improves global fold confidence over Complexa in whole-protein design, increasing TM-score from 0.672 to 0.802 and pLDDT from 64.55 to 73.00. In pocket co-design, ProtLiD$^2$ reduces active-site BB-RMSD from 3.46/3.40Å for FAIR/PocketGen to 1.97Å, and improves ligand-aware pass rates over PocketGen from 14.86% to 59.73% and from 6.08% to 23.49% under stricter docking thresholds. These results support ligand-conditioned discrete diffusion as an effective token-space framework for functional protein co-design. Code will be available at https://github.com/auroua/ProtLiD.

2605.27412 2026-05-28 cs.NE cs.AI cs.LG

Advancing Direct Training for Spiking Neural Networks with Circulate-Firing Neurons and Learnable Gradients

利用循环发放神经元和可学习梯度推进脉冲神经网络的直接训练

Feifan Zhou, Xiang Wei, Yang Liu, Qiang Yu

AI总结 提出一种包含循环发放神经元、逐时间步可学习代理梯度和正负平衡损失函数的直接训练算法,以提升脉冲神经网络的信息表示能力和梯度传播精度,在多个数据集上取得竞争性性能并泛化至Transformer架构。

详情
AI中文摘要

脉冲神经网络(SNN)因其节能特性而备受关注,但与人工神经网络(ANN)相比仍存在显著性能差距。这一差距源于至少两个关键限制:首先,传统脉冲神经元的信息表示能力有限,未能充分利用膜电位的丰富动态;其次,固定代理梯度(SG)函数在时间步上导致梯度传播不精确,阻碍了有效的直接训练。为了解决这两个挑战,我们提出了一种新的直接训练算法,包含三个核心创新:第一,一种循环发放脉冲神经元模型,通过更有效地利用膜电位来增强信息表示能力;第二,一种逐时间步可学习的代理梯度函数,能够在反向传播过程中实现精确的梯度估计;第三,一种正负平衡损失函数,以实现正负膜电位之间的平衡,进一步提升SNN性能。大量实验表明,我们的方法在多个数据集上取得了竞争性性能。我们的方法可以无缝泛化到先进的Transformer架构,始终优于现有方法。我们的工作强调了进一步利用SNN内在膜动力学以提升性能的有效性,从而为推进高性能脉冲神经架构开辟了新途径。

英文摘要

Spiking Neural Networks (SNNs) have emerged as a promising energy-efficient paradigm, yet a substantial performance gap persists compared to Artificial Neural Networks (ANNs). This gap stems from at least two key limitations: first, conventional spiking neurons offer limited information representation capacity, underutilizing the rich dynamics of membrane potentials; second, using a fixed surrogate gradient (SG) function across time steps leads to imprecise gradient propagation, impeding effective direct training. To address these two challenges, we propose a new direct training algorithm with three core innovations: first, a circulate-firing spiking neuron model that enhances information representation capacity by leveraging membrane potentials more effectively; second, a time-step-wise learnable surrogate gradient function, enabling accurate gradient estimation during backpropagation; third, a positive-negative balanced loss function to achieve equilibrium between positive and negative membrane potentials and further boost SNN performance. Extensive experiments demonstrate that our methods achieve competitive performance across multiple datasets. Our methods generalize seamlessly to advanced Transformer architectures, consistently outperforming existing methods. Our work highlights the effectiveness of further harnessing the intrinsic membrane dynamics of SNNs for performance improvement, and thus opens a new avenue for advancing high-performance spiking neural architectures.
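The notion of a surrogate gradient whose shape varies per time step can be sketched as follows. This assumes a sigmoid-derivative surrogate with one sharpness parameter per time step, which is a common choice and not necessarily the form used in the paper; the `alphas` values are placeholders for parameters that would be learned during training.

```python
import math

def spike(v, threshold=1.0):
    """Forward pass: non-differentiable Heaviside firing rule."""
    return 1.0 if v >= threshold else 0.0

def surrogate_grad(v, alpha, threshold=1.0):
    """Backward pass: derivative of sigmoid(alpha * (v - threshold)) w.r.t. v.

    Used in place of the Heaviside derivative, which is zero almost everywhere.
    A larger alpha gives a narrower, taller gradient window around the threshold.
    """
    s = 1.0 / (1.0 + math.exp(-alpha * (v - threshold)))
    return alpha * s * (1.0 - s)

# One sharpness parameter per time step (hypothetical values).
alphas = [2.0, 4.0, 8.0]
membrane_potentials = [0.8, 1.0, 1.3]

for t, (v, alpha) in enumerate(zip(membrane_potentials, alphas)):
    print(t, spike(v), round(surrogate_grad(v, alpha), 4))
```

With a single fixed `alpha`, every time step would share the same gradient window; making it a per-step parameter lets training adjust how much gradient flows at each step.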

2605.27411 2026-05-28 cs.NE cs.LG

Genetic algorithm vs. gradient descent for training a neural network architecture dedicated to low data regimes in small medical datasets

遗传算法与梯度下降在针对小医学数据集低数据量场景的神经网络架构训练中的比较

Amine Boukhari, Boglarka Ecsedi, Laszlo Papp, Mathieu Hatt

AI总结 针对DEBI-NN架构,比较遗传算法与梯度下降在分类任务中的性能,发现遗传算法在决策边界和分类准确率上显著优于梯度下降。

详情
AI中文摘要

目的/引言:距离编码生物形态信息神经网络(DEBI-NN)是一种最近提出的架构,其中连接权重由欧几里得空间中神经元之间的距离定义。与直接训练权重的经典神经网络相比,这种方法大幅减少了可训练参数的数量。DEBI-NN的训练过程基于遗传算法(GA),而非深度学习中最常用的优化算法梯度下降(GD)。我们旨在为DEBI-NN设计并实现一个GD学习器,并评估其与GA相比的性能。 材料与方法:我们设计了一种针对DEBI-NN的空间反向传播方案,并在分类任务中比较了GD和GA,使用了合成非线性“双月”数据集、两个临床医学影像放射组学数据集和一个胎儿心宫缩图数据集,样本量从n=85到n=2126。每个优化器通过针对每个数据集调整的超参数搜索进行调优。 结果:在所有实验中,GA始终产生更优的决策边界和分类性能(合成:100% vs 83%;DLBCL:83% vs 78%;HECKTOR:80% vs 67%;胎儿:81% vs 66%),而GD表现出不稳定性,未能完全捕捉DEBI-NN空间编码固有的非线性模式。神经元相互依赖导致的纠缠梯度限制了经典反向传播的有效性。 结论:这些发现凸显了基于梯度的方法在具有高度相互依赖空间参数的架构中的根本局限性,并确认了进化策略在训练DEBI-NN中的适用性。

英文摘要

Aim/Introduction: Distance-encoding biomorphic-informational neural network (DEBI-NN) is a recently proposed architecture in which connection weights are defined by the distances between neurons positioned in a Euclidean space. This approach drastically reduces the number of trainable parameters compared to classical neural networks in which weights are directly trained. The training process for DEBI-NN is based on a genetic algorithm (GA), rather than gradient descent (GD), which remains the prevailing optimization algorithm in deep learning. We aim to design and implement a GD learner for DEBI-NN and assess its performance compared to GA. Materials and Methods: We designed a spatial backpropagation scheme tailored to DEBI-NN and carried out a comparison between GD and GA for classification tasks, using a synthetic non-linear "two-moons" dataset, two clinical medical imaging radiomic datasets, and a fetal cardiotocography dataset, with sample sizes ranging from n=85 to n=2126. Each optimizer was tuned through targeted hyperparameter searches adapted to each dataset. Results: Across all experiments, GA consistently produced superior decision boundaries and classification performance (Synthetic: 100% vs 83%; DLBCL: 83% vs 78%; HECKTOR: 80% vs 67%; Fetal: 81% vs 66%), whereas GD exhibited instability and failed to fully capture the non-linear patterns inherent to DEBI-NN's spatial encoding. The entangled gradients resulting from neuron interdependencies limit the effectiveness of classical backpropagation. Conclusion: These findings highlight fundamental limitations of gradient-based methods in architectures with highly interdependent spatial parameters and confirm the suitability of evolutionary strategies for training DEBI-NN.

2605.27409 2026-05-28 cs.NE cs.AI cs.LG

STARS: Spike Tail-Aware Relational Synthesis for ANN-to-SNN Data-Free Knowledge Distillation

STARS: 面向ANN到SNN无数据知识蒸馏的尖峰尾部感知关系合成

Shuhan Ye, Yi Yu, Qixin Zhang, Hui Lu, Jiaming He, Qinggang Zhang, Li Shen, Xudong Jiang

AI总结 提出STARS方法,通过关系一致性对齐和尾部感知正则化增强BN引导的合成数据,解决SNN学生网络在无数据知识蒸馏中约束不足的问题,在多个数据集上提升性能。

详情
AI中文摘要

SNN有望实现高能效和低延迟推理,但其性能仍落后于ANN。ANN到SNN的知识蒸馏有助于缩小这一差距,但在实际部署中原始训练数据通常不可用。现有的无数据知识蒸馏(DFKD)方法通过匹配教师侧先验(尤其是BN统计量)来合成替代数据,但这些面向ANN的约束主要正则化均值和方差,因此对于响应依赖于阈值穿越动态的SNN学生网络而言,约束不足。本文提出尖峰尾部感知关系合成(STARS),一种用于ANN到SNN DFKD的即插即用方法,通过两个互补目标增强标准BN引导合成:关系一致性对齐(保持教师和学生之间的跨样本关系一致性)和尾部感知正则化(通过软超越教师导出阈值来正则化阈值相关的尾部概率)。这些目标共同生成合成批次,这些批次在保持教师有效性的同时,对SNN学生网络更具信息性。在CIFAR-10、CIFAR-100和Tiny-ImageNet上的多个ANN-SNN对实验表明,我们的方法一致改进了传统DFKD基线,甚至超过了若干KD方法,在CIFAR-10上提升高达4.6%,在CIFAR-100上提升高达6.7%,突显了在面向SNN的DFKD中,用关系约束和尾部感知约束补充BN匹配的重要性。

英文摘要

SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-to-SNN knowledge distillation helps narrow this gap, yet the original training data are often unavailable in practical deployment settings. Existing data-free knowledge distillation (DFKD) methods synthesize surrogate data by matching teacher-side priors, especially BN statistics, but these ANN-oriented constraints mainly regularize mean and variance and therefore remain under-constrained for SNN students whose responses depend on threshold-crossing dynamics. In this paper, we propose Spike Tail-Aware Relational Synthesis (STARS), a plug-and-play method for ANN-to-SNN DFKD that augments standard BN-guided synthesis with two complementary objectives: Relational Consistency Alignment, which preserves cross-sample relational consistency between teacher and student, and Tail-Aware Regularization, which regularizes threshold-relevant tail probabilities through soft exceedance over teacher-derived thresholds. Together, these objectives generate synthetic batches that remain teacher-valid while becoming more informative for SNN students. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet across multiple ANN-SNN pairs show that our method consistently improves conventional DFKD baselines and even surpasses several KD methods, with gains of up to 4.6% on CIFAR-10 and 6.7% on CIFAR-100, highlighting the importance of complementing BN matching with relational and tail-aware constraints in SNN-oriented DFKD.
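The "soft exceedance" idea can be illustrated with a smooth estimate of how often activations cross a threshold. This sketch assumes a sigmoid relaxation with a hypothetical `temperature` parameter; the abstract does not give the exact form STARS uses, so treat this as one plausible reading and not the paper's loss.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def soft_exceedance(activations, threshold, temperature=0.1):
    """Smooth estimate of P(activation > threshold).

    Replaces the hard indicator 1[a > threshold] with
    sigmoid((a - threshold) / temperature), which is differentiable in a.
    """
    return sum(_sigmoid((a - threshold) / temperature)
               for a in activations) / len(activations)

# Two batches with the same mean (0.5) but different tails.
narrow = [0.4, 0.5, 0.6, 0.5]
heavy_tail = [0.0, 0.0, 0.0, 2.0]

print(round(soft_exceedance(narrow, threshold=1.0), 4))
print(round(soft_exceedance(heavy_tail, threshold=1.0), 4))
```

The two batches share a mean, so a first-moment constraint alone cannot separate them, while the exceedance estimate differs sharply; this is the gap the abstract says tail-aware regularization is meant to close.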

2605.27408 2026-05-28 quant-ph cs.LG cs.NA math.NA

Neural Quantum Spectral Operator Learning for Solving Partial Differential Equations

神经量子谱算子学习求解偏微分方程

Chanyoung Kim, Myeonghwan Seong, Yujin Kim, Daniel K. Park, Youngjoon Hong

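AI summary Proposes NVQLS, the first hybrid quantum-classical unsupervised operator learning framework, which uses the Legendre-Galerkin weak formulation to resolve the VQLS sign ambiguity and introduces a neural embedding encoding, achieving high-accuracy solutions on 1D/2D parametric PDEs.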

Comments 31 pages (main 9 pages), 17 figures, 8 tables

Details
AI Chinese abstract

偏微分方程(PDE)是物理和工程系统建模的核心,但重复求解参数化PDE仍然计算成本高昂。算子学习能够实现快速代理推理,但通常需要由昂贵的高保真PDE求解器生成的大规模输入-输出配对数据集。无监督算子学习框架减轻了数据依赖性,但仍受计算瓶颈限制。为解决这一问题,我们提出了神经变分量子线性求解器(NVQLS),这是首个利用Legendre-Galerkin弱形式的混合量子-经典算子学习框架。我们关键性地解决了VQLS能量最小化中的符号歧义,防止了错误的解表示。此外,我们引入了神经嵌入,一种新颖的编码方案,将变化的强迫项和PDE系数映射到参数化量子电路表示中。这些结构创新在高效态制备方案下提供了理论计算复杂度优势,同时相比代表性经典基线实现了更优的精度。在1D和2D参数化PDE上,在不同边界条件下的验证表明,NVQLS能够同时处理变化输入,为量子增强算子学习提供了一种可扩展的无监督方法。

English abstract

Partial differential equations (PDEs) are central to modeling physical and engineering systems, but repeatedly solving parametric PDEs remains computationally expensive. Operator learning enables fast surrogate inference, yet typically requires large input-output paired datasets generated by costly high-fidelity PDE solvers. Unsupervised operator learning frameworks alleviate data dependency but remain hindered by computational bottlenecks. To address this, we propose Neural Variational Quantum Linear Solver (NVQLS), the first hybrid quantum-classical operator learning framework leveraging the Legendre-Galerkin weak formulation. We critically resolve the sign ambiguity in VQLS energy minimization, preventing erroneous solution representations. Additionally, we introduce a neural embedding, a novel encoding scheme to map varying forcings and PDE coefficients into parameterized quantum circuit representations. These structural innovations provide theoretical computational complexity advantages under efficient state preparation schemes, while achieving superior accuracy compared to a representative classical baseline. Validations on 1D and 2D parametric PDEs under diverse boundary conditions demonstrate NVQLS's capability to simultaneously process varying inputs, offering a scalable unsupervised approach to quantum-enhanced operator learning.

2605.27407 2026-05-28 cs.NE cs.AI cs.LG

Benchmarking Fairness in Spiking Neural Networks: Data Bias, Spurious Features, and Hardware Effects

脉冲神经网络中的公平性基准测试:数据偏差、虚假特征和硬件效应

Hudi He, Fukun Wang, Zhe Wang, Xinyi Wang, Shuhan Ye, Jiarui Liu, Qing Qing, Ziqi Xu, Xikun Zhang, Renqiang Luo

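AI summary Presents the first fairness benchmark for spiking neural networks, introducing three dimensions of realism (demographic coverage gaps, spurious feature leakage, and deployment-environment mismatch) to systematically evaluate the fairness-performance trade-offs of 12 state-of-the-art SNNs under resource constraints.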

Details
AI Chinese abstract

评估脉冲神经网络(SNN)的公平性需要反映现实世界复杂性的严格基准,然而现有评估仍受限于肤浅的数据集多样性和理想化的硬件假设。本文首次引入SNN的系统性公平性基准,解决三个关键的现实维度:(1)训练数据中的人口统计覆盖缺口,(2)虚假特征泄漏(例如,肤色作为类别标签的代理),以及(3)部署环境不匹配(例如,具有受限脉冲编码的边缘设备)。我们的框架整合了四个跨人口统计数据集(带有受控偏差注入)和三个神经形态硬件模拟器(Loihi 2、SpiNNaker),从而能够在资源约束下隔离分析公平性-性能权衡。对12种最先进SNN的标准化评估揭示了显著差异:在偏差数据上训练的模型对代表性不足群体的假阳性率高出23%,而硬件限制(例如,降低的脉冲精度)在边缘部署中进一步将准确率差距放大至41%。关键的是,为云端SNN开发的偏差缓解策略在资源约束下通常会退化,这凸显了需要联合优化公平性和硬件效率的协同设计原则。通过连接算法公平性研究与神经形态工程,我们的基准为医疗和自主系统等社会关键应用中的可信SNN奠定了基础。我们的代码可在以下网址获取:https://anonymous.4open.science/r/SNN-Benchmarks-8017。

English abstract

Evaluating fairness in Spiking Neural Networks (SNNs) demands rigorous benchmarks that reflect real-world complexities, yet existing assessments remain limited by superficial dataset diversity and idealized hardware assumptions. This work introduces the first systematic fairness benchmark for SNNs, addressing three critical dimensions of realism: (1) demographic coverage gaps in training data, (2) spurious feature leakage (e.g., skin tone as a proxy for class labels), and (3) deployment-environment mismatches (e.g., edge devices with constrained spike encoding). Our framework integrates four cross-demographic datasets with controlled bias injections and three neuromorphic hardware simulators (Loihi 2, SpiNNaker), enabling isolated analysis of fairness-performance trade-offs under resource constraints. Standardized evaluations of 12 state-of-the-art SNNs reveal stark disparities: models trained on biased data exhibit 23% higher false positive rates for underrepresented groups, while hardware limitations (e.g., reduced spike precision) further amplify accuracy gaps by up to 41% in edge deployments. Critically, bias mitigation strategies developed for cloud-based SNNs often degrade under resource constraints, highlighting the need for co-design principles that jointly optimize fairness and hardware efficiency. By bridging algorithmic fairness research with neuromorphic engineering, our benchmark provides a foundation for trustworthy SNNs in socially critical applications such as healthcare and autonomous systems. Our code is available at: https://anonymous.4open.science/r/SNN-Benchmarks-8017.