arXivDaily: arXiv daily academic digest, updated Monday through Friday
cs.DB Databases (15)
2606.12387 2026-06-11 cs.DB cs.AI New submission

TAHOE: Text-to-SQL with Automated Hint Optimization from Experience

Zhiyi Chen, Jie Song, Peng Li

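AI summary: Proposes the TAHOE system, which uses an error-driven hint learning pipeline to turn debugging traces into a structured Hint Bank and adds a Strategy Layer that models user intent, substantially improving Text-to-SQL performance on Spider 2.0-Snow without updating model parameters.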

Abstract

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development and Deployment to consolidate debugging traces into a structured Hint Bank. Compiler feedback is distilled into reusable Syntax Hints for dialect-specific rules, while execution and user feedback are converted into Semantic Hints for schema- and user-specific logic. Tahoe further introduces a Strategy Layer that models conflicting user intents as competing strategies under shared natural-language triggers, with recency signals and post-learning attribution statistics that summarize empirical success, harm, inertness, and support. At inference time, Tahoe retrieves relevant hints and guides the LLM through Logic Planning followed by SQL Synthesis. We implement and evaluate the development-phase workflow, leaving deployment-time human-feedback updates for future work. On Spider 2.0-Snow, Tahoe substantially improves Text-to-SQL without updating model parameters. On 113 supervised Spider 2.0-Snow-0212 examples using GPT-5.5, Tahoe raises pass rate from 61.95 percent to 79.42 percent and pass-at-4 from 72.57 percent to 87.61 percent, achieves 100 percent Snowflake syntax pass rate, and reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate. The same Hint Bank also transfers to weaker backbones, including a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite.

2606.11946 2026-06-11 cs.DB cs.CC cs.LG cs.LO New submission

Neuro-Relational Programs: Unifying Queries and Neural Computation over Structured Data

Arie Soeteman, Balder ten Cate, Maurice Funk, Benny Kimelfeld, Carsten Lutz, Moritz Schönherr

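AI summary: Proposes Neuro-Relational Programs (NRPs), a declarative query language that extends Datalog rules with embedding operations to fuse relational reasoning with learnable neural components, enabling general neural computation over relational data.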

Comments
37 pages
Abstract

The conventional approach to deep learning over relational databases applies neural models, such as Graph Neural Networks (GNNs), to a graph representation of the database. Recent approaches instead operate on databases directly, associating tuples with embeddings and extending query mechanisms to jointly process embeddings and relational content. Inspired by these developments, we introduce Neuro-Relational Programs (NRPs), a declarative query language for relational databases whose facts carry numeric vector embeddings. NRPs extend Datalog-style rules with operations that combine, aggregate, and transform embeddings, thereby interleaving relational reasoning and learnable neural components within a single formalism. This yields a general approach to neural computation over relational data: an NRP can be read both as a query plan with trainable components and as a neural architecture with relational structure built in. Natural syntactic fragments of NRPs recover existing architectures and query formalisms. Zero-ary NRPs correspond to non-adaptive query algorithms; monadic NRPs generalize GNN-style message passing and precisely capture Deep Homomorphism Networks, a connection that we extend to frontier-guarded NRPs over databases with row-ids. We characterize the expressive power of unrestricted NRPs with ReLU-FFN transformations by FOCQ, an extension of first-order logic with counting interpreted over real-weighted structures, yielding a precise connection with uniform TC$^0$ over ordered databases. Together, these results establish NRPs as a broad declarative framework for querying and neural computation over relational data.
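
To make the GNN-style message passing that monadic NRPs are said to generalize more concrete, here is a minimal Python sketch of one round of sum aggregation followed by a ReLU transformation. It is only an illustration under assumptions: the function names, the dictionary-based representation of facts, and the two weight matrices `W_self` and `W_nbr` are hypothetical and are not NRP syntax from the paper.

```python
def relu(vec):
    """Element-wise ReLU on a plain Python list."""
    return [max(0.0, a) for a in vec]


def matvec(W, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, vec)) for row in W]


def message_passing_round(node_emb, edges, W_self, W_nbr):
    """One round: each node sums its neighbors' embeddings, then applies
    ReLU(W_self * own + W_nbr * aggregated). All names are hypothetical."""
    dim = len(next(iter(node_emb.values())))
    agg = {x: [0.0] * dim for x in node_emb}
    for x, y in edges:  # fact Edge(x, y): y sends its embedding to x
        agg[x] = [a + b for a, b in zip(agg[x], node_emb[y])]
    out = {}
    for x, emb in node_emb.items():
        own = matvec(W_self, emb)
        nbr = matvec(W_nbr, agg[x])
        out[x] = relu([s + n for s, n in zip(own, nbr)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
emb = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [-1.0, -1.0]}
edges = [("a", "b"), ("a", "c"), ("b", "a")]
updated = message_passing_round(emb, edges, identity, identity)
```

In a learned setting the two matrices would be trainable parameters; identity matrices are used here only to keep the arithmetic easy to follow.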

2606.11789 2026-06-11 cs.DB New submission

Efficient Graph Indexing for Interval-Aware Vector Search

Siyuan Liang, Ziqi Yin, Qi Zhang, Ronghua Li, Guoren Wang, Kaiwen Xue, Daiyin Wang, Xubin Li

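AI summary: Proposes the Unified Interval-aware Relative Neighborhood Graph (URNG), which supports multiple interval-aware ANN query semantics, and develops UG, a practical graph index that achieves efficient search through unified pruning and iterative repair.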

Comments
14 pages, 13 figures. Preprint version
Abstract

Interval-aware Approximate Nearest Neighbor (ANN) search arises in applications where each object is associated with a numeric value or interval, and queries must satisfy both vector-similarity and interval constraints. Existing methods are typically tailored to a single query semantics, such as interval-filtered ANN search, and therefore require multiple specialized indexes to support diverse workloads, leading to substantial indexing and memory overhead. To address this limitation, we propose the Unified Interval-aware Relative Neighborhood Graph (URNG), a unified graph framework for interval-aware ANN search. URNG preserves the monotonic searchability of relative-neighborhood-graph based ANN indexes while additionally ensuring structural heredity over query-induced subgraphs, enabling a single index to support multiple interval-aware query semantics. Building on this framework, we develop UG, a practical graph index that efficiently approximates URNG through unified interval-aware pruning and iterative repair, together with a query algorithm for interval-aware ANN search. Extensive experiments on 5 datasets show that UG consistently achieves a strong accuracy-efficiency trade-off across diverse interval-aware workloads while maintaining competitive index construction cost and memory usage.

2606.11760 2026-06-11 cs.DS cs.CR cs.DB New submission

A Fast Gaussian Mechanism under Continual Observation, with Applications

Rasmus Pagh, Sia Sejer

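AI summary: For private vector release under continual updates, proposes a constant-time sampling method based on Brownian bridges for fast Gaussian noise generation, and applies it to differentially private CountSketches to improve orthogonal range counting queries and join size estimation.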

Abstract

We consider the problem of privately releasing a $k$-dimensional vector under updates: Starting with a zero vector, at times $t_1, t_2,\dots$ the vector is updated by adding $x^{(1)}, x^{(2)},\dots$, respectively. For positive integers $T$, $k$ we model the updates as a data set $\{(t_i, x^{(i)})\}_i$, where $t_i \in [T]$ and $x^{(i)} \in B_k$ (the $k$-dimensional unit ball). Two such data sets are said to be neighboring if their symmetric difference has size at most $1$. The continual release consists of the sum $A^{(t)} = \sum_{i \;: \; t_i \leq t} x^{(i)}$ for each time step $t=1,\dots,T$. Classical continual release techniques allow us to release an approximation of $A^{(1)},\dots,A^{(T)}$ with additive noise of magnitude $\text{polylog}(T)$, computed in time $O(kT)$, even in the on-line, adaptive case where data is continually revealed for the current time step. Motivated by private sketching techniques, we consider the setting where only a \emph{subset} of entries in $A^{(t)}$ need to be released at time step $t$. Our new result is that it is possible to sample any desired entry in a given noise vector in \emph{constant time} while reproducing exactly the distribution of the binary tree mechanism with Gaussian noise. The improvement on the known time bound of $O(\log T)$ comes from a new data structure that allows us to sample a new noise value with the correct correlations in constant time using Brownian bridges. We present two data management applications, of independent interest, that use our technique in conjunction with differentially private CountSketches: 1) A dynamic data structure for orthogonal range counting queries with a better privacy/accuracy/space trade-off than previous data structures, and 2) Join size estimation, where in addition we show improved high-probability bounds.
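
For readers unfamiliar with the baseline, the following Python sketch shows the classic binary tree mechanism with Gaussian noise in the scalar case, where the noise for a prefix sum $A^{(t)}$ is assembled from one node per dyadic block covering $[1, t]$, taking $O(\log T)$ time per query. It is not the constant-time Brownian-bridge data structure of the paper; the class name, the lazy noise table, and the `sigma` parameter are assumptions made for illustration, and calibrating `sigma` to a privacy budget is left out.

```python
import random


class GaussianTreeMechanism:
    """Baseline binary tree mechanism (scalar case), O(log T) per query.

    Hypothetical illustration: each tree node (level, index) holds one
    Gaussian noise value, sampled lazily and then reused.
    """

    def __init__(self, T, sigma, seed=None):
        self.T = T
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.noise = {}  # (level, index) -> noise value

    def _node_noise(self, level, index):
        key = (level, index)
        if key not in self.noise:
            self.noise[key] = self.rng.gauss(0.0, self.sigma)
        return self.noise[key]

    def prefix_noise(self, t):
        """Noise for the prefix sum up to time t: one node per set bit of t."""
        if not 1 <= t <= self.T:
            raise ValueError("t must lie in [1, T]")
        total, level = 0.0, 0
        while t > 0:
            if t & 1:
                # Block number (t - 1) at this level ends the current prefix.
                total += self._node_noise(level, t - 1)
            t >>= 1
            level += 1
        return total


mech = GaussianTreeMechanism(T=16, sigma=1.0, seed=0)
n6 = mech.prefix_noise(6)  # uses blocks [1,4] and [5,6]
n7 = mech.prefix_noise(7)  # reuses both and adds block [7,7]
```

Because node values are reused across queries, noise values at different time steps are correlated, which is the distribution that a faster sampler has to reproduce.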

2606.11582 2026-06-11 cs.DB cs.SI New submission

Querying Cohesive Subgraph regarding Span-Constrained Triangles on Temporal Graphs with Dynamic Index Maintenance

Chuhan Hu, Ming Zhong, Lei Li

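AI summary: Proposes the $(k,\delta)$-truss on temporal graphs, which requires triangles to exist within short time windows, and designs index-based methods for efficient querying and dynamic maintenance, with compression ratios of up to $10^{-4}$ and queries 2 to 4 orders of magnitude faster.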

Abstract

Recent advances in temporal graph research have redefined traditional static graph concepts such as triangles, motifs, and $k$-cores. Inspired by this, we introduce a novel $(k,\delta)$-truss for temporal graphs, requiring triangles to exist within sufficiently short time windows. The $(k,\delta)$-truss ensures both static and temporal cohesion, while the original $k$-truss is a special case when $\delta = \infty$. To address $(k,\delta)$-truss queries, we propose index-free and index-based approaches. Utilizing the dual containment relation of $(k,\delta)$-trusses, our indexes losslessly compress all $(k,\delta)$-trusses into map or tree structures, significantly reducing space while enabling optimal-time retrieval. To scale to large temporal graphs, we develop two index construction algorithms based on truss decomposition and truss maintenance, respectively, which substantially reduce redundant computations. Moreover, we present techniques for the dynamic maintenance of the proposed indexes. The experimental results demonstrate that index-based approaches process queries in interactive time and outperform the index-free approach by 2$\sim$4 orders of magnitude, while the indexes achieve compression ratios of up to $10^{-4}$ and can be updated efficiently without rebuilding from scratch.
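
As a rough illustration of what a span-constrained triangle could look like, the Python sketch below finds triangles for which one timestamp per side can be chosen so that the three timestamps span at most $\delta$. The abstract does not give the formal definition, so this reading, the function name, and the `(u, v, t)` edge format are all assumptions; the brute-force enumeration is meant for small examples only.

```python
from collections import defaultdict
from itertools import combinations, product


def span_constrained_triangles(temporal_edges, delta):
    """Return triangles whose three sides have timestamps within a span of delta.

    Assumed (hypothetical) reading: a triangle qualifies if one timestamp can
    be picked per side such that max - min <= delta.
    """
    times = defaultdict(list)  # undirected edge -> list of timestamps
    for u, v, t in temporal_edges:
        if u != v:
            times[frozenset((u, v))].append(t)

    adj = defaultdict(set)
    for edge in times:
        u, v = tuple(edge)
        adj[u].add(v)
        adj[v].add(u)

    found = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w not in adj[v]:
                continue
            sides = (
                times[frozenset((u, v))],
                times[frozenset((u, w))],
                times[frozenset((v, w))],
            )
            # Try every combination of one timestamp per side.
            if any(max(c) - min(c) <= delta for c in product(*sides)):
                found.add(frozenset((u, v, w)))
    return found


edges = [
    ("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("a", "c", 10),
    ("c", "d", 4), ("b", "d", 50),
]
tight = span_constrained_triangles(edges, 2)
static = span_constrained_triangles(edges, float("inf"))
```

With `delta` set to infinity every static triangle qualifies, mirroring the statement that the original $k$-truss is the special case $\delta = \infty$.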

2606.11560 2026-06-11 cs.DB cs.AI New submission

LLMs+Graphs: Toward Graph-Native, Synergistic AI Systems

Arijit Khan, Longxu Sun, Xin Huang

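AI summary: Surveys three synergies between large language models and graph computation (augmented reasoning, bidirectional integration with knowledge graphs, and AI agents strengthened by graph algorithms), discusses new capabilities for graph data management and graph machine learning, and aims to offer a unified perspective for building next-generation graph-native AI systems.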

Comments
10 pages, Accepted at PAKDD 2026 Tutorial
Abstract

Large Language Models (LLMs) have advanced rapidly, but their limitations in structured and multi-hop reasoning underscore the need for graph-native, synergistic artificial intelligence (AI) systems. Graph-structured data underpins critical applications across social, biological, financial, transportation, web, and knowledge domains, making it essential to understand how LLMs can leverage graph computation for grounded, context-rich inference. Three complementary synergies are emerging: LLMs augmented with graph computation for retrieval and reasoning; bidirectional integration between LLMs and knowledge graphs (KGs), where LLMs support KG construction and curation while KGs enforce semantic constraints and factual consistency; and AI agents strengthened by graph algorithms for planning, decision making, and multi-step reasoning. In parallel, LLMs introduce new capabilities for graph data management and graph machine learning (ML) through natural language interfaces and hybrid LLM-graph neural network (GNN) pipelines. This tutorial synthesizes the algorithms, systems, and design principles driving these converging directions, offering data science and data mining researchers a unified perspective on integrating LLMs, graph data management, graph mining, graph ML, and agentic computation into next-generation graph-native AI systems.

2606.11235 2026-06-11 cs.LG cs.DB stat.ME New submission

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

Leonardo Pellegrina, Fabio Vandin

Affiliation: Department of Information Engineering, University of Padova

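AI summary: Proposes FewRS, a resampling-based method for assessing the statistical significance of data mining results; by deriving a new bound on the supremum deviation, it guarantees the false discovery probability with only a very small number of resampled datasets, greatly improving scalability.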

Comments
Accepted to KDD 2026
Abstract

A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound to the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach results in a reduction of up to two orders of magnitude in running time compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.
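
To see why conventional resampling needs so many datasets, consider the standard Monte Carlo p-value estimate sketched below in Python. This is the textbook baseline, not FewRS; the function name and the `resample_stat` callback (which should draw one resampled dataset and return its test statistic) are hypothetical.

```python
import random


def resampling_p_value(observed_stat, resample_stat, num_resamples, seed=0):
    """Standard Monte Carlo p-value: (r + 1) / (n + 1).

    r counts resampled statistics at least as extreme as the observed one.
    resample_stat(rng) is a hypothetical callback supplied by the caller.
    """
    rng = random.Random(seed)
    at_least_as_extreme = sum(
        1 for _ in range(num_resamples) if resample_stat(rng) >= observed_stat
    )
    return (at_least_as_extreme + 1) / (num_resamples + 1)


# Even when no resample reaches the observed statistic, the estimate cannot
# fall below 1 / (n + 1), so small thresholds force large n.
best_case = resampling_p_value(1.0, lambda rng: 0.0, num_resamples=999)
```

With 999 resamples the smallest attainable value is 0.001, so any threshold below that (for example after a multiple-testing correction) requires generating and analyzing more datasets.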

2606.09824 2026-06-11 cs.DB Revised version

TSseek: Regular Expression-Based Similarity Search for Distributed Time Series Datasets

Xiaoshuai Li, Khalid Alnuaim, Mohamed Y. Eltabakh, Elke A. Rundensteiner

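AI summary: Proposes the TSseek framework, whose regular-expression query language supports searches over trends, value ranges, and wildcard patterns, and builds TSseek-X, a distributed spatial index, for efficient exact matching.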

Comments
Extended version with full ablation studies and additional experiments
Abstract

Similarity search is a fundamental operation in time series analysis. Most existing techniques, however, require users to supply a precise sequence of values (typically an entire time series object) as the query input. This rigid requirement limits real-world applications, where users instead want to express patterns, trends, or value ranges. Flexible, pattern-based search has been explored in text retrieval and complex event processing, but remains underexplored for large-scale distributed time series. To close this gap, we propose TSseek, a regular-expression-powered search framework for distributed time series datasets. TSseek's query language enables users to compose patterns encompassing trends, value ranges, and wildcard segments. We show that conventional approximation techniques (e.g., PAA and SAX) and their index structures are ill-suited for such queries because they cannot operate on regular-expression query constructs. In TSseek, we map the time series objects and the query constructs into the same space by approximating time series objects as sequences of line segments that retain both trend (slope direction) and value range, and translating query constructs into bounding rectangles. To support efficient processing, we build TSseek-X, a distributed spatial index over the time series segments. TSseek supports two fundamental query types, namely whole-matching queries (over entire series) and subsequence-matching queries (over arbitrary windows within a series). Across benchmark and real-world datasets, full-scan, model-based, and SAX-based baselines all sacrifice either accuracy or speed, whereas TSseek returns exact answers efficiently. Also, for subsequence workloads, it achieves significant speedups over state-of-the-art subsequence matching engines.

2606.07001 2026-06-11 cs.DB cs.AI Revised version

DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving

Chao Deng, Shaolei Zhang, Ju Fan, Xiaoyong Du

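AI summary: Proposes DataEvolver, the first self-evolving data preparation system, which uses a multi-level mechanism to automatically construct pipelines that transform raw data into high-quality data, improving downstream LLM performance by 10% on average across seven benchmarks.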

Abstract

High-quality training data is essential to large language models (LLMs) and typically requires extensive and costly manual curation. Existing automatic data preparation methods rely on predefined pipelines or customized human instructions, which limits their adaptability to diverse data distributions and lacks principled guidance from high-quality examples. In this paper, we introduce DataEvolver, the first self-evolving data preparation system that automatically constructs pipelines to transform raw data into high-quality data. DataEvolver employs a multi-level mechanism to ensure both pipeline executability and effectiveness. At the operator level, it incrementally expands the operator set to construct a logical plan while resolving dependency conflicts. At the pipeline level, it instantiates logical plans into executable code and iteratively refines pipeline orchestration through a feedback loop that reduces the distribution gap between prepared data and high-quality examples. Experiments on seven benchmarks show that DataEvolver substantially improves data quality and achieves an average 10\% gain in downstream LLM performance compared with training on original data, highlighting new opportunities for the iterative co-evolution of LLMs and data.

2606.01183 2026-06-11 cs.DC cs.DB cs.DS cs.PF Revised version

The World's Fastest Matching Engine Algorithm

Jake Yoon

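AI summary: Proposes two data-structure contributions, the Priority-Indicated Node (PIN) and neighbor-aware tree operations, which eliminate pointer-chasing and root-to-leaf search latency in order books, achieving sub-microsecond tail latency and throughput of tens of millions of messages per second.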

Comments
20 pages, 5 figures, 7 tables
Abstract

A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
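
The idea behind neighbor-aware insertion can be sketched on a plain, unbalanced binary search tree: if the in-order predecessor and successor of a new key are already known, one of them always has a free child slot in the right place, so the node can be attached without a root-to-leaf search. The Python sketch below shows only this attachment step; it omits the single-path rebalancing the abstract mentions, and the `Node` class and function names are hypothetical rather than taken from the paper.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert_between(pred, succ, key):
    """Attach key between known in-order neighbors with O(1) reference writes.

    pred or succ may be None (new minimum or maximum), but not both.
    Hypothetical sketch: no rebalancing and no parent pointers.
    """
    node = Node(key)
    if pred is not None and pred.right is None:
        pred.right = node
    else:
        # pred has a right subtree (or does not exist), so succ is that
        # subtree's leftmost node (or the tree minimum): its left slot is free.
        assert succ is not None and succ.left is None
        succ.left = node
    return node


def inorder(node):
    """Return the keys of the subtree rooted at node in sorted order."""
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)


root = Node(10)
n5 = insert_between(None, root, 5)    # new minimum
n20 = insert_between(root, None, 20)  # new maximum
n15 = insert_between(root, n20, 15)
n7 = insert_between(n5, root, 7)
n12 = insert_between(root, n15, 12)
```

The correctness argument is short: if the predecessor has no right child, the new node belongs there; otherwise the successor is the leftmost node of the predecessor's right subtree and therefore has no left child.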

2602.17001 2026-06-11 cs.AI cs.CL cs.DB Revised version

Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang, Xiping Liu, Shirui Pan, Ming Jin

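AI summary: Proposes Sonar-TS, a neuro-symbolic framework for natural language querying of time series databases that handles continuous morphological intents and ultra-long histories through a Search-Then-Verify pipeline, and introduces the NLQTSBench benchmark for evaluation, demonstrating the method's effectiveness on complex temporal queries.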

Comments
Accepted by ICML 2026
Abstract

Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.

2605.02030 2026-06-11 cs.DB cs.DS Revised version

U-HNSW: An Efficient Graph-based Solution to ANNS Under Universal Lp Metrics

Huayi Wang, Jingfan Meng, Jun Xu

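AI summary: Proposes U-HNSW, the first graph-based method for approximate nearest neighbor search under universal $L_p$ metrics; it builds HNSW indexes on the $L_1$ and $L_2$ metrics and uses an early-termination strategy, with query times up to 2670 times shorter than MLSH.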

Abstract

Approximate nearest neighbor search under universal $L_p$ metrics (ANNS-U-$L_p$) is an important and challenging research problem, as it requires answering queries under all possible values of $p$ ($0 < p \le 2$) simultaneously without building an index for each possible value of $p$. The state-of-the-art solution, called MLSH, is a Locality-Sensitive Hashing (LSH)-based ANNS method with barely acceptable query performance. In contrast, graph-based ANNS methods, which offer significantly improved query efficiency on the ANNS-$L_p$ problem (with a fixed value of $p$), cannot be naively extended to the ANNS-U-$L_p$ problem. In this paper, we propose U-HNSW, the first graph-based method for ANNS-U-$L_p$. Our scheme uses HNSW graph indexes built on two base metrics ($L_1$ and $L_2$) to generate promising nearest-neighbor candidates, and then verifies these candidates with an early-termination strategy that substantially reduces the number of expensive $L_p$ distance computations. Experimental results show that U-HNSW not only achieves up to 2670 times shorter query times than the original MLSH implementation running on a RAM disk (up to 15 times shorter than the idealized MLSH), but also outperforms the original HNSW on the ANNS-$L_p$ problem (with a fixed value of $p$), except for a few special values of $p$.
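
One generic way to avoid paying for a full $L_p$ distance computation when verifying a candidate is early abandoning: accumulate $|x_i - y_i|^p$ coordinate by coordinate and stop once the partial sum already exceeds the current bound raised to the power $p$. The Python sketch below illustrates that idea only; the abstract does not describe U-HNSW's actual early-termination strategy, so the function name and interface are assumptions.

```python
def lp_distance_within(x, y, p, bound):
    """Return the L_p distance between x and y if it is <= bound, else None.

    Generic early-abandoning sketch (hypothetical helper): the running sum of
    |x_i - y_i|^p is non-decreasing, so once it exceeds bound^p the candidate
    can be rejected without touching the remaining coordinates.
    """
    limit = bound ** p
    acc = 0.0
    for a, b in zip(x, y):
        acc += abs(a - b) ** p
        if acc > limit:
            return None
    return acc ** (1.0 / p)


kept = lp_distance_within((0.0, 0.0), (3.0, 4.0), p=2, bound=10.0)
rejected = lp_distance_within((0.0, 0.0), (3.0, 4.0), p=2, bound=4.0)
```

The same loop works for any $p$ in the range $0 < p \le 2$ considered in the abstract, since the accumulated sum is monotone regardless of whether the result is a true metric.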

2604.21413 2026-06-11 cs.DB Revised version

RUBICON: Agentic AI for Messy Enterprise Data

Fabian Wenz, Felix Treutwein, Kai Arenja, Çagatay Demiralp, Michael Stonebraker

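AI summary: To address heterogeneous, access-restricted enterprise data, proposes RUBICON, which replaces text-only LLM pipelines with structured query interfaces and a table-centric integration layer, achieving 100% end-to-end accuracy on its benchmark while substantially reducing latency and cost.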

Comments
4 pages, 1 table
Abstract

Enterprise data exists in many forms, such as tables, text, maps, e-mail, and CAD models, that are access-controlled and hidden behind bespoke interfaces. Current agentic AI systems delegate the entire query workflow to a frontier LLM: a single model interprets the request, selects sources or tools, integrates retrieved evidence, judges completeness, and generates an answer, with few constraints, limited use of schemas, and text as the primary representation throughout. We argue that this is an ineffective abstraction for enterprise data. Reliable agentic AI should instead require structure: a constrained query interface over each source and a table-centric integration layer driven by a query processor. We introduce RUBICON, a system that embodies this vision. RUBICON is based on two observations. First, text-to-SQL fails on real enterprise data and must be dramatically subsetted to achieve reliable results. Second, data integration across disparate corporate datasets is best performed using tables as the core abstraction rather than text-centric LLM pipelines. We evaluate RUBICON on two benchmarks: our enterprise-focused RUBICON-Bench, against agentic baselines, and SemBench, against LOTUS and Palimpzest. On RUBICON-Bench, where queries require coordination across heterogeneous enterprise sources, RUBICON achieves 100% end-to-end accuracy, while all agentic baselines, including single- and multi-agent ReAct systems, produce no correct answers. On SemBench, RUBICON surpasses both LOTUS and Palimpzest: it achieves 14.7% higher accuracy, reduces latency by 62.64%, and lowers token cost by 98.64%, demonstrating that a table-centric architecture better matches enterprise data while yielding significant efficiency gains.

2604.11454 2026-06-11 cs.DB cs.PL Revised version

Foundations of the GraphAlg Language

Daan de Graaf, Robert Brijder, Nikolay Yakovets

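AI summary: Shows how GraphAlg, a domain-specific language for graph algorithms, is built on MATLANG, a formal language for matrix manipulation, through extensions and syntactic sugar, and proves that any GraphAlg program can be simulated in an extension of for-MATLANG that supports simultaneous induction.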

Abstract

The GraphAlg domain-specific language for graph algorithms enables user-defined algorithms in graph databases. In this work we show how GraphAlg is built on top of the formal MATLANG language for matrix manipulation. Starting from MATLANG, we describe the extensions to MATLANG and the syntactic sugar needed to derive GraphAlg. Furthermore, we prove that any GraphAlg program can be simulated in an extension of for-MATLANG that supports simultaneous induction.

2603.24080 2026-06-11 cs.CL cs.DB Revised version

LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

Muhammed Saeed, Simon Razniewski

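AI summary: Proposes the LLMpedia framework, which extracts about 1.3 million encyclopedia articles from three model families and audits them against Wikipedia and web evidence, finding a verifiable true rate well below the MMLU benchmark and revealing a factuality gap in model knowledge.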

Abstract

Benchmarks like MMLU suggest flagship language models approach factuality saturation above 90\%. \emph{LLMpedia} shows this picture is incomplete. We materialize ${\sim}$1.3M encyclopedia articles entirely from parametric memory across three model families, then audit every claim against Wikipedia and curated web evidence. For \texttt{gpt-5-mini}, the verifiable true rate is 68.4\% on Wikipedia-covered subjects - more than 21\,pp below MMLU - and the gap is driven by \emph{unverifiability} (30.5\%), not refutation (1.2\%). Beyond Wikipedia, frontier articles audited against curated web evidence reach 57.6\%; Wikipedia covers only 56.7\% of model-surfaced subjects, and three model families overlap in just 7.3\% of subject choices. In a retrieval-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia is more factual at roughly half the textual similarity to Wikipedia. Every prompt, article, and verdict is released. Data, code, interface: this https URL.