arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20539 2026-06-19 cs.DB cs.DS New submission

Caching for Dollars, Not Hits: An Exact Offline Reference for Cloud-Egress Caching and the Crossover That Decides When It Pays

Madhulatha Mandarapu, Sandeep Kunkunuru

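AI summary: For caching that targets cloud-storage egress fees rather than latency, the paper gives an exact polynomial-time offline-optimal policy, finds that LRU's dollar regret rises with cost dispersion while cost-aware GreedyDual cuts it substantially, and derives a closed-form crossover that decides when cost-aware caching is needed.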

Comments 6 pages, 3 figures. Code, benchmarks, and full pre-registration: https://github.com/samyama-ai/cloud-egress-cache

Abstract

When a cache miss fetches from cloud object storage, the bill is per GET request and per byte of egress, not latency. Classic caching minimizes the miss rate, the wrong objective: a rarely but expensively fetched object can cost thousands of times more dollars than a frequently but cheaply fetched one. Generalized-caching theory bounds the miss-cost objective, but no reported benchmark measures how far deployed heuristics sit from the dollar-optimal offline policy on real cloud prices. We supply that reference. For uniform-size page caches with heterogeneous miss costs the offline dollar-optimum is exact in polynomial time via an integral interval linear program -- validated against brute force; variable sizes are NP-hard, so we extend the flow-based offline bound from the hit-ratio objective to dollars (cost-FOO), tight to about four percent. Against this reference we find: (i) a heterogeneity-regret law -- LRU's dollar-regret rises with miss-cost dispersion (Spearman 0.87) while cost-aware GreedyDual cuts it to roughly a tenth; (ii) a contention frontier -- GreedyDual's residual regret collapses to near zero exactly when the budget fits the expensive working set, and is the open slice otherwise; and (iii) a closed-form crossover s* = GET_fee/egress_rate (about 4 KB on S3, 330 B on GCS) that predicts which deployments need dollar-aware caching at all. On a real Twitter trace the price vector alone moves the workload across s*, shifting the regime as predicted. The artifact is a reproducible billing-faithful benchmark; heuristics and bounds it builds on are prior work, credited.
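
The crossover in (iii) and the cost-aware eviction in (i) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the function and class names are hypothetical, the prices (a GET fee of 0.0004 USD per 1,000 requests and egress at 0.09 USD per GB) are illustrative values that happen to land near the quoted 4 KB figure, and the eviction rule is the textbook GreedyDual for uniform-size pages rather than the benchmark's own code.

```python
def crossover_size_bytes(get_fee_usd, egress_usd_per_byte):
    """Object size s* at which the per-request fee equals the egress charge."""
    return get_fee_usd / egress_usd_per_byte


def miss_cost_usd(size_bytes, get_fee_usd, egress_usd_per_byte):
    """Dollars billed for one miss: one GET plus the bytes sent out."""
    return get_fee_usd + size_bytes * egress_usd_per_byte


class GreedyDual:
    """Textbook GreedyDual for uniform-size pages with per-page miss costs."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.inflation = 0.0  # running offset L; rises on every eviction
        self.credit = {}      # page -> H value (L at last touch + miss cost)

    def access(self, page, cost):
        """Serve one request; return the dollars paid (0.0 on a hit)."""
        if page in self.credit:
            self.credit[page] = self.inflation + cost  # hit: restore credit
            return 0.0
        if len(self.credit) >= self.capacity:
            victim = min(self.credit, key=self.credit.get)
            self.inflation = self.credit.pop(victim)   # evict the lowest H
        self.credit[page] = self.inflation + cost
        return cost


# Illustrative prices (assumptions, not taken from the paper).
GET_FEE = 0.0004 / 1000   # USD per GET request
EGRESS = 0.09 / 1e9       # USD per byte of egress
s_star = crossover_size_bytes(GET_FEE, EGRESS)  # roughly 4.4 KB

cache = GreedyDual(capacity=2)
trace = [("a", 10.0), ("b", 1.0), ("c", 1.0), ("a", 10.0)]
total = sum(cache.access(page, cost) for page, cost in trace)
```

On this four-request trace the sketch keeps the expensive page "a" and pays 12.0 in total, whereas an LRU cache of the same size would evict "a" on the third request and pay 22.0. One plausible reading of s* is that objects well below it are billed mostly for the request, so miss costs are nearly uniform, while objects well above it are billed mostly for bytes.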

2606.20318 2026-06-19 cs.DB New submission

AgenticDB: Agentic Performance Reconfiguration for Database Workloads

Xinyue Yang, Chaozheng Wang, Chen Zheng, Heng Zhang, Yanjun Wu

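AI summary: Proposes AgenticDB, a framework that performs DBMS-level and OS-level reconfiguration through runtime interaction, diagnosing bottlenecks and accumulating experience; on MySQL and PostgreSQL it improves performance by 118.1% on average.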

Abstract

Database configuration tuning is critical for workload performance, but practical tuning on real deployments remains difficult. Existing automatic tuners mostly formulate tuning as iterative search over DBMS knob values. This formulation leads to high execution cost, prematurely narrows the configuration space, and leaves practical requirements insufficiently addressed: diagnosing runtime bottlenecks from system feedback, exploring OS-level reconfiguration opportunities, executing changes robustly, and learning from previous trials and tasks. We propose AgenticDB, an agentic framework for database workload reconfiguration. AgenticDB implements a context-grounded harness that interacts with the target database environment by proposing DBMS- and OS-level changes, applying them under safety constraints, observing workload performance and runtime states, and using execution feedback to guide subsequent decisions. This runtime interaction enables AgenticDB to diagnose bottlenecks, explore a broader DBMS- and OS-level reconfiguration space, avoid unsafe or unsupported actions, and accumulate experience within and across reconfiguration tasks. As a result, AgenticDB turns database tuning into a self-refining reconfiguration process in which runtime feedback iteratively improves later decisions. We conduct extensive experiments on MySQL and PostgreSQL using YCSB, Sysbench, and TPC-H workloads. The results show that AgenticDB achieves the best final performance on all evaluated workloads, improving over the strongest baseline by 118.1% on average and reducing aggregate time-to-best by 22.6%. The results also demonstrate that its OS-level action space, robust execution lifecycle, and memory-enhanced planning contribute to more effective and practical database reconfiguration.

2606.19969 2026-06-19 cs.DB cs.DC New submission

The Bi-Channel Networking Paradigm for Database Systems in the Cloud

Georg Kreuzmayr, Muhammad El-Hindi, Benjamin Wagner, Tobias Ziegler, Viktor Leis

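AI summary: To address the kernel TCP stack becoming a database performance bottleneck on modern high-speed cloud networks, the paper proposes a bi-channel networking paradigm that separates communication into a high-performance data path and a reliable control path, combining user-space UDP with kernel TCP to achieve high throughput at low overhead in a distributed shuffle and a replicated key-value store.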

Comments Accepted to EDBT 2027 (Lille, France)

Abstract

When network links were slow, cloud and distributed database systems could rely on generic kernel abstractions and treat network communication as a black box. With today's fast cloud networks, this approach breaks down: database performance becomes limited by the CPU overhead of the kernel TCP stack. Replacing TCP with user-space UDP can reduce this overhead, but it requires reimplementing essential guarantees, such as reliability and ordering. To solve this conundrum, database systems should no longer treat networking as a black box but co-design it with database operations. We propose the bi-channel paradigm for database systems, which separates communication into two channels: A high-performance data path for latency- and bandwidth-sensitive operations, and a reliable control path for coordination and recovery. We implement the paradigm by combining user-space UDP and kernel-based TCP, though other stack combinations are possible. This design exploits modern NIC capabilities while preserving TCP's reliability. We demonstrate the paradigm's efficiency and simplicity in two representative settings: a distributed shuffle saturating 200 Gbit/s with three CPU cores, and a replicated key-value store processing millions of messages per second.

2606.19898 2026-06-19 cs.DB cs.IR New submission

Query-aware Routing for Filtered Approximate Nearest Neighbors Search

Qianqian Xiong, Mengxuan Zhang

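AI summary: Proposes a query-aware routing framework in which a lightweight ML model predicts each candidate method's recall and an offline benchmark table is used to select the best recall-QPS trade-off, reaching state-of-the-art performance on five unseen datasets.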

Comments 12 pages

Abstract

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall-QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three, and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
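
The routing step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' router: the `Config` type, the method and parameter names, the per-configuration recall predictions, and the selection rule (fastest configuration predicted to reach a target recall, with a fallback to the highest predicted recall) are all hypothetical choices, since the abstract does not spell out how the trade-off is scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    method: str   # hypothetical method name
    params: str   # hypothetical parameter setting
    qps: float    # throughput measured offline for this setting


def route(predicted_recall, table, target_recall):
    """Pick the fastest configuration predicted to reach target_recall.

    predicted_recall maps (method, params) to the model's recall estimate
    for the current query. If nothing reaches the target, fall back to the
    configuration with the highest predicted recall.
    """
    def recall(cfg):
        return predicted_recall[(cfg.method, cfg.params)]

    feasible = [cfg for cfg in table if recall(cfg) >= target_recall]
    if feasible:
        return max(feasible, key=lambda cfg: cfg.qps)
    return max(table, key=recall)


# Hypothetical offline benchmark table and per-query predictions.
table = [
    Config("prefilter", "exact", qps=800.0),
    Config("graph", "ef=64", qps=5000.0),
    Config("graph", "ef=256", qps=1500.0),
]
predicted = {
    ("prefilter", "exact"): 1.00,
    ("graph", "ef=64"): 0.82,
    ("graph", "ef=256"): 0.95,
}
choice = route(predicted, table, target_recall=0.90)
```

With these made-up numbers the router picks the `graph` method at `ef=256`: the faster `ef=64` setting is predicted to miss the recall target, and the exact pre-filter reaches it only at lower throughput.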

2606.19803 2026-06-19 cs.DB cs.AI cs.LG New submission

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Lakshmi Sahithi Yalamarthi, Primal Pappachan

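AI summary: Presents a vision for policy-aware vector search, formalizing the fine-grained access control (FGAC) policy model and the enforcement problem in vector databases, comparing enforcement strategies, and identifying future challenges.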

Comments Accepted at SeQureDB 26, Sigmod 2026

Abstract

Vector databases are increasingly used in security-sensitive contexts such as Retrieval Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC), which is required to ensure that data access adheres to user-specific policies, is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension between enforcing FGAC policies correctly, achieving high ANN search recall, and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.

2606.19576 2026-06-19 cs.DB cs.DC New submission

REMOP: REmote-Memory-aware OPerator Optimization

Shiquan Zhang, Yunhao Mao, Yuqiu Zhang, Gengrui Zhang, Jeyhun Karimov, Hans-Arno Jacobsen

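AI summary: To address excessive data-transfer rounds in query processing under remote memory, the paper proposes REMOP, a framework that optimizes out-of-memory execution with transfer-round-aware intra-operator memory policies; implemented for three operators in DuckDB, it reduces transfer rounds by up to 97% and operator runtime by up to 48%.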

Comments 14 pages, 13 figures, 9 tables. Preprint, under review

Abstract

Remote and disaggregated memory tiers expand the effective memory capacity of analytical database engines, but they also reshape the cost structure of out-of-memory query processing. When an operator spills beyond local DRAM, moving pages to remote memory incurs both data-transfer time and a fixed round-trip latency per transfer. Classical operator analyses and buffer-allocation heuristics primarily target disk spilling by minimizing total I/O volume. Under remote memory, these strategies can be suboptimal because they may trigger excessive transfer rounds. We present REMOP, a remote-memory-aware operator optimization framework that uses transfer-round-aware intra-operator memory policies to improve out-of-memory execution under tight memory budgets. REMOP introduces the number of transfer rounds into the latency cost model and derives operator-specific buffer-partitioning strategies, instantiating the approach for blocked nested-loop join, external merge sort, and external hash join in DuckDB. Our evaluation on a two-node compute-memory testbed shows that REMOP reduces transfer rounds by up to 97% and operator runtime by up to 48% on spill-heavy microbenchmarks, and lowers the average runtime of spilling TPC-H and TPC-DS queries by 22.7% and 26.4% end-to-end.
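
To make the role of transfer rounds concrete, the following Python sketch applies a simplified latency model, rounds times round-trip time plus pages times per-page transfer time, to a blocked nested-loop join. It is an assumption-laden illustration and not REMOP's cost model: the function names, the accounting of one round per buffer fill, the omission of an output buffer, and the numeric parameters are all hypothetical.

```python
import math


def bnlj_rounds_and_pages(outer_pages, inner_pages, outer_buf, inner_buf):
    """Transfer rounds and pages moved for a blocked nested-loop join.

    Each outer block is fetched in one round; the inner relation is then
    re-read once per outer block, inner_buf pages per round.
    """
    passes = math.ceil(outer_pages / outer_buf)
    rounds = passes + passes * math.ceil(inner_pages / inner_buf)
    pages = outer_pages + passes * inner_pages
    return rounds, pages


def latency_s(rounds, pages, rtt_s, page_transfer_s):
    """Fixed round-trip cost per round plus bandwidth cost per page."""
    return rounds * rtt_s + pages * page_transfer_s


def best_split(outer_pages, inner_pages, mem_pages, rtt_s, page_transfer_s):
    """Try every split of mem_pages (>= 2) and return (cost, outer, inner)."""
    best = None
    for outer_buf in range(1, mem_pages):
        inner_buf = mem_pages - outer_buf
        rounds, pages = bnlj_rounds_and_pages(
            outer_pages, inner_pages, outer_buf, inner_buf)
        cost = latency_s(rounds, pages, rtt_s, page_transfer_s)
        if best is None or cost < best[0]:
            best = (cost, outer_buf, inner_buf)
    return best


# Volume-minimizing split: almost all memory to the outer block.
volume_first = bnlj_rounds_and_pages(1000, 1000, outer_buf=99, inner_buf=1)
# A balanced split moves more pages but needs far fewer rounds.
balanced = bnlj_rounds_and_pages(1000, 1000, outer_buf=50, inner_buf=50)
best = best_split(1000, 1000, mem_pages=100, rtt_s=20e-6, page_transfer_s=1e-6)
```

In this toy setting the volume-minimizing split needs 11,011 rounds to move 12,000 pages, while the balanced split needs 420 rounds to move 21,000 pages. Which one is faster depends on the ratio of round-trip latency to per-page transfer time, which is the trade-off the abstract attributes to remote memory.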

2606.19751 2026-06-19 cs.DB math.OC New submission

DeQL: A Decision Query Language for Prescriptive Analytics over Relational Data

Matteo Brucato, Fjodor Kholodkov, Soren Little, Jakob Mayer, Duc Nguyen

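AI summary: DeQL extends SQL to support decision queries through two constructs, CREATE CANDIDATES and DECIDE, which define the option space, constraints, and objective; it covers decisions such as subset selection, allocation, and scheduling, and supports optimization under uncertainty and model scoring.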

Abstract

DeQL (Decision Query Language) extends SQL to express decision queries: given options drawn from relational data, constraints from policy, and a measurable objective, a DeQL query computes the best course of action. Two constructs carry the extension: CREATE CANDIDATES, which defines the space of options from relational sources, and DECIDE, which declares decision variables, named constraints, and an objective over them. The design follows SQL's principles: the user states what to optimize while the engine chooses how to solve it, every query consumes and produces relations, and the structure of a problem stays visible to the engine. This document specifies the language (its design principles, syntax, formal grammar, and execution model) with examples spanning subset selection, allocation, assignment, scheduling, and decisions at multiple levels of aggregation, and extensions for optimization under uncertainty, inline model scoring, and time- and quality-bounded solving. It is the first version of the specification; the language is under active development, and this version fixes the core constructs on which later revisions will build.

2606.20523 2026-06-19 cs.CV cs.AI cs.DB Cross-list

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

Affiliations: DEMR-ONERA – The French Aerospace Lab, Université Paris-Saclay; DTIS-ONERA – The French Aerospace Lab, Université Paris-Saclay; Hugging Face

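AI summary: To address the scarcity of data aligning high-resolution SAR with optical imagery and text, the authors build a dataset of SAR-optical-text triplets on an 80 cm slant-range grid from Umbra SLC data, supporting cross-modal retrieval and generation tasks.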

Abstract

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20cm-2m native resolution), we standardize all SAR data to an 80cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.

2606.20461 2026-06-19 cs.LG cs.CY cs.DB Cross-list

Data Bias Mitigation under Coverage Constraints & The Price of Fairness

Bruno Scarone, Alfredo Viola, Renée J. Miller

Affiliations: Khoury College of Computer Sciences, Northeastern University; Cheriton School of Computer Science, University of Waterloo

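AI summary: To address bias affecting intersectional groups defined by multiple sensitive attributes, the paper extends a bias mitigation framework with coverage constraints, optimizes mitigation strategies via integer linear programming, trades bias approximation error against data efficiency, and characterizes the price of fairness.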

Comments Accepted to FAccT 2026

Abstract

Machine learning models have been shown to exhibit discriminatory outcomes or degraded performance for individuals at the intersection of multiple sensitive attributes, such as race and gender. This stems in part from two interrelated challenges: the lack of principled measures for quantifying bias (potentially intersectional), and insufficient representation of intersectional subgroups in training data. We extend a recent bias mitigation framework to incorporate coverage constraints that enforce sufficient representation across groups, including intersectional subgroups. Since achieving exactly zero bias for all groups may not be data efficient (meaning it may require large amounts of data), our solution trades small approximation errors in bias for greater data efficiency while satisfying coverage constraints. We also formulate bias mitigation as an integer linear program that optimizes over all mitigation strategies, and characterize the price of fairness, the minimum data modification cost, as a function of fairness tolerance. This is essential both for legal compliance, where regulations may mandate specific fairness thresholds, and for data governance, enabling practitioners to make informed trade-offs between bias reduction and data modification (particularly, data purchasing) costs. We evaluate our techniques on publicly available datasets, demonstrating that bias mitigation via our framework preserves predictive accuracy across multiple classifiers, and that coverage constraints, while motivated by statistical considerations, are essential for preserving downstream ML performance.

2606.20388 2026-06-19 cs.HC cs.AI cs.DB Cross-list

DataMagic: Transforming Tabular Data into Data Insight Video

Yupeng Xie, Chen Ma, Zhenyang Wang, Liangwei Wang, Jiayi Zhu, Chuxuan Zeng, Zhouan Shen, Boyan Li, Yuyu Luo

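AI summary: Proposes DataMagic, a system that uses the declarative specification DVSpec and a multi-agent architecture to turn raw tabular data and natural language queries into narrative data-insight videos, with support for interactive exploration.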

Comments 5 pages, 3 figures, accepted at VLDB 2026

Abstract

Data videos integrate dynamic charts, voice narration, and synchronized animations to communicate data insights as temporal narratives, making them an effective medium for improving data consumption efficiency in the data management lifecycle. However, producing high-quality data videos requires expertise spanning data analysis, narrative design, and video production. Existing approaches fall short: static visualization tools (e.g., BI dashboards) lack narrative logic and animation; authoring tools require users to pre-prepare visualizations rather than working from raw data; pixel-level video generation models cannot guarantee data fidelity or provenance. We demonstrate DataMagic, an end-to-end interactive system that transforms raw tabular data and natural language queries into narrative data-insight videos. To ensure data fidelity, DataMagic introduces the declarative specification DVSpec, which binds visual and animation elements to underlying data fields through data-driven semantic references. To address the combinatorial explosion of the design space, DataMagic adopts a Generate-then-Orchestrate multi-agent architecture that generates candidate scenes in parallel and then optimizes narrative coherence through global orchestration. Leveraging DVSpec's decoupling of logic and rendering, the system further supports three interaction modes and structured provenance-based data Q&A, transforming one-way videos into explorable interactive data interfaces. Evaluation on 109 real-world samples validates the effectiveness of DataMagic. Homepage: https://datamagic-home.github.io/

2606.20208 2026-06-19 cs.AI cs.DB cs.NE Cross-list

Beyond Accuracy: Measuring Logical Compliance of Predictive Models

Guillaume Olivier Delplanque, Pierre Genevès, Nabil Layaïda, Zephirin Faure

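AI summary: Proposes the Rule Violation Score (RVS), an evaluation metric independent of predictive accuracy that quantifies how well a predictive model complies with logical rules, and shows experimentally that two models with similar accuracy can exhibit very different logical compliance.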

Abstract

Machine learning models are predominantly evaluated through predictive performance metrics such as ranking quality, prediction error, or classification accuracy. While these metrics effectively quantify how closely predictions match the ground truth, they do not assess whether model outputs respect predefined logical or domain-specific constraints. In high-stakes applications, including healthcare, finance, and autonomous systems, logical consistency can be as critical as predictive accuracy, yet no standard metric captures this dimension. We introduce the Rule Violation Score (RVS), a complementary evaluation metric that quantifies the extent to which a predictive model respects a given set of logical rules, independently of predictive accuracy. RVS treats hard rules (strict constraints) and soft rules (statistical regularities) differently, can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary, and can be computed using SQL queries that are automatically generated for Horn rules. Beyond evaluating models, RVS can also evaluate the logical consistency of training datasets and help identify poorly defined rules. We evaluate RVS on three benchmarks covering knowledge graph link prediction and relational regression, including rule-based, embedding-based, and neuro-symbolic predictive models. Our results demonstrate that two models achieving comparable predictive accuracy can exhibit substantially different levels of logical compliance, revealing differences in model behavior that standard metrics fail to capture.
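
The idea of checking a Horn rule with SQL can be sketched as follows, using Python's built-in `sqlite3` module. The abstract does not give the RVS formula, so the ratio computed here (violated body groundings over all body groundings) is only an assumed stand-in, and the rule, table names, and rows are hypothetical. The rule checked is born_in(p, c) AND city_of(c, k) implies nationality(p, k), with the `nationality` table playing the role of model predictions.

```python
import sqlite3


def demo_db():
    """Build a tiny in-memory database with hypothetical facts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE born_in(person TEXT, city TEXT);
        CREATE TABLE city_of(city TEXT, country TEXT);
        CREATE TABLE nationality(person TEXT, country TEXT);
        INSERT INTO born_in VALUES ('ada', 'london'), ('bo', 'lyon');
        INSERT INTO city_of VALUES ('london', 'uk'), ('lyon', 'france');
        INSERT INTO nationality VALUES ('ada', 'uk'), ('bo', 'spain');
    """)
    return conn


# Body of the Horn rule: every (person, city, country) grounding.
BODY = "FROM born_in b JOIN city_of c ON b.city = c.city"

# A grounding is violated when the rule head is missing.
HEAD_MISSING = (
    " WHERE NOT EXISTS (SELECT 1 FROM nationality n"
    " WHERE n.person = b.person AND n.country = c.country)"
)


def violation_ratio(conn):
    """Share of body groundings whose head is absent (assumed metric)."""
    body = conn.execute("SELECT COUNT(*) " + BODY).fetchone()[0]
    violated = conn.execute(
        "SELECT COUNT(*) " + BODY + HEAD_MISSING).fetchone()[0]
    return violated / body if body else 0.0


ratio = violation_ratio(demo_db())
```

Here the rule body has two groundings and one of them (the person `bo`, predicted as `spain` instead of `france`) lacks the head, so the ratio is 0.5. A hard rule might be expected to score zero, while a soft rule could tolerate a nonzero baseline; how RVS actually distinguishes the two is not stated in the abstract.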

2606.19692 2026-06-19 cs.CR cs.DB cs.IR Cross-list

When Global Gating Is Enough: Admission-Time Hubness Control in Anisotropic Vector Retrieval Systems

Prashant Kumar Pathak, Tarun Kumar Sharma

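AI summary: To address the poisoning risk that vector hubness creates in retrieval-augmented generation, the paper proposes admission-time control that scores documents against sentinel queries and quarantines hub-like documents; a global gate achieves high recall and a low false-positive rate across multiple datasets.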

Abstract

Vector hubness, where a few points become nearest neighbors of many queries, creates a poisoning risk in retrieval-augmented generation (RAG): one injected document can influence unrelated requests. Existing defenses use periodic reverse-kNN scans, leaving an exposure window and repeated corpus-wide work. We study admission-time control, scoring each candidate against sentinel queries and quarantining hub-like documents before insertion. Across two 100,000-document corpora, five encoders, and disjoint attacker and defender query sets, a global gate achieves recall 1.0 at the decisive embedding-space point (>=0.92 across the effective range) and 0.91 +/- 0.07 on HotFlip attacks, with 1% false positives on general documents. A per-topic gate provides no reliable benefit, consistent with anisotropy coupling local and global visibility. Thresholds are maintained incrementally, with corpus-size-independent insertion cost and amortized deletion cost. On HNSW, admission adds about 3.1% to ingestion latency, scoring remains flat to 10^6 vectors, and 1.2% of decisions flip under approximate indexing, none involving attacks. Provenance complements the gate for natural or tight-domain hubs.
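
A global admission gate of the kind described above might look like the following Python sketch. It is a guess at the general shape, not the paper's scoring function: the hub score used here (the fraction of sentinel queries whose current top-k the candidate would enter), the precomputed `kth_best_sims` list, and the threshold value are all assumptions, and the vectors are toy two-dimensional examples.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hub_score(candidate, sentinels, kth_best_sims):
    """Fraction of sentinel queries whose top-k the candidate would enter.

    kth_best_sims[i] is the similarity of sentinel i's current k-th nearest
    neighbor in the corpus (assumed to be maintained elsewhere).
    """
    hits = sum(
        1 for query, kth in zip(sentinels, kth_best_sims)
        if dot(candidate, query) > kth
    )
    return hits / len(sentinels)


def admit(candidate, sentinels, kth_best_sims, max_hub_score):
    """Admit the document unless it looks hub-like; otherwise quarantine."""
    return hub_score(candidate, sentinels, kth_best_sims) <= max_hub_score


# Toy sentinel queries pointing in four directions (hypothetical).
sentinels = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
kth_best_sims = [0.5, 0.5, 0.5, 0.5]

focused = (1.0, 0.0)   # close to one sentinel only
broad = (0.6, 0.6)     # close to two sentinels at once

focused_ok = admit(focused, sentinels, kth_best_sims, max_hub_score=0.3)
broad_ok = admit(broad, sentinels, kth_best_sims, max_hub_score=0.3)
```

The focused vector enters one of four top-k lists (score 0.25) and is admitted, while the broad vector enters two (score 0.5) and is quarantined. Because the decision uses only the sentinels and their stored thresholds, its cost does not grow with corpus size, which is in the spirit of the corpus-size-independent insertion cost the abstract reports.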

2606.09824 2026-06-19 cs.DB Updated version

TSseek: Regular Expression-Based Similarity Search for Distributed Time Series Datasets

Xiaoshuai Li, Khalid Alnuaim, Mohamed Y. Eltabakh, Elke A. Rundensteiner

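AI summary: Proposes the TSseek framework, whose regular-expression query language supports searching for trends, value ranges, and wildcard patterns, and builds the distributed spatial index TSseek-X for efficient exact matching.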

Comments Extended version with full ablation studies and additional experiments. v3 corrects bibliographic metadata for several references

Abstract

Similarity search is a fundamental operation in time series analysis. Most existing techniques, however, require users to supply a precise sequence of values (typically an entire time series object) as the query input. This rigid requirement limits real-world applications, where users instead want to express patterns, trends, or value ranges. Flexible, pattern-based search has been explored in text retrieval and complex event processing, but remains underexplored for large-scale distributed time series. To close this gap, we propose TSseek, a regular-expression-powered search framework for distributed time series datasets. TSseek's query language enables users to compose patterns encompassing trends, value ranges, and wildcard segments. We show that conventional approximation techniques (e.g., PAA and SAX) and their index structures are ill-suited for such queries because they cannot operate on regular-expression query constructs. In TSseek, we map the time series objects and the query constructs into the same space by approximating time series objects as sequences of line segments that retain both trend (slope direction) and value range, and translating query constructs into bounding rectangles. To support efficient processing, we build TSseek-X, a distributed spatial index over the time series segments. TSseek supports two fundamental query types, namely whole-matching queries (over entire series) and subsequence-matching queries (over arbitrary windows within a series). Across benchmark and real-world datasets, full-scan, model-based, and SAX-based baselines all sacrifice either accuracy or speed, whereas TSseek returns exact answers efficiently. Also, for subsequence workloads, it achieves significant speedups over state-of-the-art subsequence matching engines.

2606.01183 2026-06-19 cs.DC cs.DB cs.DS cs.PF Updated version

The World's Fastest Matching Engine Algorithm

Jake Yoon

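AI summary: Proposes two data-structure contributions, the Priority-Indicated Node (PIN) and neighbor-aware tree operations, that eliminate the latency of pointer chasing and root-to-leaf search in order books, achieving sub-microsecond tail latency and tens of millions of messages per second.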

Comments 20 pages, 5 figures, 7 tables

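AI Chinese abstract (translated)

Every electronic exchange relies on an order book whose storage layer determines matching latency. The dominant implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to reach the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality when liquidity is most needed. We present two data-structure contributions that eliminate these costs. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, each carrying a per-slot indicator of the entry's global priority. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. The second addresses a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, as in ordered event streams, incremental index maintenance, and electronic trading. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, and apply uniformly to red-black, AVL, and B/B+-tree variants. A single CPU core sustains 32 million order messages per second at sub-microsecond tail latency under micro-bursts of millions of messages per second, 5-11 times faster than the best open-source matching engine on the same hardware. Scaled to a single 96-core instance, the engine sustains 640 million messages per second across 10,000 symbols.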

English abstract

A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
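
The attach and detach steps of neighbor-aware insertion and deletion can be illustrated on a plain in-order doubly linked list of price levels. This Python sketch covers only the constant number of reference writes made once a neighbor is known; it is not the paper's structure, it omits the balanced tree and the single-path rebalancing entirely, and the `Level` class and function names are hypothetical.

```python
class Level:
    """One price level, threaded to its in-order neighbors."""
    __slots__ = ("price", "prev", "next")

    def __init__(self, price):
        self.price = price
        self.prev = None
        self.next = None


def insert_after(pred, node):
    """Attach node right after a known predecessor with O(1) writes."""
    succ = pred.next
    node.prev, node.next = pred, succ
    pred.next = node
    if succ is not None:
        succ.prev = node


def remove(node):
    """Detach node using only its own neighbor references."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None


def prices_from(head):
    """Walk the list from head and collect prices in order."""
    out = []
    while head is not None:
        out.append(head.price)
        head = head.next
    return out


low, high = Level(100), Level(102)
insert_after(low, high)

mid = Level(101)
insert_after(low, mid)          # caller already knows the predecessor
after_insert = prices_from(low)

remove(mid)
after_remove = prices_from(low)
```

No search is performed in either operation: the caller supplies the neighbor, so the work is a fixed number of pointer updates regardless of how many levels exist. In a real balanced tree the same attach step would still have to be followed by rebalancing, which this sketch does not attempt.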

2604.08552 2026-06-19 cs.DB cs.AI Updated version

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

Affiliations: Division of Computational Medicine, Stanford University; Department of Biology, University of Pennsylvania

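AI summary: Proposes an LLM-based metadata standardization system that queries standard guidelines and ontology services in real time; validated on 839 HuBMAP records, it significantly improves prediction accuracy over an LLM-only approach.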

Abstract

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Even when standard metadata reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries standard reporting guidelines and authoritative biomedical terminology services in real time to retrieve canonically correct standards on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical approach to automated standardization of biomedical metadata.