Natural Language Query to Configuration for Retrieval Agents
Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia
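AI summary: Proposes BRANE, which uses an LLM to convert each query into workload characteristics and trains lightweight predictors to select the best configuration, improving the cost-quality Pareto frontier on several benchmarks.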
Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
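The inference-time selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract says only that predicted correctness is "penalized by cost", so the linear penalty `p - cost_weight * cost` is an assumption, as are all names here (`PipelineConfig`, `select_config`, `cost_weight`, the `hops` feature) and the toy predictors, which stand in for the trained per-configuration models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]
# A predictor maps query features to an estimated probability of a correct answer.
Predictor = Callable[[Features], float]


@dataclass(frozen=True)
class PipelineConfig:
    """One entry in a (hypothetical) pipeline catalog."""
    name: str
    cost: float  # assumed: normalized serving cost per query


def select_config(
    features: Features,
    catalog: List[PipelineConfig],
    predictors: Dict[str, Predictor],
    cost_weight: float,
) -> PipelineConfig:
    """Pick the config with the highest cost-penalized predicted correctness.

    Assumed scoring rule: score = P(correct) - cost_weight * cost.
    Changing cost_weight shifts the cost-quality tradeoff without retraining.
    """
    def score(cfg: PipelineConfig) -> float:
        return predictors[cfg.name](features) - cost_weight * cfg.cost

    return max(catalog, key=score)


# Toy catalog and hand-written predictors, for illustration only.
CATALOG = [
    PipelineConfig(name="cheap", cost=0.1),
    PipelineConfig(name="expensive", cost=1.0),
]
PREDICTORS: Dict[str, Predictor] = {
    # The cheap pipeline is assumed to do well only on single-hop queries.
    "cheap": lambda f: 0.9 if f["hops"] <= 1 else 0.3,
    "expensive": lambda f: 0.95,
}

if __name__ == "__main__":
    for hops in (1.0, 3.0):
        chosen = select_config({"hops": hops}, CATALOG, PREDICTORS, cost_weight=0.2)
        print(hops, chosen.name)
```

In this toy setup, raising `cost_weight` pushes selection toward the cheaper pipeline even on multi-hop queries, while setting it to zero always picks the configuration with the highest predicted correctness.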