SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering
Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu
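AI summary: Proposes SAFE-Cascade, a system that first obtains an answer from OCR and a lightweight language model, then uses a learned router to decide whether to invoke a VLM. On ChartQA it reaches 69.1% accuracy with a 73.1% VLM invocation rate, reducing VLM calls by 26.9% and estimated cost by 9.3%.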
Comments: Demo paper submitted to CIKM 2026. 4 pages, 2 figures.
Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
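The routing flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function and class names, the callable signatures, the stub components, and the default threshold of 0.5 are all assumptions made for illustration. The sketch also assumes that the router returns the probability that escalation is needed and that the system escalates when this probability meets or exceeds the threshold; the abstract does not specify the direction of the score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    answer: str                    # final answer shown to the user
    text_answer: str               # provisional text-only answer
    escalation_probability: float  # router output (assumed: P(escalation needed))
    escalated: bool                # True if the VLM was invoked


def answer_with_cascade(
    image: bytes,
    question: str,
    ocr: Callable[[bytes], str],
    text_model: Callable[[str, str], str],
    router: Callable[[str, str, str], float],
    vlm: Callable[[bytes, str], str],
    threshold: float = 0.5,  # hypothetical default; the demo lets users adjust it
) -> CascadeResult:
    """Hypothetical OCR -> text-only answer -> router -> optional VLM cascade."""
    ocr_text = ocr(image)                         # step 1: extract chart text
    text_answer = text_model(question, ocr_text)  # step 2: provisional answer
    p = router(question, ocr_text, text_answer)   # step 3: routing probability
    if p >= threshold:                            # step 4: escalate to the VLM
        return CascadeResult(vlm(image, question), text_answer, p, True)
    return CascadeResult(text_answer, text_answer, p, False)


# Stub components standing in for the real services (all hypothetical).
def stub_ocr(image: bytes) -> str:
    return "2019: 40, 2020: 55"


def stub_text_model(question: str, ocr_text: str) -> str:
    return "55"


def stub_router(question: str, ocr_text: str, text_answer: str) -> float:
    return 0.3


def stub_vlm(image: bytes, question: str) -> str:
    return "blue"
```

With these stubs, a threshold of 0.5 accepts the text-only answer because the router score of 0.3 falls below it, while a threshold of 0.2 triggers escalation. Lowering the threshold sends more queries to the VLM and raising it sends fewer, which is the accuracy-cost trade-off the demo lets users explore.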