The Brain That Goes Quiet: Serving a Large Model's Knowledge at 131 Tokens per Second on an 8 GB Laptop by Removing the Large Model from the Runtime Path
Myeong Jun Jo
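AI summary: This paper proposes an offline knowledge-store approach in which a large model (35B MoE) is used to build a structured knowledge base, while at runtime only a lightweight router and a 1B small model are active. On an 8 GB laptop, end-to-end response time drops from 4.4 s to 0.5 s and throughput rises to 131 tokens/s.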
Comments: 17 pages, 5 figures
In earlier work I showed that a 35B-class Mixture-of-Experts model can be loaded and executed on a consumer laptop with 8 GB of GPU memory. That result solved a placement problem and immediately exposed a different one: even correctly placed, the large model needed roughly four seconds to answer, because it was still being invoked at every query. This paper documents what happened when I stopped invoking it. During an offline phase, the large model reads source documents and writes verified answer entries into a structured knowledge store; at runtime, only a lightweight router, a deterministic renderer, and a 1B-class model are active. On the same 8 GB laptop, end-to-end response time fell from approximately 4,465 ms to 518 ms, effective end-to-end throughput rose from 15.7 to 131 tokens per second, and the small model's streaming decode rate held at 226-237 tokens per second with a time-to-first-token of 29-62 ms. The bottleneck is structural: three different large models (Qwen, Gemma, and GLM class) all showed the same multi-second runtime cost, and all three produced usable knowledge stores offline. On a 563-entry store built from seventeen real documents, keyword routing collapsed to 1.5% top-1 accuracy while BM25-based routing reached 92.8% (99.4% top-3), and a confidence gate raised effective top-1 to 98.0% by escalating 12.3% of queries. Exact-match fidelity of the small model ranged from 9/9 to 0/9 across envelope formats carrying identical content. A 16-case verification gate blocked all ten corrupted entries while admitting all six supported ones.
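The runtime path described in the abstract (BM25-based routing over a prebuilt store, followed by a confidence gate that escalates uncertain queries) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the store schema, the entry IDs, the `BM25Router` class, the `min_score` and `min_margin` thresholds, and the rule that combines them. The abstract does not say how confidence is computed or where escalated queries go, so the sketch only returns an `"escalate"` signal and leaves the handling open.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Router:
    """Hypothetical router: scores store entries with Okapi BM25."""

    def __init__(self, entries, k1=1.5, b=0.75):
        # entries: list of (entry_id, indexed_text, answer_text); schema is assumed.
        self.entries = entries
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(text)) for _, text, _ in entries]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        n = len(self.docs)
        doc_freq = Counter(term for d in self.docs for term in d)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in doc_freq.items()
        }

    def score(self, query_terms, i):
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in query_terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def route(self, query, min_score=1.0, min_margin=0.2):
        """Return ("answer", entry_id), or ("escalate", None) when not confident.

        The gate is an assumed rule: the best score must clear an absolute
        floor and beat the runner-up by a relative margin.
        """
        terms = tokenize(query)
        ranked = sorted(
            ((self.score(terms, i), i) for i in range(len(self.entries))),
            reverse=True,
        )
        best, best_i = ranked[0]
        runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
        confident = best >= min_score and (best - runner_up) >= min_margin * best
        if not confident:
            return ("escalate", None)
        return ("answer", self.entries[best_i][0])


# Toy store standing in for the offline-built knowledge store (contents invented).
STORE = [
    ("wifi-reset", "how to reset the wifi router password",
     "Hold the reset button for ten seconds."),
    ("vpn-setup", "how to configure the office vpn client",
     "Install the client and import the profile."),
    ("printer-jam", "how to clear a paper jam in the printer",
     "Open tray two and remove the sheet."),
]

router = BM25Router(STORE)
```

With this toy store, `router.route("reset wifi password")` selects the `wifi-reset` entry, while an out-of-store query such as `router.route("what is the capital of france")` matches only the common word "the", falls below the score floor, and is escalated. Gating on both an absolute score and a margin over the runner-up is one plausible design. The appropriate thresholds would have to be tuned against a labeled query set, as the reported trade-off between escalation rate and effective top-1 accuracy suggests.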