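arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.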

2604.20127 2026-05-18 cs.LG cs.CY

Trajectory-Aware Reliability Modeling of Democratic Systems

Dmitry Zaytsev, Valentina Kuskova, Michael Coppedge

AI summary: This paper proposes a trajectory-aware reliability model based on Dynamic Causal Neural Autoregression (DCNAR) that captures how degradation propagates through institutional networks; it outperforms traditional survival models and improves prediction of system degradation.

Abstract

Failures in complex systems often emerge through gradual degradation and the propagation of stress across interacting components rather than through isolated shocks. Democratic systems exhibit similar dynamics, where weakening institutions can trigger cascading deterioration in related institutional structures. Traditional reliability and survival models typically estimate failure risk based on the current system state but do not explicitly capture how degradation propagates through institutional networks over time. This paper introduces a trajectory-aware reliability modeling framework based on Dynamic Causal Neural Autoregression (DCNAR). The framework first estimates a causal interaction structure among institutional indicators and then models their joint temporal evolution to generate forward trajectories of system states. Failure risk is defined as the probability that predicted trajectories cross predefined degradation thresholds within a fixed horizon. Using longitudinal institutional indicators, we compare DCNAR-based trajectory risk models with discrete-time hazard and Cox proportional hazards models. Results show that trajectory-aware modeling consistently outperforms Cox models and improves risk prediction for several propagation-driven institutional failures. These findings highlight the importance of modeling dynamic system interactions for reliability analysis and early detection of systemic degradation.
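The risk definition in this abstract (the probability that predicted trajectories cross a degradation threshold within a fixed horizon) can be illustrated with a small Monte Carlo sketch. This is not the DCNAR model: the linear autoregressive update, the coupling matrix, the noise level, and the convention that lower indicator values mean degradation are all assumptions made for illustration.

```python
import random

def simulate_trajectory(x0, coupling, noise_sd, horizon, rng):
    """Roll a linear autoregressive stand-in forward for `horizon` steps."""
    x = list(x0)
    path = []
    for _ in range(horizon):
        x = [
            sum(coupling[i][j] * x[j] for j in range(len(x))) + rng.gauss(0.0, noise_sd)
            for i in range(len(x))
        ]
        path.append(x)
    return path

def crossing_risk(x0, coupling, threshold, horizon, n_paths=2000, noise_sd=0.02, seed=0):
    """Monte Carlo estimate of P(indicator 0 drops below `threshold` within `horizon`)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        path = simulate_trajectory(x0, coupling, noise_sd, horizon, rng)
        if any(state[0] < threshold for state in path):
            hits += 1
    return hits / n_paths

# Hypothetical two-indicator system: indicator 1 feeds into indicator 0.
COUPLING = [[0.95, 0.05],
            [0.00, 0.97]]
risk_weak_neighbor = crossing_risk([0.6, 0.2], COUPLING, threshold=0.5, horizon=10)
risk_strong_neighbor = crossing_risk([0.6, 0.9], COUPLING, threshold=0.5, horizon=10)
```

Both calls start indicator 0 at the same value; only the state of the coupled indicator differs. A model that looks only at the current value of indicator 0 would score the two cases identically, while the trajectory-based estimate separates them.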

2604.19572 2026-05-18 cs.CL

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin

AI summary: This paper proposes TACO, a framework that improves the efficiency and performance of terminal agents through self-evolving compression rules; experiments show notable gains across multiple benchmarks.

Comments: 27 pages

Abstract

As terminal agents scale to long-horizon, multi-turn workflows, a key bottleneck is not merely limited context length, but the accumulation of noisy terminal observations in the interaction history. Retaining raw observations preserves useful environment feedback, but also leads to context saturation and high token cost; conversely, naive compression may discard task-critical signals needed for subsequent actions. Because terminal environments are highly heterogeneous across repositories, commands, and execution states, heuristic-based or fixed-prompt compression methods are difficult to generalize. We propose TACO, a plug-and-play, training-free, self-evolving Terminal Agent Compression framework for existing terminal agents. TACO automatically discovers, refines, and reuses structured compression rules from interaction trajectories, enabling workflow-adaptive filtering of low-value terminal outputs while preserving task-relevant observations. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks, including SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench, show that TACO consistently improves task performance and token efficiency across agent scaffolds and backbone models. On TerminalBench, TACO yields 1%-4% accuracy gains across strong agentic models and improves accuracy by around 2%-3% under the same token budget. On additional terminal-related benchmarks, it reduces total token consumption while maintaining or improving task success rates. These results suggest that self-evolving, workflow-adaptive observation compression is an effective path toward more reliable and efficient long-horizon terminal agents. The code is publicly available at https://github.com/multimodal-art-projection/TACO.
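To make the idea of structured compression rules concrete, here is a minimal sketch of rule-based filtering of terminal output. It does not reproduce TACO: the rule format, the rule names, the patterns, and the `max_plain_lines` cap are hypothetical, and the self-evolving part (discovering and refining rules from trajectories) is not shown.

```python
import re

# Hypothetical rule format (name, pattern, action); not TACO's actual schema.
RULES = [
    ("keep_errors", re.compile(r"error|traceback|failed", re.IGNORECASE), "keep"),
    ("drop_progress_bars", re.compile(r"^\s*\d+%\|"), "drop"),
    ("drop_already_installed", re.compile(r"^Requirement already satisfied"), "drop"),
]

def compress_observation(text, rules=RULES, max_plain_lines=3):
    """Apply the first matching rule per line; cap unmatched lines at max_plain_lines."""
    kept, plain_kept, omitted = [], 0, 0
    for line in text.splitlines():
        action = next((act for _, pat, act in rules if pat.search(line)), None)
        if action == "keep":
            kept.append(line)
        elif action == "drop":
            omitted += 1
        elif plain_kept < max_plain_lines:
            kept.append(line)
            plain_kept += 1
        else:
            omitted += 1
    if omitted:
        # Tell the agent that something was removed, so it can re-run if needed.
        kept.append(f"[{omitted} lines omitted]")
    return "\n".join(kept)

raw = "\n".join([
    "Requirement already satisfied: requests",
    " 50%|#####     | 5/10",
    "Collecting example-package",
    "ERROR: could not build wheel",
])
compressed = compress_observation(raw)
```

Listing the "keep" rule first means an error line survives even if a later "drop" pattern would also match it, which is one simple way to avoid discarding task-critical signals.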

2604.18556 2026-05-18 cs.CL cs.LG

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Maximilian Kleinegger, Dan Alistarh

AI summary: This paper proposes GSQ, a scalar quantization method based on a Gumbel-Softmax relaxation that achieves high accuracy at low bit-widths and is suited to LLM deployment.

Abstract

Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus remains compatible with existing scalar inference kernels. We further show that the same discrete-assignment optimization can be applied to practical GGUF K-Quant checkpoints: starting from publicly released GGUF models, GSQ improves accuracy while projecting the result back into the same deployment format. Finally, GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply. The source code is publicly available at https://github.com/IST-DASLab/GSQ.
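As a rough illustration of a Gumbel-Softmax relaxation over a small discrete grid, the sketch below draws relaxed one-hot weights over three ternary levels and forms a soft quantized value for a single weight. It is not the GSQ implementation: the logits, the scale, the temperature, and the function names are assumptions, and no training loop is shown.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def relaxed_quantize(logits, grid, scale, temperature, rng):
    """Soft value: scale times the weighted average of the grid levels."""
    weights = gumbel_softmax(logits, temperature, rng)
    return scale * sum(w * g for w, g in zip(weights, grid))

def hard_quantize(logits, grid, scale):
    """Deployment-time value: scale times the grid level with the largest logit."""
    best = max(range(len(grid)), key=lambda k: logits[k])
    return scale * grid[best]

rng = random.Random(0)
TERNARY_GRID = [-1.0, 0.0, 1.0]   # symmetric 3-level grid
logits = [0.1, 0.3, 2.0]          # hypothetical learned assignment logits
weights = gumbel_softmax(logits, temperature=1.0, rng=rng)
soft = relaxed_quantize(logits, TERNARY_GRID, scale=0.05, temperature=1.0, rng=rng)
hard = hard_quantize(logits, TERNARY_GRID, scale=0.05)
```

Because the soft value is a differentiable function of the logits and the scale, both could in principle be optimized by gradient descent, with the hard assignment used once optimization ends.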

2604.18145 2026-05-18 cs.CV cs.AI

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

Cong Huy Nguyen, Son Dinh Nguyen, Guanlin Li, Tuan Dung Nguyen, Aditya Narayan Sankaran, Mai Huy Thong, Thanh Trung Nguyen, Mai Hong Son, Reza Farahbakhsh, Phi Le Nguyen, Noel Crespi

AI summary: This paper proposes the VietPET-RoI dataset and the HiRRA framework, which uses graph-enhanced modules to capture dependencies between RoI attributes and improve the clinical reliability of 3D PET/CT report generation; experiments show it outperforms existing methods on BLEU, ROUGE-L, and clinical metrics.

Comments: 16 pages; Accepted to appear in ACL 2026

Abstract

Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

2604.14692 2026-05-18 cs.CV

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

AI summary: This paper proposes Chain-of-Glimpse, a framework that uses search-guided progressive reasoning to handle object variation in video, improving the accuracy and interpretability of multi-step decisions.

Abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in-domain NExTQA and out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

2604.10210 2026-05-18 cs.CV cs.AI cs.LG

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang

AI summary: This paper proposes A3-FPN, which enhances multi-scale feature representation through an asymptotically disentangled framework and content-aware attention modules, improving recognition of small objects in dense prediction tasks.

DOI: 10.1016/j.patcog.2026.113793
Journal ref: Pattern Recognition, 2026, 113793

Abstract

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

2604.08426 2026-05-18 cs.LG cs.AI cs.CL

KV Cache Offloading for Context-Intensive Tasks

Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

AI summary: This paper studies KV-cache offloading on context-intensive tasks. Using the Text2JSON benchmark, it finds that the method degrades performance on Llama 3 and Qwen 3 models, identifies low-rank projection of keys and unreliable landmarks as the main causes, and proposes a simpler alternative strategy that improves accuracy.

Comments: Preprint

Abstract

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.

2604.05966 2026-05-18 cs.CL

FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang, Yixi Zhou, Xunwen Zheng, Yueru He, Yuyang Dai, Georgi Georgiev, Ayesha Gull, Muhammad Usman Safder, Fan Wu, Liyuan Meng, Fengxian Ji, Junning Zhao, Xueqing Peng, Jimin Huang, Yu Chen, Xue, Liu, Preslav Nakov, Zhuohan Xie

AI summary: This paper presents FinReporting, an agentic workflow for localized reporting of cross-jurisdiction financial disclosures. The system builds a unified ontology covering the income statement, balance sheet, and cash flow statement, decomposes reporting into auditable stages, and uses constrained verifiers to improve consistency and reliability.

Comments: Accepted at ACL 2026 Demo Track. 9 pages, including figures and tables

Abstract

Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. Here, we aim to bridge this gap. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.

2604.02812 2026-05-18 cs.RO

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Alessandro Adami, Tommaso Tubaldo, Marco Todescato, Ruggero Carli, Pietro Falco

AI summary: This paper proposes a synthetic neuro-symbolic supervision approach in which vision-language models generate structured robot policies, combining multimodal perception with symbolic control to bridge high-dimensional learning and symbolic control.

Abstract

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in real-world robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, we demonstrate that a 12B-parameter model can learn structured spatial-symbolic mappings required for executable BT synthesis, solely through in-silico supervision. Real-world physical experiments on two heterogeneous robotic manipulators confirm that these structurally constrained policies achieve zero-shot transfer to real-world environments. The results emphasize that the data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity, neuro-symbolic training data.

2603.17915 2026-05-18 cs.CL cs.AI

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Priyaranjan Pattnayak, Sanchari Chowdhuri

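AI summary: This paper introduces the IndicSafe benchmark for evaluating LLM safety in 12 South Asian languages; it finds cross-language agreement of only 12.8% and safe-rate variation above 17%, exposing a gap in multilingual LLM safety generalization.

Abstract (translated from the Chinese version)

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, which are spoken by more than 1.2 billion people yet underrepresented in LLM training data. Using a set of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we evaluate 10 leading LLMs on translated prompt variants. Our analysis reveals significant safety drift: cross-language agreement is only 12.8%, and safe-rate variation exceeds 17%. Some models over-refuse benign prompts in low-resource scripts and over-flag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and a multilingual consistency index. Our findings highlight a critical gap in the safety generalization of multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate language-aware alignment strategies grounded in regional harms.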

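The Company You Keep: How Large Language Models Respond to Dark Triad Traits

Zeyi Lu, Angelica Henestrosa, Pavel Chizhov, Ivan P. Yamshchikov

AI summary: This study examines how LLMs respond to user prompts expressing different Dark Triad traits (manipulation, grandiosity, psychopathy). It finds that the models' corrective and reinforcing behaviors differ across severity levels, with implications for designing safer conversational systems.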

2603.01290 2026-05-18 cs.AI cs.GT cs.LG cs.SY eess.SY

Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy

Kalliopi Kleisarchaki

AI summary: This paper proposes an HMM-POMDP framework for 2026 Formula 1 energy strategy: an HMM infers rival state and a DQN makes decisions, addressing a partially observable game and detecting the counter-harvest trap.

Comments: 17 pages. v3: editorial corrections and bibliographic updates. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry from Australian Grand Prix (8 March 2026) onwards

Abstract

The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode, the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 40-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level (four modes: H, M, L_harvest, L_derate), Override Mode status, and tyre degradation state from six publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap, a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack, and show that detecting it requires belief-state inference over both ERS level and the harvest/derate sub-mode. On synthetic races, the HMM achieves 96.8% ERS-level accuracy (random baseline 25%), classifies L_harvest vs. L_derate with 89.4% accuracy, and detects counter-harvest trap conditions with 96.3% recall. Pre-season analysis indicates circuit-dependent recharge availability (1.0x to 2.2x per lap) as the primary confound; Melbourne is the hardest-case validation environment. Baum-Welch calibration on 2026 race telemetry begins with the Australian Grand Prix (8 March 2026).
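The belief state that the first layer passes to the policy can be illustrated with the standard HMM forward (filtering) update. The sketch below is a toy, not the paper's 40-state model: it uses two hidden states, one binary observation signal, and made-up transition and emission probabilities.

```python
def forward_update(belief, transition, emission, observation):
    """One HMM filtering step: predict with the transition model, then
    reweight by the likelihood of the new observation and renormalize."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalized = [predicted[j] * emission[j][observation] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical toy model. Hidden states: 0 = rival charge high, 1 = rival charge low.
# Observations: 0 = strong deployment seen, 1 = weak deployment seen.
TRANSITION = [[0.9, 0.1],
              [0.2, 0.8]]
EMISSION = [[0.8, 0.2],
            [0.3, 0.7]]

belief = [0.5, 0.5]
first_step = forward_update(belief, TRANSITION, EMISSION, observation=1)
for obs in [1, 1, 1]:
    belief = forward_update(belief, TRANSITION, EMISSION, obs)
```

After repeated weak-deployment observations the belief shifts toward the low-charge state. A deceptive rival that suppresses its deployment signal would produce the same observations, which is why the abstract argues that a finer-grained hidden state is needed to tell the cases apart.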

2602.23410 2026-05-18 cs.LG cs.AI eess.SP q-bio.NC

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Hanning Guo, Hanwen Bi, Farah Abdellatif, Andrei Galbenus, Jon. N. Shah, Abigail Morrison, Jürgen Dammers

AI summary: Brain-OF is jointly pretrained on fMRI, EEG, and MEG data, addressing semantic heterogeneity and resolution discrepancies across modalities and improving cross-modal data processing.

Abstract

Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across different neuroimaging techniques. This limitation largely arises from severe semantic heterogeneity and resolution discrepancies among modalities. To address these challenges, we propose Brain-OF, an omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, to explicitly internalize the characteristics of neural activity through self-supervised learning, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.

2602.21536 2026-05-18 cs.CV

IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model

Pengli Zhu, Yitao Zhu, Haowen Pang, Anqi Qiu

AI summary: This paper proposes IHF-Harmony, which harmonizes multi-modality MRI images with an invertible hierarchy flow model, uses unpaired data to improve scalability across modalities, preserves anatomical structure, and improves downstream task performance.

Abstract

Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code is available at https://github.com/Idea89560041/IHF-Harmony.
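The claim that reversible feature transformations give a bijective mapping with lossless reconstruction can be checked on a single subtractive coupling step. This is a generic coupling-layer sketch, not the IHF architecture: the split into two halves and the `toy_shift` function, standing in for a learned network, are assumptions.

```python
def couple_forward(x_a, x_b, shift_fn):
    """Subtractive coupling: pass x_a through unchanged and subtract
    a function of x_a from x_b."""
    return x_a, [b - s for b, s in zip(x_b, shift_fn(x_a))]

def couple_inverse(y_a, y_b, shift_fn):
    """Exact inverse: y_a equals x_a, so the same shift can be recomputed and added back."""
    return y_a, [b + s for b, s in zip(y_b, shift_fn(y_a))]

def toy_shift(x_a):
    # Hypothetical stand-in for a learned network; it does not need to be invertible.
    return [0.5 * v * v + 1.0 for v in x_a]

features_a = [0.2, -1.3, 4.0]
features_b = [1.0, 0.0, -2.5]
y_a, y_b = couple_forward(features_a, features_b, toy_shift)
rec_a, rec_b = couple_inverse(y_a, y_b, toy_shift)
```

The inverse works for any `shift_fn` because the untouched half is available on both sides, so stacking such steps keeps the overall mapping invertible.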

2602.21141 2026-05-18 cs.CV

SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger

AI summary: This paper presents SynthRender and IRIS, which systematically study bidirectional sim-to-real transfer through synthetic data generation and structured evaluation, provide a 32-class dataset with CAD models, and enable efficient synthetic training for industrial applications.

Abstract

Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning models require large annotated datasets for robust automation under semi-uncontrolled conditions, a major barrier for widespread deployment with proprietary industrial parts. We address this through an integrated framework combining synthetic data generation and structured empirical evaluation for systematic investigation of bidirectional sim-to-real transfer. Our method integrates 2D-to-3D Reality-to-Simulation techniques for 3D asset creation from physical parts with programmatic Guided Domain Randomization (GDR) via SynthRender, an open-source synthetic image generation framework. Structured ablation studies across multiple benchmarks quantify the impact of individual rendering design choices, yielding practical guidelines for data-efficient synthetic training. To support evaluation under realistic industrial conditions, we introduce Industrial Real-Sim Imagery Set (IRIS), a 32-class dataset with diverse textures, intra-class variation, strong inter-class similarities, and 19,672 annotations, providing both CAD models and reconstructed meshes for bidirectional sim-to-real benchmarking. Across three industrial benchmarks, the proposed framework achieves highly competitive performance, reaching 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.


2602.19423 2026-05-18 cs.CV

Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

Jiabao Chen, Shan Xiong, Jialin Peng

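AI summary: This paper proposes Prefer-DAS, which uses local preferences and sparse prompts for annotation-efficient domain adaptive segmentation, combining self-training with prompt-guided contrastive learning to improve segmentation performance and flexibility.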

Abstract

Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures in diverse large-scale electron microscopy (EM) data without requiring extensive annotated data in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical application. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a more realistic yet annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, Prefer-DAS allows the use of full, partial, and even no point prompts during both training and inference, and thus enables interactive segmentation. Instead of using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO), a plug-and-play solution for alignment with spatially varying human feedback. To address potentially missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks demonstrate that our model outperforms SAM-like methods as well as unsupervised and weakly supervised DAS methods in both automatic and interactive segmentation modes, highlighting strong generalizability and flexibility. Additionally, the performance of our model is very close to, or even exceeds, that of supervised models.


2602.08556 2026-05-18 cs.SD

Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction

Chengzhong Wang, Andong Li, Dingding Yao, Junfeng Li

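AI summary: This paper proposes a globally rotation-equivariant magnitude-phase dual-stream framework that improves phase modeling in speech enhancement by enforcing global rotation equivariance in the phase stream; experiments show gains over existing methods on phase retrieval, denoising, dereverberation, and bandwidth extension.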

Comments Submitted to IEEE TASLP

Abstract

While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of the phase. To address this, we propose a magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing the Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual Feed-Forward Network (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/GRE-Net.
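Global rotation equivariance can be stated as f(e^{iθ} z) = e^{iθ} f(z): rotating every complex input by the same phase angle rotates the output by that angle and changes nothing else. The sketch below checks this numerically for a toy layer. It is not the MPICM or HADF module from the paper; the layer, its name, and the tensor shapes are assumptions chosen only to illustrate why a bias-free complex linear map combined with a modulus-only gate preserves the property.

```python
import numpy as np

def modulus_gated_layer(z, weights):
    """Toy layer: bias-free complex linear mix, then a gate that depends only on |.|."""
    mixed = z @ weights            # a linear map commutes with a global phase factor
    gate = np.tanh(np.abs(mixed))  # the modulus is unchanged by a global rotation
    return gate * mixed

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8))  # hypothetical complex features
w = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))  # hypothetical complex weights

theta = 0.7                       # arbitrary global rotation angle
rotation = np.exp(1j * theta)

rotate_then_map = modulus_gated_layer(rotation * z, w)
map_then_rotate = rotation * modulus_gated_layer(z, w)
is_equivariant = np.allclose(rotate_then_map, map_then_rotate)
```

Adding a complex bias, or applying a nonlinearity separately to the real and imaginary parts, would generally break the equality, which is one way to see why ordinary real-valued layers do not respect the circular structure of phase.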


2602.06932 2026-05-18 cs.LG

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu

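AI summary: This paper presents Aurora, a system that uses reinforcement learning to train the speculator online, addressing the deployment lag and domain drift of conventional approaches; experiments show notable speedups on several models.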

Abstract

Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
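The feedback signal described in the abstract (accepted tokens as positive examples, rejected proposals as implicit negatives) can be illustrated with a toy trace. The sketch below assumes greedy verification, in which a drafted token is accepted only if it equals the token the target model would have chosen. It is not Aurora's implementation; the function names and token IDs are hypothetical, and production systems typically use a probabilistic acceptance rule instead.

```python
def count_accepted(draft_tokens, target_tokens):
    """Length of the longest prefix where the draft matches the target model's choice."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted

def feedback_from_trace(draft_tokens, target_tokens):
    """Turn one verification step into (token, reward) pairs for speculator training."""
    n = count_accepted(draft_tokens, target_tokens)
    signals = [(tok, +1) for tok in draft_tokens[:n]]  # accepted tokens: positive feedback
    if n < len(draft_tokens):
        signals.append((draft_tokens[n], -1))          # first rejected proposal: negative feedback
    return signals

# Hypothetical token IDs: the draft diverges from the target at the third position.
signals = feedback_from_trace([5, 9, 2, 7], [5, 9, 4, 1])
```

Tokens drafted after the first rejection are discarded rather than labeled, because they were conditioned on a prefix the target model never produced.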


2602.06470 2026-05-18 cs.CL cs.AI

Improve Large Language Model Systems with User Logs

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

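AI summary: This paper proposes the UNO framework, which distills rules and preference pairs from user logs, handles data heterogeneity with query- and feedback-driven clustering, and quantifies the cognitive gap between model knowledge and log data to improve LLM system performance.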

ExplainerPFN: Towards Tabular Foundation Models for Model-Free Zero-Shot Feature Importance Estimation

Joao Fonseca, Julia Stoyanovich

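AI summary: This paper proposes ExplainerPFN, a tabular foundation model built on TabPFN and pretrained on synthetic structural causal data, which estimates feature importance zero-shot without model access and is competitive on real and synthetic datasets.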

Comments 35 pages, 11 figures

Abstract

Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require direct access to the underlying model, an assumption frequently violated in real-world deployments. We investigate whether meaningful feature attributions can be obtained in a zero-shot setting, using only the input data distribution and no evaluations of the target model. Because multiple models can produce identical predictions yet yield different Shapley decompositions, the mapping from data to attributions is not uniquely identifiable. We therefore target attributions that are "true to the data" rather than "true to the model", learning a posterior mean attribution under a meta-training prior. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN, pretrained on synthetic structural causal datasets supervised with exact or near-exact Shapley values, that predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are fourfold: (1) we show that few-shot surrogate explainers achieve high SHAP fidelity with as few as two reference observations; (2) we propose ExplainerPFN, the first zero-shot method for estimating Shapley-value-style feature attributions without access to the underlying model or reference explanations, providing a principled attribution where no existing explainer can be applied; (3) we release an open-source implementation including the full training pipeline and synthetic data generator; and (4) through extensive experiments on real and synthetic datasets, we show that ExplainerPFN achieves performance competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.
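As background for the Shapley values mentioned above, the sketch below computes them exactly by enumerating every coalition of features, which is feasible only for a handful of features. Note that this computation requires calling the value function, which is exactly the model access that ExplainerPFN is designed to do without. The toy value function and its payoffs are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n_features):
    """Exact Shapley values: weighted average of each feature's marginal contribution."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Probability that a coalition of this size precedes feature i in a random ordering.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi

def toy_value(coalition):
    """Hypothetical payoff: feature 0 adds 2, feature 1 adds 1, features 1 and 2 together add 4."""
    value = 0.0
    if 0 in coalition:
        value += 2.0
    if 1 in coalition:
        value += 1.0
    if 1 in coalition and 2 in coalition:
        value += 4.0
    return value

phi = exact_shapley(toy_value, 3)
```

The interaction payoff of 4 is split equally between features 1 and 2, so the attributions are 2, 3, and 2, and they sum to the payoff of the full coalition (7).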


2601.21798 2026-05-18 cs.CV

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

Junming Huang, Chi Wang, Letian Li, Guangkai Xu, Donglin Huang, Hao Chen, Qiang Dai, Weiwei Xu

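AI summary: This paper proposes CG-MLLM, a multimodal large language model capable of 3D captioning and high-resolution 3D generation; a Mixture-of-Transformer architecture decouples the different modeling needs and combines a pretrained vision-language model with a dedicated 3D VAE latent space, improving 3D generation quality and perception.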

Comments ICML 2026

Abstract

Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. Beyond generation, we further observe that learning to produce 3D content transfers back to perception, strengthening the model's image-based 3D understanding.


2601.21636 2026-05-18 cs.LG cs.CR stat.ML

Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

Jan Schuchardt, Nikita Kalinin

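AI summary: This paper proposes sampling-free bounds, based on Rényi divergence and conditional composition, for differential-privacy amplification of matrix factorization under random allocation; they avoid the high-probability-only guarantees and random abstention of sampling-based methods and apply to arbitrary banded and non-banded matrices.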

Abstract

We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring $(ε,δ)$-DP is inversely proportional to $δ$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small $ε$, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.
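To make the link between Rényi divergence bounds and $(ε,δ)$-DP concrete, the sketch below applies the standard conversion: a mechanism with Rényi DP parameter ε_α at order α satisfies (ε_α + log(1/δ)/(α−1), δ)-DP, and one minimizes over α. It uses the plain Gaussian mechanism with sensitivity 1, whose Rényi bound is α/(2σ²), as a stand-in. It does not implement the paper's dynamic-programming bounds for matrix mechanisms; the function names, the α grid, σ, and δ are all assumptions for illustration.

```python
import math

def gaussian_rdp(alpha, sigma):
    """Renyi DP of the Gaussian mechanism with sensitivity 1 at order alpha."""
    return alpha / (2.0 * sigma ** 2)

def rdp_to_dp(rdp_fn, delta, alphas):
    """Standard RDP-to-(eps, delta)-DP conversion, minimized over a grid of orders."""
    return min(rdp_fn(a) + math.log(1.0 / delta) / (a - 1.0) for a in alphas)

sigma = 2.0    # hypothetical noise multiplier
delta = 1e-5   # hypothetical target delta
alphas = [1.0 + k / 10.0 for k in range(1, 1000)]  # orders 1.1, 1.2, ..., 100.9

epsilon = rdp_to_dp(lambda a: gaussian_rdp(a, sigma), delta, alphas)

# Closed-form minimum over continuous alpha for this mechanism, used as a sanity check.
closed_form = 1.0 / (2.0 * sigma ** 2) + math.sqrt(2.0 * math.log(1.0 / delta)) / sigma
```

The additive log(1/δ)/(α−1) term is one reason this conversion can be loose when the target ε is small, which is consistent with the abstract's remark that Rényi bounds over-approximate in that regime.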
