arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.20127 2026-05-18 cs.LG cs.CY

Trajectory-Aware Reliability Modeling of Democratic Systems

Dmitry Zaytsev, Valentina Kuskova, Michael Coppedge

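AI summary: This paper proposes a trajectory-aware reliability model based on Dynamic Causal Neural Autoregression (DCNAR) that captures how degradation propagates through institutional networks; it outperforms traditional survival models and improves prediction of systemic degradation.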

English abstract

Failures in complex systems often emerge through gradual degradation and the propagation of stress across interacting components rather than through isolated shocks. Democratic systems exhibit similar dynamics, where weakening institutions can trigger cascading deterioration in related institutional structures. Traditional reliability and survival models typically estimate failure risk based on the current system state but do not explicitly capture how degradation propagates through institutional networks over time. This paper introduces a trajectory-aware reliability modeling framework based on Dynamic Causal Neural Autoregression (DCNAR). The framework first estimates a causal interaction structure among institutional indicators and then models their joint temporal evolution to generate forward trajectories of system states. Failure risk is defined as the probability that predicted trajectories cross predefined degradation thresholds within a fixed horizon. Using longitudinal institutional indicators, we compare DCNAR-based trajectory risk models with discrete-time hazard and Cox proportional hazards models. Results show that trajectory-aware modeling consistently outperforms Cox models and improves risk prediction for several propagation-driven institutional failures. These findings highlight the importance of modeling dynamic system interactions for reliability analysis and early detection of systemic degradation.
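
The risk definition in this abstract (the probability that predicted trajectories cross a degradation threshold within a fixed horizon) can be illustrated with a small Monte Carlo sketch. This is not the authors' DCNAR model: a hand-written linear autoregressive system stands in for the learned dynamics, and the names `simulate_trajectory`, `crossing_risk`, and the `coupling` matrix are hypothetical.

```python
import random


def simulate_trajectory(state, coupling, noise_sd, horizon, rng):
    """Roll a linear autoregressive system forward and return the visited states."""
    path = []
    for _ in range(horizon):
        state = [
            sum(coupling[i][j] * state[j] for j in range(len(state)))
            + rng.gauss(0.0, noise_sd)
            for i in range(len(state))
        ]
        path.append(state)
    return path


def crossing_risk(state, coupling, threshold, horizon,
                  n_paths=2000, noise_sd=0.05, seed=0):
    """Fraction of simulated paths in which any indicator drops below threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        path = simulate_trajectory(state, coupling, noise_sd, horizon, rng)
        if any(min(step) < threshold for step in path):
            hits += 1
    return hits / n_paths


# Assumed toy system: indicator 0 decays slowly and feeds into indicator 1.
coupling = [[0.95, 0.0], [0.10, 0.85]]
risk = crossing_risk([0.6, 0.7], coupling, threshold=0.4, horizon=5)
print(f"estimated 5-step crossing risk: {risk:.3f}")
```

Because the off-diagonal entry of `coupling` lets one indicator pull another down, the estimate reflects propagation between indicators rather than only the current state, which is the contrast with state-based hazard models that the abstract draws.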

2604.19572 2026-05-18 cs.CL

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin

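AI summary: This paper proposes TACO, a framework that improves the efficiency and performance of terminal agents through self-evolving compression rules; experiments show significant gains on multiple benchmarks.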

Comments 27 pages

English abstract

As terminal agents scale to long-horizon, multi-turn workflows, a key bottleneck is not merely limited context length, but the accumulation of noisy terminal observations in the interaction history. Retaining raw observations preserves useful environment feedback, but also leads to context saturation and high token cost; conversely, naive compression may discard task-critical signals needed for subsequent actions. Because terminal environments are highly heterogeneous across repositories, commands, and execution states, heuristic-based or fixed-prompt compression methods are difficult to generalize. We propose TACO, a plug-and-play, training-free, self-evolving Terminal Agent Compression framework for existing terminal agents. TACO automatically discovers, refines, and reuses structured compression rules from interaction trajectories, enabling workflow-adaptive filtering of low-value terminal outputs while preserving task-relevant observations. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks, including SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench, show that TACO consistently improves task performance and token efficiency across agent scaffolds and backbone models. On TerminalBench, TACO yields 1%-4% accuracy gains across strong agentic models and improves accuracy by around 2%-3% under the same token budget. On additional terminal-related benchmarks, it reduces total token consumption while maintaining or improving task success rates. These results suggest that self-evolving, workflow-adaptive observation compression is an effective path toward more reliable and efficient long-horizon terminal agents. The code is publicly available at https://github.com/multimodal-art-projection/TACO.

2604.18556 2026-05-18 cs.CL cs.LG

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Maximilian Kleinegger, Dan Alistarh

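AI summary: This paper proposes GSQ, a scalar quantization method based on a Gumbel-Softmax relaxation that achieves high accuracy at low bit widths and is suited to LLM deployment.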

English abstract

Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3 and 8 levels for ternary and 3 bpp, respectively), making optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus remains compatible with existing scalar inference kernels. We further show that the same discrete-assignment optimization can be applied to practical GGUF K-Quant checkpoints: starting from publicly released GGUF models, GSQ improves accuracy while projecting the result back into the same deployment format. Finally, GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply. The source code is publicly available at https://github.com/IST-DASLab/GSQ.
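
The idea of relaxing a discrete grid assignment with Gumbel-Softmax can be sketched for a single weight as follows. This is only an illustration of the general relaxation, not the GSQ implementation: the abstract does not give the loss, the logit parameterization, or how scales are learned, so `gumbel_softmax`, `relaxed_weight`, and the example logits are assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng=None):
    """Softmax over (logits + Gumbel noise) / temperature.

    When rng is None the noise is skipped, giving a plain tempered softmax.
    """
    if rng is None:
        noisy = list(logits)
    else:
        # Gumbel(0, 1) sample: -log(-log(u)), with u clamped away from 0.
        noisy = [
            l - math.log(-math.log(max(rng.random(), 1e-12)))
            for l in logits
        ]
    scaled = [v / temperature for v in noisy]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_weight(logits, levels, scale, temperature, rng=None):
    """Expected grid value under the relaxed assignment, times the group scale."""
    probs = gumbel_softmax(logits, temperature, rng)
    return scale * sum(p * q for p, q in zip(probs, levels))


levels = [-1.0, 0.0, 1.0]   # ternary grid: 3 levels
logits = [0.1, 0.2, 2.0]    # assumed assignment logits for one weight
soft = relaxed_weight(logits, levels, scale=0.5, temperature=1.0)
hard = relaxed_weight(logits, levels, scale=0.5, temperature=0.01)
print(soft, hard)
```

At high temperature the weight is a smooth mixture of grid levels, so gradients can reach the logits; as the temperature falls the mixture collapses onto one level, which is what makes the final result a valid point on the scalar grid.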

2604.18145 2026-05-18 cs.CV cs.AI

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

Cong Huy Nguyen, Son Dinh Nguyen, Guanlin Li, Tuan Dung Nguyen, Aditya Narayan Sankaran, Mai Huy Thong, Thanh Trung Nguyen, Mai Hong Son, Reza Farahbakhsh, Phi Le Nguyen, Noel Crespi

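AI summary: This paper introduces the VietPET-RoI dataset and the HiRRA framework, which uses graph-enhanced modules to capture dependencies between RoI attributes and improve the clinical reliability of 3D PET/CT report generation; experiments show it outperforms existing methods on BLEU, ROUGE-L, and clinical metrics.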

Comments 16 pages; Accepted to appear in ACL 2026

English abstract

Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

2604.14692 2026-05-18 cs.CV

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

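AI summary: This paper proposes Chain-of-Glimpse, a framework that uses search-guided progressive reasoning to handle object variation in video, improving the accuracy and interpretability of multi-step decisions.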

English abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

2604.10210 2026-05-18 cs.CV cs.AI cs.LG

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang

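AI summary: This paper proposes A3-FPN, which strengthens multi-scale feature representation through an asymptotically disentangled framework and content-aware attention modules, improving small-object recognition in dense prediction tasks.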

Journal ref
Pattern Recognition, 2026, 113793
English abstract

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN) to augment multi-scale feature representation via an asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Code is available at https://github.com/mason-ching/A3-FPN.

2604.08426 2026-05-18 cs.LG cs.AI cs.CL

KV Cache Offloading for Context-Intensive Tasks

Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

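AI summary: This paper studies KV-cache offloading on context-intensive tasks. Using the Text2JSON benchmark, it finds that the approach degrades performance on Llama 3 and Qwen 3 models, identifies low-rank projection of keys and unreliable landmarks as the main causes, and proposes a simpler alternative strategy that improves accuracy.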

Comments Preprint

English abstract

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy, low-rank projection of keys and unreliable landmarks, and we propose a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.

2604.05966 2026-05-18 cs.CL

FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang, Yixi Zhou, Xunwen Zheng, Yueru He, Yuyang Dai, Georgi Georgiev, Ayesha Gull, Muhammad Usman Safder, Fan Wu, Liyuan Meng, Fengxian Ji, Junning Zhao, Xueqing Peng, Jimin Huang, Yu Chen, Xue Liu, Preslav Nakov, Zhuohan Xie

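AI summary: This paper presents FinReporting, an agentic workflow for localized reporting of cross-jurisdiction financial disclosures. The system builds a unified ontology spanning the income statement, balance sheet, and cash flow statement, decomposes reporting into auditable stages, and uses constrained verifiers to improve consistency and reliability.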

Comments Accepted at ACL 2026 Demo Track. 9 pages, including figures and tables

English abstract

Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. Here, we aim to bridge this gap. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.

2604.02812 2026-05-18 cs.RO

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Alessandro Adami, Tommaso Tubaldo, Marco Todescato, Ruggero Carli, Pietro Falco

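AI summary: This paper proposes a synthetic neuro-symbolic supervision approach in which a vision-language model generates structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control.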

English abstract

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in real-world robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, we demonstrate that a 12B-parameter model can learn structured spatial-symbolic mappings required for executable BT synthesis, solely through in-silico supervision. Real-world physical experiments on two heterogeneous robotic manipulators confirm that these structurally constrained policies achieve zero-shot transfer to real-world environments. The results emphasize that the data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity, neuro-symbolic training data.
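
For readers unfamiliar with Behavior Trees, the kind of structured policy the abstract refers to can be sketched with two standard composite nodes. This is a generic textbook-style illustration, not the grammar or executor used in the paper; the node classes, the `world` dictionary, and the pick-and-place actions are all hypothetical, and the usual RUNNING status is omitted for brevity.

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"


class Action:
    """Leaf node: runs a function on the world state and reports its outcome."""

    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def tick(self, world):
        return SUCCESS if self.fn(world) else FAILURE


class Sequence:
    """Succeeds only if every child succeeds, ticking them in order."""

    def __init__(self, *children):
        self.children = children

    def tick(self, world):
        for child in self.children:
            if child.tick(world) == FAILURE:
                return FAILURE
        return SUCCESS


class Fallback:
    """Succeeds as soon as one child succeeds, trying them in order."""

    def __init__(self, *children):
        self.children = children

    def tick(self, world):
        for child in self.children:
            if child.tick(world) == SUCCESS:
                return SUCCESS
        return FAILURE


def is_holding(world):
    return world["holding"]


def move_to_object(world):
    world["at_object"] = True
    return True


def grasp(world):
    if world["at_object"]:
        world["holding"] = True
    return world["holding"]


def place(world):
    if world["holding"]:
        world["holding"], world["placed"] = False, True
    return world["placed"]


tree = Sequence(
    Fallback(
        Action("holding?", is_holding),
        Sequence(Action("move", move_to_object), Action("grasp", grasp)),
    ),
    Action("place", place),
)
world = {"holding": False, "at_object": False, "placed": False}
status = tree.tick(world)
print(status, world)
```

The tree itself is plain data, which is why a model can emit it as text under a constrained grammar while the leaf actions stay tied to hardware-specific controllers.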

2603.17915 2026-05-18 cs.CL cs.AI

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Priyaranjan Pattnayak, Sanchari Chowdhuri

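AI summary: This paper presents the IndicSafe benchmark, which evaluates LLM safety across 12 South Asian languages; it finds cross-language agreement of only 12.8% and safe-rate variance above 17%, exposing safety generalization gaps in multilingual LLMs.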

English abstract

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts, overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
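
Two of the quantities named in this abstract, cross-language agreement and prompt-level entropy, can be sketched as follows. The abstract does not give the exact definitions, so this is one plausible reading: agreement is taken as the share of prompts that receive the same label in every language, and entropy is the Shannon entropy of a prompt's labels across languages. The function names, language codes, and labels are illustrative assumptions.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the safety labels one prompt receives."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cross_language_agreement(label_table):
    """Share of prompts whose label is identical in every language."""
    agreeing = sum(
        1 for per_language in label_table.values()
        if len(set(per_language.values())) == 1
    )
    return agreeing / len(label_table)


# Made-up labels for two prompts in three languages.
label_table = {
    "prompt_1": {"hi": "SAFE", "bn": "SAFE", "ta": "SAFE"},
    "prompt_2": {"hi": "SAFE", "bn": "UNSAFE", "ta": "SAFE"},
}
agreement = cross_language_agreement(label_table)
entropies = {
    prompt: label_entropy(list(per_language.values()))
    for prompt, per_language in label_table.items()
}
print(agreement, entropies)
```

Under this reading, a prompt with zero entropy is handled consistently across languages, while higher entropy flags prompts whose safety verdict depends on the language of the translation.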

2603.15269 2026-05-18 cs.CV

Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps

Kim Ouan, Noémie Moreau, Katarzyna Bozek

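AI summary: This paper uses self-supervised pretrained ImageNet features to grade tortuosity in in vivo confocal microscopy without segmentation maps, improving accuracy and sensitivity.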

Comments 7 pages, 4 figures, MIDL 2026 - Short Paper Track

English abstract

The tortuosity of corneal nerve fibers is used as an indicator of different diseases. Current state-of-the-art methods for grading tortuosity rely heavily on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state of the art in terms of accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.

2603.13452 2026-05-18 cs.AI cs.CY cs.LG

MESD: A Risk-Sensitive Metric for Explanation Fairness Across Intersectional Subgroups

Gideon Popoola, John Sheppard

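AI summary: This paper proposes MESD, a procedural fairness metric that measures differences in explanation quality across intersectional subgroups by combining label-aware aggregation, empirical-Bayes shrinkage, and CVaR weighting; a multi-objective optimization framework, UEF, jointly optimizes utility, outcome fairness, and procedural fairness.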

English abstract

Fairness in machine learning is predominantly evaluated through outcome-oriented metrics, such as demographic parity, which measure whether predictions are statistically consistent across protected groups. However, these metrics cannot detect whether a model uses systematically different reasoning for different demographic groups, which violates procedural fairness principles. This problem is compounded by intersectionality, where models may appear fair on individual attributes (e.g., race) while exhibiting significant disparities for intersectional subgroups (e.g., race × gender), a phenomenon known as fairness gerrymandering. In this work, we introduce Multi-category Explanation Stability Disparity (MESD), a procedural fairness metric that quantifies disparities in explanation quality across intersectional subgroups formed by the Cartesian product of multiple protected attributes. MESD integrates three components: label-aware aggregation aligned with outcome-conditional fairness, empirical-Bayes shrinkage to stabilize estimates for small intersectional groups, and Conditional Value-at-Risk (CVaR) weighting to emphasize worst-case subgroup disparities. We integrate MESD within a multi-objective optimization framework (UEF) that jointly optimizes utility, outcome fairness, and procedural fairness using NSGA-II. We evaluate MESD and UEF on three benchmark datasets alongside four state-of-the-art methods, and we demonstrate that MESD reveals procedural disparities invisible to outcome metrics alone. We position our contribution within procedural justice theory and discuss implications for regulatory compliance and intersectional equity.
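
The CVaR weighting mentioned in this abstract can be illustrated with the standard empirical estimator: average only the worst tail of the subgroup disparities instead of all of them. This is a generic sketch, not the MESD formula, which the abstract does not spell out; the function name `cvar`, the `alpha` convention, and the disparity values are assumptions.

```python
import math


def cvar(values, alpha):
    """Mean of the worst (largest) ceil((1 - alpha) * n) values.

    alpha = 0 averages everything; alpha close to 1 keeps only the extreme tail.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Round before ceil so float noise such as 3.0000000000000004 stays 3.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(values), 9)))
    worst = sorted(values, reverse=True)[:tail_size]
    return sum(worst) / tail_size


# Made-up explanation-quality disparities for four intersectional subgroups.
disparities = [0.1, 0.2, 0.3, 0.4]
mean_disparity = cvar(disparities, alpha=0.0)
tail_disparity = cvar(disparities, alpha=0.5)
print(mean_disparity, tail_disparity)
```

A plain mean would let three well-served subgroups mask one badly served subgroup; the tail average keeps the worst cases visible, which is the motivation the abstract gives for CVaR.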

2603.05377 2026-05-18 cs.RO cs.CV

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Esteban Padilla-Cerdio, Boyang Sun, Marc Pollefeys, Hermann Blum

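AI summary: This paper proposes OpenFrontier, a framework that treats navigation as a sparse subgoal identification and reaching problem; it needs no task-specific training or fine-tuning, works with a range of vision-language prior models, and shows zero-shot performance and real-robot deployment.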

English abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select visual frontiers as semantic anchors and propose OpenFrontier, a navigation framework that requires no task-specific training or fine-tuning and seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D semantic mapping, task-specific policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

2603.04299 2026-05-18 cs.CL

The Company You Keep: How LLMs Respond to Dark Triad Traits

Zeyi Lu, Angelica Henestrosa, Pavel Chizhov, Ivan P. Yamshchikov

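AI summary: This study examines how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy); it finds that the balance of corrective and reinforcing behavior varies with severity, with implications for designing safer conversational systems.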

English abstract

Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this pattern arises from training objectives that reward user satisfaction over accuracy, it may become problematic when interacting with user prompts that reflect negative social tendencies. Such responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models, whereby all models predominantly exhibit corrective behavior, while showing reinforcing output in certain cases. Model behavior also depends on the severity level and differs in the sentiment of the response. Our findings raise implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.

2603.01290 2026-05-18 cs.AI cs.GT cs.LG cs.SY eess.SY

Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy

Kalliopi Kleisarchaki

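AI summary: This paper proposes an HMM-POMDP framework for 2026 Formula 1 energy strategy: an HMM infers rival state and a DQN makes deployment decisions, addressing the partially observable game and detecting counter-harvest traps.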

Comments 17 pages. v3: editorial corrections and bibliographic updates. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry from Australian Grand Prix (8 March 2026) onwards

英文摘要

The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode, the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 40-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level (four modes: H, M, L_harvest, L_derate), Override Mode status, and tyre degradation state from six publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap, a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack, and show that detecting it requires belief-state inference over both ERS level and the harvest/derate sub-mode. On synthetic races, the HMM achieves 96.8% ERS-level accuracy (random baseline 25%), classifies L_harvest vs. L_derate with 89.4% accuracy, and detects counter-harvest trap conditions with 96.3% recall. Pre-season analysis indicates circuit-dependent recharge availability (1.0x to 2.2x per lap) as the primary confound; Melbourne is the hardest-case validation environment. Baum-Welch calibration on 2026 race telemetry begins with the Australian Grand Prix (8 March 2026).
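The first layer described above is a filtering problem: keep a probability distribution over a rival's hidden state and update it as each telemetry observation arrives. A minimal sketch of that belief update (the HMM forward step) follows, reduced to the four ERS modes only. The transition and emission numbers and the single binary `deploy`/`no_deploy` signal are illustrative assumptions, not values from the paper, whose model has 40 states and six signals.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def forward_update(belief, transition, likelihood):
    """One HMM filtering step: predict with the transition matrix,
    then reweight by the likelihood of the new observation."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return normalize([predicted[j] * likelihood[j] for j in range(n)])


STATES = ["H", "M", "L_harvest", "L_derate"]

# Hypothetical row-stochastic transition matrix between ERS modes.
TRANSITION = [
    [0.80, 0.20, 0.00, 0.00],
    [0.10, 0.70, 0.10, 0.10],
    [0.00, 0.30, 0.70, 0.00],
    [0.00, 0.20, 0.00, 0.80],
]

# Hypothetical P(observation | state) for one binary telemetry signal.
EMISSION = {
    "deploy": [0.9, 0.6, 0.1, 0.2],
    "no_deploy": [0.1, 0.4, 0.9, 0.8],
}

belief = [0.25, 0.25, 0.25, 0.25]  # uniform prior over the four modes
for observation in ["no_deploy", "no_deploy", "deploy"]:
    belief = forward_update(belief, TRANSITION, EMISSION[observation])

for state, probability in zip(STATES, belief):
    print(f"{state}: {probability:.3f}")
```

The resulting `belief` vector is the kind of input a downstream policy, such as the DQN layer in the abstract, would consume in place of the unobservable true state.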

2602.23410 2026-05-18 cs.LG cs.AI eess.SP q-bio.NC

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Brain-OF:一种适用于fMRI、EEG和MEG的多功能基础模型

Hanning Guo, Hanwen Bi, Farah Abdellatif, Andrei Galbenus, Jon. N. Shah, Abigail Morrison, Jürgen Dammers

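AI summary: Brain-OF is jointly pretrained on fMRI, EEG, and MEG data to address semantic heterogeneity and resolution discrepancies across modalities, improving cross-modal processing.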

详情
AI中文摘要

脑基础模型在多种神经科学任务中取得了显著进展。然而,现有模型多局限于单一功能模态,限制了其利用互补的时空动态和不同神经成像技术的集体数据规模的能力。这一限制主要源于模态间的严重语义异质性和分辨率差异。为解决这些问题,我们提出了Brain-OF,一种联合预训练fMRI、EEG和MEG的多功能脑基础模型,能够在统一框架内处理单模态和多模态输入。为协调异构的时空分辨率,我们引入了Any-Resolution神经信号采样器,将多样化的脑信号投影到共享的语义空间。为进一步管理语义偏移,Brain-OF的主干整合了DINT注意力与稀疏专家混合模型,其中共享专家捕捉模态不变的表示,路由专家专注于模态特定的语义。此外,为了通过自监督学习显式内化神经活动的特征,我们提出了Masked Temporal-Frequency Modeling,一种双域预训练目标,联合重建时间和频率域中的脑信号。Brain-OF在包含约40个数据集的大型语料库上进行预训练,并在多样化的下游任务中表现出色,突显了联合多模态集成和双域预训练的优势。

英文摘要

Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across different neuroimaging techniques. This limitation largely arises from severe semantic heterogeneity and resolution discrepancies among modalities. To address these challenges, we propose Brain-OF, an omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, to explicitly internalize the characteristics of neural activity through self-supervised learning, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.

2602.21536 2026-05-18 cs.CV

IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model

IHF-Harmony:基于可逆分层流模型的多模态磁共振图像统一化

Pengli Zhu, Yitao Zhu, Haowen Pang, Anqi Qiu

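AI summary: This paper proposes IHF-Harmony, which harmonizes multi-modality MRI images with an invertible hierarchy flow model, uses unpaired data to scale across modalities, preserves anatomy, and improves downstream task performance.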

详情
AI中文摘要

回顾性MRI统一化受限于跨模态的可扩展性差和依赖旅行受试者数据集。为解决这些问题,我们引入IHF-Harmony,一种统一的可逆分层流框架,用于使用无配对数据的多模态统一化。通过将翻译过程分解为可逆的特征转换,IHF-Harmony保证了双射映射和无损重建,以防止解剖扭曲。具体而言,可逆分层流(IHF)通过分层减法耦合逐步去除与伪影相关的特征,而伪影感知归一化(AAN)则利用解剖固定特征调节来准确转移目标特征。结合解剖和伪影一致性损失目标,IHF-Harmony实现了高保真的统一化,保留了源解剖结构。在多个MRI模态上的实验表明,IHF-Harmony在解剖保真度和下游任务性能方面均优于现有方法,促进了大规模多中心成像研究的稳健统一化。代码可在https://github.com/Idea89560041/IHF-Harmony获取。

英文摘要

Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code is available at https://github.com/Idea89560041/IHF-Harmony.
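The bijectivity claim rests on a standard property of coupling layers: if one half of the features passes through unchanged and the other half is shifted by a function of the first half, the transform can be undone exactly by adding the shift back. The sketch below illustrates a single subtractive coupling step on a flat feature list. The `shift` function is a hypothetical stand-in for a learned network, and the even-length split is an assumption made for brevity; the actual IHF architecture is in the authors' repository.

```python
def shift(kept_half):
    """Hypothetical stand-in for a learned network that predicts
    the component to subtract from the other half."""
    return [0.5 * value + 1.0 for value in kept_half]


def coupling_forward(features):
    """Subtractive coupling: the first half is kept, the second half
    has shift(first half) subtracted. Assumes an even-length input."""
    half = len(features) // 2
    kept, changed = features[:half], features[half:]
    return kept + [c - s for c, s in zip(changed, shift(kept))]


def coupling_inverse(outputs):
    """Exact inverse: the kept half is still available, so the same
    shift can be recomputed and added back."""
    half = len(outputs) // 2
    kept, changed = outputs[:half], outputs[half:]
    return kept + [c + s for c, s in zip(changed, shift(kept))]


original = [1.0, 2.0, 3.0, 4.0]
encoded = coupling_forward(original)
restored = coupling_inverse(encoded)
print(encoded, restored)
```

Because the inverse never needs to invert `shift` itself, the shift can be an arbitrarily complex function while the overall mapping stays lossless.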

2602.21141 2026-05-18 cs.CV

SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

SynthRender 和 IRIS:用于工业物体感知双向仿真-现实迁移的开源框架和数据集

Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger

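AI summary: This paper presents SynthRender and IRIS. It systematically studies bidirectional sim-real transfer through synthetic data generation and structured evaluation, and provides a 32-class dataset with CAD models to support data-efficient synthetic training in industrial settings.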

详情
AI中文摘要

物体感知对于机器人物料搬运和质量检测等任务至关重要。然而,现代监督深度学习模型需要大量标注数据以在半受控条件下实现稳健自动化;这是在自有工业部件上广泛应用的主要障碍。我们通过整合合成数据生成和结构化经验评估的框架,系统研究双向仿真-现实迁移。我们的方法结合2D到3D的现实到仿真技术,通过SynthRender开源框架的程序化引导域随机化(GDR)从物理部件创建3D资产。跨多个基准的结构化消融研究量化了单个渲染设计选择的影响,得出实用的高效合成训练指南。为支持在现实工业条件下的评估,我们引入工业现实-仿真图像集(IRIS),包含32类,具有多样的纹理、类内变化、强类间相似性,并有19,672个注释,提供CAD模型和重建网格用于双向仿真-现实基准测试。在三个工业基准上,所提框架实现了高度竞争性的性能,达到99.1% mAP@50在公开机器人数据集、98.3% mAP@50在汽车基准和95.3% mAP@50在IRIS上。

英文摘要

Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning models require large annotated datasets for robust automation under semi-uncontrolled conditions, a major barrier to widespread deployment with proprietary industrial parts. We address this through an integrated framework combining synthetic data generation and structured empirical evaluation for systematic investigation of bidirectional sim-to-real transfer. Our method integrates 2D-to-3D Reality-to-Simulation techniques for 3D asset creation from physical parts with programmatic Guided Domain Randomization (GDR) via SynthRender, an open-source synthetic image generation framework. Structured ablation studies across multiple benchmarks quantify the impact of individual rendering design choices, yielding practical guidelines for data-efficient synthetic training. To support evaluation under realistic industrial conditions, we introduce the Industrial Real-Sim Imagery Set (IRIS), a 32-class dataset with diverse textures, intra-class variation, strong inter-class similarities, and 19,672 annotations, providing both CAD models and reconstructed meshes for bidirectional sim-to-real benchmarking. Across three industrial benchmarks, the proposed framework achieves highly competitive performance, reaching 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.

2602.19423 2026-05-18 cs.CV

Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

Prefer-DAS: 从局部偏好和稀疏提示中学习,用于电子显微镜领域自适应分割

Jiabao Chen, Shan Xiong, Jialin Peng

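AI summary: This paper proposes Prefer-DAS, which uses local preferences and sparse prompts for annotation-efficient domain adaptive segmentation, combining self-training with prompt-guided contrastive learning to improve segmentation performance and flexibility.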

详情
AI中文摘要

领域自适应分割(DAS)是一种有前景的范式,用于从各种大规模电子显微镜(EM)数据中界定细胞内结构,而无需在每个领域内耗费大量标注数据。然而,普遍的无监督领域自适应(UDA)策略往往表现出有限且有偏的性能,阻碍了其实际应用。在本研究中,我们探索稀疏点和局部人类偏好作为目标领域的弱标签,从而提出一个更加现实且标注高效的设置。具体而言,我们开发了Prefer-DAS,它开创了稀疏提示学习和局部偏好对齐。Prefer-DAS是一种可提示的多任务模型,整合了自训练和提示引导的对比学习。与SAM-like方法不同,Prefer-DAS允许在训练和推理阶段使用完整的、部分的甚至没有点提示,从而实现了交互式分割。与使用图像级人类偏好对齐进行分割不同,我们引入了局部直接偏好优化(LPO),为与空间变化的人类反馈对齐提供了即插即用的解决方案。为了解决潜在的反馈缺失问题,我们还引入了无监督偏好优化(UPO),它利用自学习的偏好。结果,Prefer-DAS模型能够根据点和人类偏好的可用性有效执行弱监督和无监督的DAS。在四个具有挑战性的DAS任务上的全面实验表明,我们的模型在自动和交互式分割模式中均优于SAM-like方法以及无监督和弱监督的DAS方法,突显了其强大的泛化能力和灵活性。此外,我们的模型性能非常接近或甚至超过了监督模型的性能。

英文摘要

Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures from various large-scale electron microscopy (EM) data without requiring extensive annotation in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical application. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a more realistic yet annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, Prefer-DAS allows the use of full, partial, or even no point prompts during both training and inference, and thus enables interactive segmentation. Instead of using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO), a plug-and-play solution for alignment with spatially varying human feedback. To address potentially missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly-supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks demonstrate that our model outperforms SAM-like methods as well as unsupervised and weakly-supervised DAS methods in both automatic and interactive segmentation modes, highlighting strong generalizability and flexibility. Additionally, the performance of our model is very close to, or even exceeds, that of supervised models.

2602.08556 2026-05-18 cs.SD

Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction

全局旋转等变相位建模用于语音增强的深度幅度-相位交互

Chengzhong Wang, Andong Li, Dingding Yao, Junfeng Li

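AI summary: This paper proposes a magnitude-phase dual-stream framework that enforces global rotation equivariance in the phase stream to improve phase modeling for speech enhancement. Experiments show gains over existing methods on phase retrieval, denoising, dereverberation, and bandwidth extension.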

Comments Submitted to IEEE TASLP

详情
AI中文摘要

尽管深度学习在语音增强(SE)领域取得了进展,但有效的相位建模仍具挑战性,因为传统网络通常在平坦的欧几里得特征空间中操作,难以建模相位的底层圆拓扑结构。为此,我们提出了一种幅度-相位双流框架,通过强制全局旋转等变性(GRE)特性对齐相位流的内在圆几何结构。具体而言,我们引入了基于模数的信息交换模块(MPICM)和混合注意力双馈前馈网络(HADF)瓶颈,两者均设计用于在相位流中保持GRE。在相位检索、去噪、去回声和带宽扩展任务中进行了全面评估,以验证所提方法在多个先进基线上的优越性。值得注意的是,所提架构在相位检索任务中将相位距离减少了超过20%,并在零样本跨语料库去噪评估中将PESQ提高了超过0.1。在涉及混合失真的一般语音增强任务中,整体优势也得到确立。定性分析进一步表明,学习到的相位特征表现出明显的周期性模式,与相位的内在圆性质一致。源代码可在https://github.com/wangchengzhong/GRE-Net获取。

英文摘要

While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of the phase. To address this, we propose a magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing a Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual Feed-Forward Network (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/GRE-Net.
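Global rotation equivariance means that rotating every input phase by the same angle rotates the output by that angle and changes nothing else. A minimal sketch, under the assumption that features are plain complex numbers: scaling each value by a function of magnitudes only commutes with a global rotation, while applying a nonlinearity to the real and imaginary parts separately does not. Both gate functions are illustrative and are not the paper's MPICM or HADF modules.

```python
import cmath
import math


def rotate(values, theta):
    """Apply the same phase rotation e^{i*theta} to every complex value."""
    rotation = cmath.exp(1j * theta)
    return [rotation * z for z in values]


def modulus_gate(values):
    """Scale each value by a factor that depends on magnitudes only,
    so the phase of every element is left untouched."""
    mean_magnitude = sum(abs(z) for z in values) / len(values)
    return [z * math.tanh(abs(z) / (mean_magnitude + 1e-8)) for z in values]


def naive_gate(values):
    """Treat real and imaginary parts as independent Euclidean features."""
    return [complex(math.tanh(z.real), math.tanh(z.imag)) for z in values]


def max_gap(first, second):
    """Largest element-wise distance between two complex sequences."""
    return max(abs(a - b) for a, b in zip(first, second))


signal = [1 + 2j, -0.5 + 0.3j, 2 - 1j]
theta = 0.7

equivariance_error = max_gap(
    rotate(modulus_gate(signal), theta), modulus_gate(rotate(signal, theta))
)
naive_error = max_gap(
    rotate(naive_gate(signal), theta), naive_gate(rotate(signal, theta))
)
print(equivariance_error, naive_error)
```

The first error is at floating-point noise level, while the second is not, which is the gap between a phase-aware operation and a flat Euclidean one.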

2602.06932 2026-05-18 cs.LG

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

当强化学习遇见自适应推测训练:一个统一的训练-服务系统

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu

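AI summary: This paper presents Aurora, a system that learns a speculator online via reinforcement learning from live inference traffic, addressing the deployment lag and domain drift of offline training. Experiments report a 1.5x day-0 speedup and an additional 1.25x speedup over a static speculator.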

详情
AI中文摘要

推测解码可以显著加速大语言模型服务,但目前大多数部署将推测器训练与服务分离,将其视为独立的离线建模问题。我们证明这种解耦方法引入了显著的部署和适应延迟:(1)高服务时间,因为推测器必须在部署前长时间离线训练;(2)延迟的效用反馈,因为真正的端到端解码加速只有在训练后才能知道,不能可靠地从接受率推断;(3)领域漂移退化,因为目标模型被重新用于新领域,推测器变得过时且效果下降。为了解决这些问题,我们提出了Aurora,一个统一的训练-服务系统,通过持续学习活推理轨迹直接学习推测器。Aurora将在线推测器学习重新定义为异步强化学习问题:接受的令牌提供正反馈,而被拒绝的推测器提案提供隐含的负反馈,用于提高样本效率。我们的设计集成了基于SGLang的推理服务器和异步训练服务器,使推测器更新能够热交换而不停止服务。关键的是,Aurora支持零日部署:推测器可以立即服务并快速适应实时流量,提高系统性能同时提供即时效用反馈。在实验中,Aurora在最近发布的前沿模型(如MiniMax M2.1 229B和Qwen3-Coder-Next 80B)上实现了1.5倍的零日加速。Aurora还有效适应用户流量的分布变化,在广泛使用的模型(如Qwen3和Llama3)上,相对于经过良好训练但静态的推测器,提供了额外1.25倍的加速。

英文摘要

Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
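The feedback signal described above can be illustrated with a toy labeling step: compare the speculator's draft tokens against what the target model actually produced, mark the accepted prefix as positive, and mark the first rejected proposal as negative. This is a rough sketch assuming greedy (exact-match) verification and rewards of +1 and -1; the abstract does not specify Aurora's actual reward shaping or verification rule, so both are assumptions.

```python
def label_draft_tokens(draft_tokens, target_tokens):
    """Return (token, reward) pairs for one speculation round.

    Tokens are accepted while they match the target model's output.
    The first mismatch is the rejected proposal; drafts after it are
    never verified, so they receive no label at all.
    """
    feedback = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted == verified:
            feedback.append((drafted, 1.0))   # accepted: positive feedback
        else:
            feedback.append((drafted, -1.0))  # rejected: implicit negative feedback
            break
    return feedback


draft = ["the", "cat", "sat", "down"]
target = ["the", "cat", "lay", "down"]
feedback = label_draft_tokens(draft, target)
print(feedback)
```

Each served request yields such labels as a by-product of normal decoding, which is why no separate data collection pass is needed for online training.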

2602.06470 2026-05-18 cs.CL cs.AI

Improve Large Language Model Systems with User Logs

通过用户日志改进大型语言模型系统

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

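AI summary: This paper proposes UNO, a framework that distills user logs into rules and preference pairs, handles data heterogeneity with query-and-feedback-driven clustering, and quantifies the cognitive gap between the model's knowledge and the log data to improve LLM systems.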

详情
AI中文摘要

扩大训练数据和模型参数规模长期以来推动了大型语言模型(LLMs)的发展,但这一范式日益受到高质量数据稀缺和计算成本上升导致的边际效益递减的限制。因此,近期研究更加关注从真实世界部署中持续学习,其中用户交互日志提供了丰富的真人类反馈和过程知识。然而,从用户日志学习具有挑战性,因为它们是无结构和嘈杂的。传统的LLM系统往往难以区分有用的反馈信号与嘈杂的用户行为,且用户日志收集与模型优化之间的差异(例如,非策略优化问题)进一步加剧了这一问题。为此,我们提出UNO(用户日志驱动的优化),一个统一的框架,用于通过用户日志改进LLM系统(LLMsys)。UNO首先将日志提炼为半结构化的规则和偏好对,然后利用查询和反馈驱动的聚类来管理数据异质性,最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉嘈杂的反馈并构建不同模块,以处理从用户日志中提取的初级和反思性经验,从而提升未来的响应。广泛的实验表明,UNO在效果和效率上均达到最先进的水平,显著优于检索增强生成(RAG)和基于记忆的基线方法。我们已开源代码至https://github.com/bebr2/UNO。

英文摘要

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

2602.06239 2026-05-18 cs.LG

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

可证明地避免在不了解数据分布的情况下直接偏好优化中的过度优化

Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher

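AI summary: PEPO trains preference-optimized policies on disjoint data subsets and aggregates them through a worst-case construction to avoid over-optimization. Its sample complexity depends only on a single-policy concentrability coefficient, and it retains the simplicity and practicality of DPO.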

详情
AI中文摘要

我们引入PEPO(基于悲观的偏好优化),一种单步直接偏好优化(DPO)类算法,用于缓解偏好学习中已知的过度优化问题,而无需了解数据生成分布或学习显式奖励模型。PEPO通过训练不同数据子集的偏好优化策略的集合,并通过最坏情况构造聚合,以促进模型间的共识。在表格设置中,PEPO的样本复杂度仅依赖于单策略集中性系数,从而避免了影响易产生过度优化算法(如DPO)保证的所有策略集中性。理论发现通过令人信服的实践性能得到验证,同时保留了DPO风格训练的简洁性和实用性。

英文摘要

We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm to mitigate the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by convincing practical performance, while retaining the simplicity and practicality of DPO-style training.

2602.04003 2026-05-18 cs.AI

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

当AI说服人:对抗性解释攻击对人类信任AI辅助决策的影响

Shutong Fan, Lan Zhang, Xiaoyong Yuan

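AI summary: This paper studies how adversarial explanation attacks manipulate the framing of LLM-generated explanations to influence human trust in AI outputs, revealing a new security risk at the cognitive layer.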

详情
AI中文摘要

大多数对抗性威胁针对AI模型的计算行为,而非依赖它们的人类。然而,现代AI系统越来越多地在人类决策循环中运行,用户根据模型推荐进行解释和行动。大型语言模型(LLMs)生成流畅的自然语言解释,影响用户对AI输出的认知和信任,揭示了AI与用户之间的沟通渠道这一新攻击面。我们引入对抗性解释攻击(AEAs),攻击者操控LLM生成的解释框架以调节人类对错误输出的信任。我们通过信任失调差距这一指标,正式化这一行为威胁,该指标捕捉了良性与对抗性解释之间人类信任的差异。通过这一指标,我们强调了说服性解释框架可能在AI预测错误时仍能保持用户信任的行为风险。为了表征这一威胁,我们进行了包含超过200名参与者的实验,系统地变化解释框架的四个维度:推理模式、证据类型、沟通风格和呈现格式。我们的发现显示,用户对对抗性和良性解释的信任几乎相同,对抗性解释尽管错误,却保留了大多数良性信任。最脆弱的情况出现在AEAs接近专家沟通时,结合权威证据、中性语气和领域合适的推理。脆弱性最高出现在困难任务、事实驱动领域以及受教育程度较低、年轻或高度信任AI的参与者中。

英文摘要

Most adversarial threats in artificial intelligence (AI) target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. Using this metric as a lens, we highlight a behavioral risk where persuasive explanation framing can preserve user trust even when the underlying AI prediction is wrong. To characterize this threat, we conducted a human study with over 200 participants, systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI.
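The trust miscalibration gap is described above only as the difference in human trust between benign and adversarial explanations. One plausible reading, sketched below, is a difference of mean trust ratings; the exact formula, the 1-7 rating scale, the ratings themselves, and the `retained_share` ratio are all hypothetical and serve only to show how such a gap would be computed.

```python
from statistics import mean


def trust_miscalibration_gap(benign_trust, adversarial_trust):
    """Mean trust under benign explanations minus mean trust under
    adversarial ones. A gap near zero means users trust incorrect,
    adversarially framed outputs almost as much as correct ones."""
    return mean(benign_trust) - mean(adversarial_trust)


# Hypothetical 1-7 ratings, invented for illustration only.
benign_ratings = [6, 5, 7, 6]
adversarial_ratings = [6, 5, 6, 6]

gap = trust_miscalibration_gap(benign_ratings, adversarial_ratings)
retained_share = mean(adversarial_ratings) / mean(benign_ratings)
print(f"gap = {gap:.2f}, retained share of benign trust = {retained_share:.2%}")
```

Under this reading, a small gap is the worrying outcome: it indicates that persuasive framing preserved trust even though the underlying prediction was wrong.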

2602.01970 2026-05-18 cs.AI cs.LG

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

小规模可泛化提示预测模型可引导大推理模型的高效强化学习后训练

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

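AI summary: This paper proposes GPS, which performs Bayesian inference on prompt difficulty with a lightweight generative model and combines intermediate-difficulty prioritization with history-anchored diversity, improving training and test-time efficiency in RL post-training of large reasoning models.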

详情
AI中文摘要

强化学习能增强大语言模型的推理能力,但通常因滚动优化而产生高计算成本。在线提示选择通过优先选择信息性提示来提高训练效率。然而,现有方法要么依赖昂贵的精确评估,要么构建缺乏跨提示泛化的提示特定预测模型。本研究引入可泛化的提示选择(GPS),通过轻量级生成模型对提示难度进行贝叶斯推断,利用共享优化历史训练。中间难度优先和历史锚定多样性被纳入批量获取原则中以选择信息性提示批次。小规模预测模型在测试时也具备泛化能力,以实现高效的计算分配。在各种推理基准上的实验表明,GPS在训练效率、最终性能和测试效率上显著优于更优的基线方法。

英文摘要

Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rollout-intensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models that lack generalization across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test time for efficient computational allocation. Experiments across varied reasoning benchmarks indicate GPS's substantial improvements in training efficiency, final performance, and test-time efficiency over strong baseline methods.
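Intermediate-difficulty prioritization can be illustrated in a few lines: given a predicted success probability per prompt, prefer prompts whose prediction is closest to 0.5, since prompts the model almost always solves or almost always fails tend to yield little learning signal. The sketch assumes the predictions already exist (here a hard-coded dictionary standing in for the output of a predictive model) and omits the history-anchored diversity term; the prompt names and probabilities are hypothetical.

```python
def select_intermediate_prompts(predicted_success, batch_size):
    """Rank prompts by how close their predicted success rate is to 0.5
    and return the identifiers of the top `batch_size` prompts."""
    ranked = sorted(
        predicted_success,
        key=lambda prompt_id: abs(predicted_success[prompt_id] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical predictions from a prompt difficulty model.
predictions = {
    "prompt_easy": 0.98,
    "prompt_mid_a": 0.45,
    "prompt_mid_b": 0.60,
    "prompt_hard": 0.05,
}

batch = select_intermediate_prompts(predictions, batch_size=2)
print(batch)
```

Because selection uses predictions rather than fresh rollouts, the cost of choosing a batch is decoupled from the cost of evaluating the large model on every candidate prompt.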

2602.01295 2026-05-18 cs.LG

Best-of-Both-Worlds for Heavy-Tailed Markov Decision Processes

兼顾最佳的双世界方法用于厚尾马尔可夫决策过程

Yu Chen, Yuhao Liu, Jiatai Huang, Yihan Du, Longbo Huang

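AI summary: This paper proposes the HT-FTRL-OM and HT-FTRL-UOB algorithms for heavy-tailed Markov decision processes, achieving Best-of-Both-Worlds guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding environments.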

详情
AI中文摘要

我们研究了具有厚尾损失(HTMDPs)的回合制马尔可夫决策过程。现有方法在随机环境中过于保守,在对抗性环境中缺乏适应性。本文提出HT-FTRL-OM和HT-FTRL-UOB算法,实现了最佳双世界保证:在对抗环境中获得实例无关的regret,在自界(包括随机情况)环境中获得对数实例依赖的regret。在已知转移设置中,HT-FTRL-OM应用FTRL框架在占用措施上,使用新型跳过损失估计器,获得对抗性环境中的~O(T^{1/α}) regret和随机环境中的O(log T) regret。在此框架基础上,我们开发了新的算法HT-FTRL-UOB以处理更具挑战性的未知转移设置。在轻微截断非负条件下的损失分布下,该算法使用悲观跳过损失估计器,在对抗性环境中获得~O(T^{1/α} + sqrt(T)) regret,在随机性环境中获得O(log²(T)) regret。我们的分析通过几个技术洞察克服了关键障碍,包括对厚尾移位损失的局部控制机制、新的次优质量传播原则和新的regret分解,将转移不确定性与厚尾估计误差和跳过偏差分离。

英文摘要

We investigate episodic Markov Decision Processes with heavy-tailed losses (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms HT-FTRL-OM and HT-FTRL-UOB for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding environments (which include the stochastic case). For the known-transition setting, HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a $\widetilde{O}(T^{1/\alpha})$ regret bound in adversarial regimes and an $O(\log T)$ regret in stochastic regimes. Building upon this framework, we develop a novel algorithm HT-FTRL-UOB to tackle the more challenging unknown-transition setting. Under a mild truncative nonnegativity condition on the loss distributions, this algorithm employs a pessimistic skipping loss estimator and achieves a $\widetilde{O}(T^{1/\alpha} + \sqrt{T})$ regret in adversarial regimes and an $O(\log^2 T)$ regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.

2602.01167 2026-05-18 cs.AI

Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

所有个体层都有帮助吗?视觉-语言模型中任务干扰层的实证研究

Zhiming Liu, Yujie Wei, Lei Feng, Xiu Su, Xiaobo Xia, Weili Guan, Zeke Xie, Shuo Yang

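AI summary: Using layer interventions, the study finds that some layers hinder downstream tasks, proposes Task-Adaptive Layer Knockout (TaLo) to improve performance, and reveals an unexpected form of modularity in pretrained VLMs.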

详情
AI中文摘要

当前VLM在多种多模态任务中表现出色,但默认启用所有层可能阻碍任务表现。通过干预单层参数发现,某些层反而抑制任务性能。系统研究各层对不同任务的影响,提出任务-层交互向量量化方法,并引入无需训练的测试时适应方法TaLo,动态剔除最干扰的层,提升模型在多个任务和数据集上的性能,包括提升Qwen-VL在ScienceQA地图任务上的准确率。

英文摘要

Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement generalizes across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks' performance. We introduce the Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
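The layer-intervention procedure can be sketched on a toy model: bypass one layer at a time, record the accuracy change relative to the full model, and knock out the layer whose removal helps most. The sketch below assumes a model that is just a list of Python functions and a bypass that acts as the identity; the toy layers and data are hypothetical and only mirror the shape of the procedure, not the paper's VLM experiments.

```python
def run_model(layers, value, skipped_layer=None):
    """Apply layers in order, treating the skipped layer as the identity."""
    for index, layer in enumerate(layers):
        if index == skipped_layer:
            continue
        value = layer(value)
    return value


def accuracy(layers, examples, skipped_layer=None):
    """Fraction of (input, label) pairs the model gets exactly right."""
    hits = sum(1 for x, y in examples if run_model(layers, x, skipped_layer) == y)
    return hits / len(examples)


def interaction_vector(layers, examples):
    """Accuracy change from bypassing each layer, one entry per layer."""
    base = accuracy(layers, examples)
    return [accuracy(layers, examples, i) - base for i in range(len(layers))]


def most_interfering_layer(layers, examples):
    """Index of the layer whose bypass helps most, or None if none helps."""
    deltas = interaction_vector(layers, examples)
    best = max(range(len(deltas)), key=lambda i: deltas[i])
    return best if deltas[best] > 0 else None


# Toy model: the middle layer flips the sign and hurts this particular task.
toy_layers = [lambda v: v + 1, lambda v: -v, lambda v: v * 2]
toy_task = [(x, (x + 1) * 2) for x in range(1, 5)]

deltas = interaction_vector(toy_layers, toy_task)
knocked_out = most_interfering_layer(toy_layers, toy_task)
print(deltas, knocked_out)
```

No parameters are updated anywhere in this loop, which is what makes the approach training-free: the only cost is one extra evaluation pass per candidate layer.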

2601.23068 2026-05-18 cs.LG cs.AI

ExplainerPFN: Towards tabular foundation models for model-free zero-shot feature importance estimations

ExplainerPFN:迈向无模型零样本特征重要性估计的表格基础模型

Joao Fonseca, Julia Stoyanovich

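AI summary: This paper proposes ExplainerPFN, a tabular foundation model built on TabPFN and pretrained on synthetic structural causal data, which estimates feature importance zero-shot without model access and is competitive on real and synthetic datasets.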

Comments 35 pages, 11 figures

详情
AI中文摘要

在监督分类任务中计算特征重要性对模型可解释性至关重要。Shapley值是解释模型预测的常用方法,但需要直接访问底层模型,这一假设在现实部署中常被违反。我们探讨在零样本设置下是否能仅通过输入数据分布和不评估目标模型来获得有意义的特征归因。由于多个模型可能产生相同预测但产生不同Shapley分解,数据到归因的映射并非唯一可识别。因此,我们针对“真实数据”而非“真实模型”学习后验均值归因,基于元训练先验。为此,我们引入ExplainerPFN,一种基于TabPFN的表格基础模型,预训练于合成结构因果数据,通过精确或近精确的Shapley值监督,可预测未见过的表格数据集的特征归因,而无需模型访问、梯度或示例解释。我们的贡献包括:(1)展示少量样本替代解释器在仅使用两个参考观测时可实现高SHAP保真度;(2)提出ExplainerPFN,首个无需访问底层模型或参考解释的零样本方法,提供无现有解释器可应用的归因;(3)发布开源实现,包括完整训练流程和合成数据生成器;(4)通过大量真实和合成数据集实验,展示ExplainerPFN在性能上可与依赖2-10个SHAP示例的少量样本替代解释器竞争。

英文摘要

Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require direct access to the underlying model, an assumption frequently violated in real-world deployments. We investigate whether meaningful feature attributions can be obtained in a zero-shot setting, using only the input data distribution and no evaluations of the target model. Because multiple models can produce identical predictions yet yield different Shapley decompositions, the mapping from data to attributions is not uniquely identifiable. We therefore target attributions that are "true to the data" rather than "true to the model", learning a posterior mean attribution under a meta-training prior. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN, pretrained on synthetic structural causal datasets supervised with exact or near-exact Shapley values, that predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are fourfold: (1) we show that few-shot surrogate explainers achieve high SHAP fidelity with as few as two reference observations; (2) we propose ExplainerPFN, the first zero-shot method for estimating Shapley-value-style feature attributions without access to the underlying model or reference explanations, providing a principled attribution where no existing explainer can be applied; (3) we release an open-source implementation including the full training pipeline and synthetic data generator; and (4) through extensive experiments on real and synthetic datasets, we show that ExplainerPFN achieves performance competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.
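For reference, the quantity that ExplainerPFN is trained to predict can be computed exactly when the model is available and the number of features is small. The sketch below implements the textbook Shapley formula by enumerating all feature subsets. The `toy_value` function is a hypothetical stand-in for "model output when only these features are known"; this is the classical model-based computation, not the paper's zero-shot method.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features.

    value_fn takes a frozenset of feature indices and returns a number.
    Cost grows as 2^n_features, so this is only practical for small n.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                marginal = value_fn(coalition | {i}) - value_fn(coalition)
                phi[i] += weight * marginal
    return phi


def toy_value(coalition):
    """Hypothetical value function: two additive effects plus one
    interaction between features 0 and 2."""
    value = 0.0
    if 0 in coalition:
        value += 2.0
    if 1 in coalition:
        value += 1.0
    if 0 in coalition and 2 in coalition:
        value += 4.0
    return value


phi = shapley_values(toy_value, 3)
print(phi)
```

The interaction term is split equally between features 0 and 2, and the attributions sum to the value of the full feature set minus the value of the empty set.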

2601.21798 2026-05-18 cs.CV

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

CG-MLLM:通过多模态大语言模型实现图像描述与3D内容生成

Junming Huang, Chi Wang, Letian Li, Guangkai Xu, Donglin Huang, Hao Chen, Qiang Dai, Weiwei Xu

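AI summary: This paper proposes CG-MLLM, a multimodal large language model for 3D captioning and high-resolution 3D generation. A Mixture-of-Transformer architecture decouples the different modeling needs, and a pretrained vision-language backbone is combined with a specialized 3D VAE latent space, improving both 3D generation quality and perception.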

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)已革新了文本生成和多模态感知,但其在3D内容生成方面的能力仍待探索。现有方法往往只能生成低分辨率网格或粗略结构代理,无法原生捕捉细粒度几何结构。本文提出CG-MLLM,一种新型多模态大语言模型,能够在单一框架内实现3D描述和高分辨率3D生成。通过混合Transformer架构,CG-MLLM分离了不同的建模需求,其中Token-level Autoregressive (TokenAR) Transformer处理token级内容,Block-level Autoregressive (BlockAR) Transformer处理块级内容。通过整合预训练的视觉语言骨干网络与专用3D VAE潜在空间,CG-MLLM促进了标准token与空间块之间的长上下文交互。实验结果表明,CG-MLLM在生成高保真3D对象方面显著优于现有MLLMs,有效将高分辨率3D内容创作带入主流LLM范式。此外,我们进一步发现,学习生成3D内容能够反向增强模型的基于图像的3D理解能力。

English abstract

Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. Beyond generation, we further observe that learning to produce 3D content transfers back to perception, strengthening the model's image-based 3D understanding.

2601.21636 2026-05-18 cs.LG cs.CR stat.ML

Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

Jan Schuchardt, Nikita Kalinin

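AI summary: This paper proposes sampling-free bounds, based on Rényi divergence and conditional composition, for differential privacy amplification under random allocation with matrix factorization. The bounds avoid the limitations of sampling-based methods, whose guarantees hold only with high probability or require random abstention, and apply to arbitrary banded and non-banded matrices.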

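Details
AI abstract (translated from Chinese)

We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Choquette-Choo et al. (2025) proposed a sampling-based Monte Carlo approach to compute amplification parameters, but its guarantees either hold only with high probability or require random abstention by the mechanism. Moreover, the number of samples required to ensure $(ε,δ)$-DP is inversely proportional to $δ$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former computes the bounds efficiently through a dynamic programming formulation; the latter complements it with stronger privacy guarantees, especially for small $ε$, where Rényi divergence bounds inherently overestimate. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the effectiveness of our approach on widely used matrix mechanisms.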

English abstract

We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring $(ε,δ)$-DP is inversely proportional to $δ$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small $ε$, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.
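As background for the Rényi-divergence side of the abstract, the sketch below shows the textbook conversion from a Rényi DP guarantee to $(ε,δ)$-DP for a plain Gaussian mechanism with sensitivity 1. It is not the paper's accountant: it models neither random allocation nor matrix factorization, and the function names and the grid of orders are assumptions made for illustration. The conversion step is lossy, which is one generic reason Rényi-based accounting can over-approximate $ε$.

```python
import math


def gaussian_rdp(alpha, sigma, steps=1):
    """Renyi DP of order alpha for `steps` composed Gaussian mechanisms.

    Assumes L2 sensitivity 1 and noise standard deviation sigma;
    RDP guarantees add up under composition.
    """
    return steps * alpha / (2.0 * sigma ** 2)


def rdp_to_dp(sigma, steps, delta, alphas):
    """Convert RDP to (eps, delta)-DP, taking the best order on the grid.

    Uses the standard bound eps = rdp(alpha) + log(1/delta) / (alpha - 1).
    """
    best_eps, best_alpha = math.inf, None
    for alpha in alphas:
        eps = gaussian_rdp(alpha, sigma, steps) + math.log(1.0 / delta) / (alpha - 1.0)
        if eps < best_eps:
            best_eps, best_alpha = eps, alpha
    return best_eps, best_alpha


# Hypothetical grid of Renyi orders: 1.1, 1.2, ..., 64.0
alphas = [1.0 + 0.1 * k for k in range(1, 631)]

eps_one_step, alpha_one_step = rdp_to_dp(sigma=1.0, steps=1, delta=1e-5, alphas=alphas)
eps_ten_steps, _ = rdp_to_dp(sigma=1.0, steps=10, delta=1e-5, alphas=alphas)
```

For a single step with `sigma=1.0`, the continuous optimum of this bound is `0.5 + sqrt(2 * log(1/delta))`, so the grid result should sit just above that value. Composing more steps increases the reported `eps`.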