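arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems science, and more.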

2506.08809 2026-06-05 cs.CV eess.IV

Training-Free Inference for High-Resolution Sinogram Completion

Jiaze E, Srutarshi Banerjee, Tekin Bicer, Guannan Wang, Yanfu Zhang, Bin Ren

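Institutions: William & Mary; Argonne National Laboratory; University of Chicago

AI summary: This paper proposes HRSino, a training-free, efficient diffusion inference method for high-resolution sinogram completion that adaptively allocates inference effort to improve computational efficiency and completion accuracy.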

Abstract

High-resolution sinogram completion is critical for computed tomography reconstruction, as missing projections can introduce severe artifacts. While diffusion models provide strong generative priors for this task, their inference cost grows prohibitively with resolution. We propose HRSino, a training-free and efficient diffusion inference approach for high-resolution sinogram completion. By explicitly accounting for spatial heterogeneity in signal characteristics, such as spectral sparsity and local complexity, HRSino allocates inference effort adaptively across spatial regions and resolutions, rather than applying uniform high-resolution diffusion steps. This enables global consistency to be captured at coarse scales while refining local details only where necessary. Experimental results show that HRSino reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared to the state-of-the-art framework, and maintains completion accuracy across datasets and resolutions.
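
As a rough illustration of allocating effort by local complexity (this is not the HRSino algorithm, whose details are not given in the abstract), the sketch below scores fixed-size tiles of a 2D array by variance and flags only the highest-scoring tiles for fine-scale refinement. The tile size, the variance score, and the `keep_fraction` parameter are all assumptions made for illustration.

```python
from statistics import pvariance

def patch_scores(image, patch):
    """Return {(row, col): variance} for non-overlapping patch x patch tiles."""
    scores = {}
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            values = [image[i][j]
                      for i in range(r, min(r + patch, len(image)))
                      for j in range(c, min(c + patch, len(image[0])))]
            scores[(r, c)] = pvariance(values)
    return scores

def select_patches_to_refine(image, patch=2, keep_fraction=0.25):
    """Pick the top `keep_fraction` of tiles by variance (hypothetical criterion)."""
    scores = patch_scores(image, patch)
    n_keep = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:n_keep])

# A mostly flat 4x4 toy "sinogram" with one busy corner.
sinogram = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 9],
    [0, 0, 7, 3],
]
refine = select_patches_to_refine(sinogram)  # only the busy tile is selected
```

A coarse pass would cover the whole array, and only the tiles in `refine` would receive the expensive high-resolution steps.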

2605.20628 2026-06-05 cs.CL

Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation

Sylvey Lin, Joe Menke, Shufan Ming, Dongin Nam, Neil Smalheiser, Halil Kilicoglu

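Institutions: University of Washington

AI summary: This paper proposes the DPR-BAG framework for generating coherent, factually accurate abstracts for biomedical articles that have full text but no abstract. The framework decomposes full-text documents into structured rhetorical facets, summarizes them in parallel with an LLM, and applies a final refinement stage to restore global discourse coherence.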

Comments: Accepted by BioNLP 2026

Abstract

Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.
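
The divide, prompt, and refine stages described above can be sketched as a small pipeline. This is a minimal sketch, not the authors' code: `call_llm` is a hypothetical stand-in that simply echoes text, the prompt wording is invented, and the sketch assumes the full text has already been split by BOMRC facet.

```python
from concurrent.futures import ThreadPoolExecutor

FACETS = ["Background", "Objective", "Methods", "Results", "Conclusions"]

def call_llm(prompt):
    """Hypothetical stand-in for an LLM call; echoes the last prompt line."""
    return prompt.splitlines()[-1][:200]

def summarize_facet(facet, text):
    prompt = f"Summarize the {facet} of this article in one sentence:\n{text}"
    return facet, call_llm(prompt)

def generate_abstract(sections):
    """`sections` maps each BOMRC facet to the passage assigned to it."""
    # Divide + Prompt: one independent request per facet, run in parallel.
    with ThreadPoolExecutor() as pool:
        drafts = dict(pool.map(lambda f: summarize_facet(f, sections[f]), FACETS))
    # Refine: a final pass over the ordered drafts to restore coherence.
    ordered = " ".join(drafts[f] for f in FACETS)
    return call_llm(f"Rewrite as one coherent abstract:\n{ordered}")

sections = {facet: f"{facet} text." for facet in FACETS}
abstract = generate_abstract(sections)
```

Because each facet request is independent, the parallel stage scales with the number of facets, while the single refinement call is the only step that sees the whole draft.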

2605.20119 2026-06-05 cs.LG cs.AI

Toto 2.0: Time Series Forecasting Enters the Scaling Era

Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah, Marc Cenac, Guillaume Jarry, Enguerrand Paquin, Xunyi Zhao, Viktoriya Zhukov, Othmane Abou-Amal, Chenghao Liu, Ameet Talwalkar, David Asker

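Institutions: Datadog AI Research; Carnegie Mellon University

AI summary: This paper presents the Toto 2.0 model family, which achieves reliable gains in forecast quality from a single training recipe across 4 million to 2.5 billion parameters and sets a new state of the art on three benchmarks.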

2605.19839 2026-06-05 cs.CV

When Preference Labels Fall Short: Aligning Diffusion Models from Real Data

Weiyan Chen, Weijian Deng, Yao Xiao, Weijie Tu, ZiYi Dong, Ibrahim Radwan, Liang Lin, Pengxu Wei

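Institutions: University of Science and Technology of China

AI summary: This paper studies real data as an alternative source of supervision for preference alignment. Taking a data-driven approach, it uses real images as reference points and contrasts them with generated or perturbed samples to construct preference signals without manually annotated preference pairs. Experiments show that real-data supervision aligns diffusion models effectively and reaches performance comparable to existing preference-based methods.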

Comments: ICML 2026 Camera Ready; Project Page: https://cwyxx.github.io/RealAlign

Abstract

Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at https://cwyxx.github.io/RealAlign.

2510.00054 2026-06-05 cs.CV cs.AI

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng

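Institutions: University of Science and Technology of China

AI summary: This paper proposes the HiDe framework, which uses hierarchical decoupling to address the visual understanding problems caused by background interference in high-resolution images, improving the performance of multimodal large language models on high-resolution image tasks.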

Comments: Accepted by ICML2026

Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints, arguing that MLLMs struggle to recognize small objects, and therefore use "zoom in" strategies to obtain better detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://tennine2077.github.io/HiDe.github.io/.

2605.19309 2026-06-05 cs.CL

How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

Yue Chen, Yihao Wang, Ziyi Tang, Yongsen Zheng, Keze Wang

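Institutions: Sun Yat-sen University; Nanyang Technological University

AI summary: This paper proposes the ProSA framework, which audits structural vulnerability in document layout analysis (DLA) pipelines by decoupling controlled probing, policy-driven targeting, and structure-aware diagnosis. It finds that the Block-level Structural Loss Rate (B-SLR) reflects OCR instability better than affected area does, and that structural probes cause larger downstream QA/retrieval degradation.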

Comments: 18 pages, 5 figures, preprint

Abstract

Document Layout Analysis (DLA) pipelines provide structured page representations for retrieval-augmented generation, long-document question answering, and other document intelligence systems, yet their robustness evaluation remains largely area-centric. We identify this Footprint Bias and propose ProSA, a lightweight output-level auditing framework that decouples controlled probing, policy-driven targeting, and structure-aware diagnosis. ProSA combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where structural identity is lost, at what exposure granularity failures emerge, and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area weakly tracks perturbation-induced OCR instability (R^2=0.384/0.110), whereas B-SLR aligns much more closely with it (R^2=0.727/0.916). Exposure descriptors further separate occlusion- and topology-dominant pathways, while matched-footprint structural probes cause much larger downstream QA/retrieval degradation compared to area-matched erasure. These results shift DLA robustness evaluation from footprint-based stress testing toward structure-aware vulnerability auditing.

2405.05097 2026-06-05 cs.LG stat.ML

Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional propagation of values and densities

Jarek Duda

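Institutions: arXiv.org

AI summary: This paper proposes biology-inspired joint distribution neurons that use Hierarchical Correlation Reconstruction to propagate values and densities in multiple directions, addressing shortcomings of conventional artificial neurons in learning, flexibility, and robustness.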

Comments: 12 pages, 17 figures

Abstract

Recently, a million biological neurons (BNN) turned out to be better than modern RL methods at playing Pong \cite{RL}, a reminder that biological neurons are still qualitatively superior in, for example, learning, flexibility, and robustness. This suggests trying to improve current artificial neurons, such as those in MLP/KAN, for better agreement with biological ones. We propose an extension of the KAN approach to neurons containing a model of a local joint distribution, $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ for $\mathbf{x} \in [0,1]^d$, which adds interpretation and information-flow control to KAN and allows three missing basic properties of biological neurons to be added gradually: 1) Biological axons propagate in both directions \cite{axon}, whereas current artificial neurons focus on unidirectional propagation; joint distribution neurons can repair this by substituting some variables to obtain conditional values or distributions for the remaining ones. 2) Animals show risk avoidance \cite{risk}, which requires processing variance, and the real world generally calls for probabilistic models; the proposed neurons can also predict and propagate distributions as vectors of moments: (expected value, variance) or higher. 3) Biological neurons require local training; besides backpropagation, the proposed approach allows many additional methods, such as direct training, tensor decomposition, or the local and promising information bottleneck. The proposed approach is very general and can also be used as an extension of softmax in embeddings of, e.g., transformer, JEPA, or Mamba, suggesting the interpretation that features are mixed moments of the joint density of real-world properties.
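
The density model $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ can be made concrete with a small sketch. It assumes, beyond what the abstract states, that the $f_\mathbf{j}$ are products of orthonormal shifted Legendre polynomials on $[0,1]$ and that each coefficient is estimated as the sample mean of its basis function, which is valid for an orthonormal basis. The degree cutoff and the toy data are illustrative only.

```python
from itertools import product
from math import sqrt, prod

# Orthonormal (shifted Legendre) polynomials on [0, 1], degrees 0..2.
BASIS_1D = [
    lambda x: 1.0,
    lambda x: sqrt(3) * (2 * x - 1),
    lambda x: sqrt(5) * (6 * x * x - 6 * x + 1),
]

def f(j, x):
    """Product basis f_j(x) = prod_i f_{j_i}(x_i) for a multi-index j."""
    return prod(BASIS_1D[ji](xi) for ji, xi in zip(j, x))

def fit_coefficients(samples, degree=2):
    """Estimate each a_j as the sample mean of f_j(x) over the data."""
    d = len(samples[0])
    index_set = product(range(degree + 1), repeat=d)  # the set B
    return {j: sum(f(j, x) for x in samples) / len(samples) for j in index_set}

def density(coeffs, x):
    """rho(x) = sum over j in B of a_j f_j(x); can dip below 0 when truncated."""
    return sum(a * f(j, x) for j, a in coeffs.items())

# Toy data in [0, 1]^2 where both coordinates rise together.
samples = [(0.1, 0.2), (0.3, 0.25), (0.5, 0.55), (0.7, 0.8), (0.9, 0.85)]
coeffs = fit_coefficients(samples)
```

Here `coeffs[(0, 0)]` is the normalization term and `coeffs[(1, 1)]` behaves like a correlation between the two coordinates. Fixing one coordinate in `density` and renormalizing over the other would give a conditional density, which is one way to read the bidirectional propagation described in point 1.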

2605.17249 2026-06-05 cs.RO

SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation

Jingzhi Huang, Junkai Huang, Wenxuan Song, Haoyang Yang, Hailong Huang, Haoang Li, Yi Wang

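Institutions: Hong Kong Polytechnic University; Institute of Automation, Chinese Academy of Sciences; Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes SEDualVLN, a spatially enhanced dual-system framework that addresses long-horizon navigation and dynamic reasoning in vision-language navigation, with the two systems cooperating to navigate efficiently.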

Abstract

Vision-Language Navigation (VLN) approaches currently follow two primary paradigms: end-to-end Vision-Language Model (VLM) policies fine-tuned on navigation trajectories to directly predict actions, and zero-shot modular pipelines that integrate pre-trained Multimodal Large Language Models (MLLMs) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and also require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging top-down views of the real-time 3D map alongside streams of rendered path images. Both systems leverage different forms of spatial enhancement to cultivate the agent's sense of direction in VLN tasks. Ultimately, they cooperate to complete the navigation task through a fast-slow coordinated approach. SEDualVLN achieves state-of-the-art performance on VLN-CE benchmarks, and further ablation studies demonstrate the effectiveness of each system and module.

2605.16716 2026-06-05 cs.CV cs.AI

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat

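Institutions: Santa Clara University

AI summary: Proposes MAVEN, a multi-agent prompt refinement framework that decomposes prompts into person, action, and location dimensions, processed in parallel or sequentially, to improve cultural fidelity in mono-cultural and cross-cultural text-to-video generation, and builds a benchmark of 243 cultural prompts and 972 videos for evaluation.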

Comments: 14 pages, 6 figures, 11 tables, appendix included. Preprint

Abstract

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/MAVEN.

2604.00555 2026-06-05 cs.AI cs.CL cs.SE

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

Thanh Luong Tuan, Abhijit Sanyal

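Institutions: Golden Gate University, San Francisco Foundation; AgenticOS (FAOS); Associate Director, Data, Digital & IT Novartis Healthcare Pvt. Ltd.; Novartis Healthcare Pvt. Ltd., Hyderabad, India

AI summary: This paper proposes a neurosymbolic architecture that uses ontology-constrained neural reasoning to address the limits enterprise LLMs face from hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level, and shows that the architecture significantly improves agents' Metric Accuracy and Role Consistency.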

Comments: 24 pages, 6 tables, 6 figures, 1 algorithm, 65 references. Replication study: 1,800 runs (600 per model) across 5 regulated industries (3 English, 2 Vietnamese) and 3 LLMs (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B). v3 changes: deep-review trim from 34pp. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-3

Abstract

Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. We introduce a three-layer ontological framework--Role, Domain, and Interaction ontologies--grounding LLM-based enterprise agents. We formalize asymmetric neurosymbolic coupling: current enterprise systems constrain agent inputs (context assembly, tool discovery, governance thresholds) but not outputs, and we propose mechanisms extending this coupling to output-side validation (response checking, reasoning verification, compliance enforcement). A controlled experiment (1,800 runs across five industries and three LLMs: Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B) finds ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001) and Role Consistency (p < .001) across all three models with large effect sizes (Kendall's W = .46-.64). Improvements are greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains, where ontology lift is 2x that of English domains. Contributions: (1) a formal three-layer enterprise ontology model; (2) a taxonomy of neurosymbolic coupling patterns; (3) ontology-constrained tool discovery via SQL-pushdown scoring; (4) a proposed framework for output-side ontological validation; (5) empirical evidence for the inverse parametric knowledge effect--ontological grounding value is inversely proportional to LLM training-data coverage of the domain; (6) cross-model replication establishing model-independence; (7) a production system serving 22 industry verticals with 650+ agents.
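
For readers unfamiliar with the effect size reported above, Kendall's W measures agreement among several rankings of the same items, from 0 (no agreement) to 1 (identical rankings). The sketch below implements the standard formula without tie correction; how the paper ranked its conditions and whether it corrected for ties are not stated in the abstract, so the example data are purely hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's W for m rankings of the same n items (no tie correction).

    rankings[k][i] is the rank (1..n) that rater or block k gives item i.
    """
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three blocks that rank three conditions identically: full agreement.
perfect = kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
# Two blocks with exactly opposite rankings: no agreement.
mixed = kendalls_w([[1, 2, 3], [3, 2, 1]])
```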

2511.20577 2026-06-05 cs.LG

MSTN: A Lightweight and Fast Model for General TimeSeries Analysis

Sumit S Shevtekar, Chandresh K Maurya

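Institutions: Department of Computer Science and Engineering, Indian Institute of Technology Indore

AI summary: This paper proposes MSTN, a lightweight and fast model whose multi-scale temporal network architecture, combined with an Early Temporal Aggregation principle, handles non-stationarity, nonlinear dynamics, and multi-scale behavior in time series, modeling flexibly across temporal scales while remaining lightweight and low-latency.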

Comments: 30 pages, published in Transactions on Machine Learning Research (TMLR)

Abstract

Real-world time series often exhibit strong non-stationarity, complex nonlinear dynamics, and behavior expressed across multiple temporal scales, from rapid local fluctuations to slow-evolving long-range trends. However, many contemporary architectures impose rigid, fixed-scale structural priors, such as patch-based tokenization, predefined receptive fields, or frozen backbone encoders, which can over-regularize temporal dynamics and limit adaptability to abrupt high-magnitude events. To handle this, we introduce the Multi-scale Temporal Network (MSTN), a hybrid neural architecture grounded in an Early Temporal Aggregation (ETA) principle. MSTN integrates three complementary components: (i) a multi-scale convolutional encoder that captures fine-grained local structure; (ii) a sequence modeling module that learns long-range dependencies through either recurrent or attention-based mechanisms; and (iii) a self-gated fusion stage incorporating squeeze-excitation and a single dense layer to dynamically reweight and fuse multi-scale representations. ETA ensures downstream modules operate in O(1) time, while the encoder retains O(L^2) (Transformer) or O(L) (BiLSTM). This design enables MSTN to flexibly model temporal patterns spanning milliseconds to extended horizons, while avoiding the computational burden typically associated with long-context models. Across extensive benchmarks covering imputation, long-term forecasting, classification, and cross-dataset generalization, MSTN achieves state-of-the-art performance, establishing new best results on 21 of 27 datasets, while remaining lightweight (~0.40M params for MSTN-BiLSTM and ~1.06M for MSTN-Transformer) and suitable for low-latency inference (<1 sec, often in milliseconds) and resource-constrained deployment.

2605.16138 2026-06-05 cs.LG cs.AI hep-ex

Surrogate Neural Architecture Codesign Package (SNAC-Pack)

Jason Weitz, Dmitri Demler, Benjamin Hawks, Aaron Wang, Nhan Tran, Javier Duarte

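Institutions: University of California San Diego; ETH Zurich; Fermi National Accelerator Laboratory; University of Illinois Chicago

AI summary: This paper proposes SNAC-Pack, a hardware-aware automated machine learning framework for neural architecture codesign and end-to-end FPGA deployment. It uses multi-objective global search and a hardware surrogate model to avoid synthesis cost, compresses models with quantization-aware training and iterative magnitude pruning, and deploys the result to FPGA.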

Comments: 15 Pages, 3 Figures, AutoML (International Conference on Automated Machine Learning) 2026

Abstract

Neural architecture search (NAS) is a powerful approach for automating model design, but existing methods often optimize for accuracy alone or rely on proxy metrics such as bit operations (BOPs) that correlate poorly with hardware cost. This gap is particularly large for FPGA deployment, where cost is dominated by a multi-dimensional budget of lookup tables, DSPs, flip-flops, BRAM, and latency. We present the Surrogate Neural Architecture Codesign Package (SNAC-Pack), an open-source AutoML framework for hardware-aware neural architecture codesign and end-to-end FPGA deployment. SNAC-Pack runs a multi-objective global search with Optuna and NSGA-II, loading trials to a shared SQLite store that enables parallel workers across compute nodes. A hardware surrogate model outputs per-trial resource and latency estimates, avoiding the synthesis cost that would otherwise dominate the search loop. A local search stage then applies quantization-aware training (QAT) together with iterative magnitude pruning in a combined compression loop, after which the final model is synthesized to FPGA firmware via the hls4ml Python library. A YAML configuration and an optional agentic frontend let users run the pipeline on new datasets without modifying the framework. We demonstrate SNAC-Pack on jet classification at the Large Hadron Collider and superconducting qubit readout, discovering compact architectures that match or exceed strong baselines on the task metric while reducing FPGA resource utilization and, in the qubit readout case, reducing the design space exploration process from months of manual fine-tuning to hours of automated search.
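
The idea behind the multi-objective search described above is that no single trial is "best": the search keeps the trials that are not beaten on every objective at once. The sketch below shows that non-dominated (Pareto) filtering in plain Python. It is not SNAC-Pack code and does not use Optuna; the architecture names, the objective tuple, and all numbers are hypothetical surrogate estimates.

```python
def dominates(a, b):
    """True if trial a is no worse than b on every objective and better on one.

    Objectives are tuples to minimize, e.g. (error, LUT count, latency).
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(trials):
    """Return the names of trials not dominated by any other trial."""
    return [name for name, obj in trials.items()
            if not any(dominates(other, obj)
                       for other_name, other in trials.items()
                       if other_name != name)]

# Hypothetical surrogate estimates: (validation error, LUT count, latency in ns).
trials = {
    "arch_a": (0.10, 5000, 120),
    "arch_b": (0.12, 3000, 100),
    "arch_c": (0.13, 5500, 130),  # worse than arch_a on all three objectives
    "arch_d": (0.09, 9000, 200),
}
front = pareto_front(trials)
```

Only `arch_c` is discarded here. The other three represent different trade-offs between accuracy and hardware cost, which is the kind of set a later local search or a human designer would choose from.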

2509.10825 2026-06-05 cs.LG cs.AI stat.ML

CUBE: Contrastive Understanding by Balanced Experiments

Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh

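Institutions: Department of Computer Engineering, Gachon University

AI summary: This paper proposes the CUBE framework, which explains trained predictive models through balanced low-high probes, revealing the models' main effects and interactions, and validates it on synthetic and real-world tabular tasks.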

2605.15454 2026-06-05 cs.CL cs.LG stat.ML

Reasoning Models Don't Just Think Longer, They Move Differently

Anders Gjølbye, Lars Kai Hansen, Sanmi Koyejo

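Institutions: Technical University of Denmark; Stanford University

AI summary: This paper studies how hidden-state trajectories differ in reasoning-trained models during chain-of-thought generation and finds that, after length correction, the coupling between difficulty and trajectory geometry varies markedly across domains; in the code domain in particular, reasoning-trained models show more direct trajectories and more consistent local curvature.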

Comments: Preprint

Abstract

Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.
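
Residualizing a statistic on length, as described above, means fitting the statistic as a function of generation length and keeping only what the fit cannot explain. The sketch below uses an ordinary least-squares line for this. The linear form, the variable names, and the toy numbers are assumptions, since the abstract does not specify the regression the authors used.

```python
def residualize(values, lengths):
    """Remove the linear effect of `lengths` from `values` via least squares."""
    n = len(values)
    mean_x = sum(lengths) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(lengths, values)]

# Hypothetical per-generation path statistic that grows with token count.
lengths = [100, 200, 300, 400]
path_stat = [12.0, 21.0, 33.0, 40.0]
corrected = residualize(path_stat, lengths)
```

The values in `corrected` average to zero and are uncorrelated with `lengths`, so any remaining relationship with problem difficulty cannot be a mechanical side effect of longer generations.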

2605.13075 2026-06-05 cs.CL cs.AI

Scaling few-shot spoken word classification with generative meta-continual learning

Louise Beyers, Batsirayi Mupamhi Ziki, Ruan van der Merwe

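Institutions: University of Cape Town

AI summary: This paper investigates few-shot spoken word classification across 1000 classes with only five samples per class using the Generative Meta-Continual Learning (GeMCL) algorithm, and shows its advantages in performance stability and adaptation speed.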

Abstract

Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model nor a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained on less than half of the data for two orders of magnitude less time.
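
To show why adding classes one at a time can be cheap, the sketch below uses a nearest-class-mean classifier over fixed embeddings. This is a much simpler stand-in for GeMCL, not the algorithm itself: the 2D "embeddings", the labels, and the squared-distance rule are all assumptions for illustration.

```python
class PrototypeClassifier:
    """Nearest-class-mean classifier that accepts new classes one at a time."""

    def __init__(self):
        self.prototypes = {}

    def add_class(self, label, shots):
        """Store the mean of the few-shot embeddings; earlier classes are untouched."""
        dim = len(shots[0])
        self.prototypes[label] = [sum(s[k] for s in shots) / len(shots)
                                  for k in range(dim)]

    def predict(self, embedding):
        def sq_dist(proto):
            return sum((e - p) ** 2 for e, p in zip(embedding, proto))
        return min(self.prototypes, key=lambda lbl: sq_dist(self.prototypes[lbl]))

clf = PrototypeClassifier()
# Five shots per class, mirroring the five-shot setting in the abstract.
clf.add_class("yes", [(0.9, 0.1), (1.0, 0.0), (0.8, 0.2), (1.1, 0.1), (0.9, 0.0)])
clf.add_class("no", [(0.1, 0.9), (0.0, 1.0), (0.2, 0.8), (0.1, 1.1), (0.0, 0.9)])
prediction = clf.predict((0.85, 0.15))
```

Adding a class only computes one mean and never revisits earlier classes, so there is no retraining step in which previously learned classes could be forgotten.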

2605.14028 2026-06-05 cs.CV

Unified Pix Token And Word Token Generative Language Model

Haun Leung, ZiNan Wang

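Institutions: Buaa.edu.cn (Beihang University)

AI summary: This paper proposes a generative language model that unifies pixel tokens and word tokens, introducing unsupervised image pretraining, color folding, and global conditional attention approximation to improve recognition of image details; experiments show the model performs well even with a small model and limited data.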

Comments: 13 pages, 6 figures

AI中文摘要

自从视觉Transformer（ViT）出现以来，它已被广泛应用于生成语言模型和生成视觉模型中。尤其是在当前最先进的开源多模态模型中，通过CLIP或SigLIP方法获得的ViT被用作视觉编码器的骨干网络，帮助它们获得视觉理解能力。但这种方法在细节视觉理解上存在局限，例如在图像中难以识别小文本或数字。为了解决这些问题，我们提出了一种新的模型，将像素标记和词标记统一到生成语言模型中。该新模型还具有每个图像像素都有其自己的标记嵌入、颜色折叠、全局条件注意力近似和图像无监督预训练等特性。我们使用我们的新模型进行了图像无监督预训练实验，以探索其潜力。实验结果表明，即使在小模型和有限训练数据下，其性能也很好。我们相信我们的模型也符合扩展定律，只要模型参数和训练数据增加，其性能将继续提高。

英文摘要

Since the emergence of the Vision Transformer (ViT), it has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT obtained with the CLIP or SigLIP method serves as the vision encoder backbone that provides visual understanding capabilities. But this approach limits visual understanding of details, for example making it difficult to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a generative language model. The new model also features a separate token embedding for each image pixel, color folding, global conditional attention approximation, and unsupervised image pretraining. We conducted unsupervised image pretraining experiments with our new model to explore its potential. The experimental results show that it performs well even with a small model and limited training data. We believe our model also conforms to the scaling law: as model parameters and training data increase, its performance will continue to improve.

2604.20329 2026-06-05 cs.CV cs.AI

Image Generators are Generalist Vision Learners

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut

Affiliations: Google

AI summary: This paper studies image generators as general-purpose learners for visual understanding. By introducing the Vision Banana model, it shows that image generation training, like language model pretraining, lets a model reach state-of-the-art performance on a range of vision tasks, supporting a central role for image generation pretraining in building foundation vision models.

Comments: Project Page: http://vision-banana.github.io

Abstract

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

2605.13830 2026-06-05 cs.AI cs.LG

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

Ajinkya Naik, Chaitanya Garg, S. Akshay, Ashutosh Gupta, Kuldeep S. Meel

Affiliations: Indian Institute of Technology Bombay; University of Toronto

AI summary: This paper proposes a method for quantifying sensitivity in tree ensembles. It discretizes the input space and enumerates the regions susceptible to sensitivity, using an algebraic decision diagram (ADD) encoding that is split into subproblems for efficient computation. Experiments show that the proposed tool, XCount, is faster and more scalable than other approaches.

Abstract

Decision tree ensembles (DTEs) are a popular model for a wide range of AI classification tasks and are used in multiple safety-critical domains; hence verifying properties of these models has been an active topic of study over the last decade. One such verification question is the problem of sensitivity, which asks, given a DTE, whether a small change in a subset of features can lead to misclassification of the input. In this work, our focus is to build a quantitative notion of sensitivity, tailored to DTEs, by discretizing the input space of the model and enumerating the regions that are susceptible to sensitivity. We propose a novel algorithmic technique that can perform this computation efficiently, within a certified error and confidence bound. Our approach is based on encoding the problem as an algebraic decision diagram (ADD) and further splitting it into subproblems that can be solved efficiently, making the computation compositional and scalable. We evaluate the performance of our technique over benchmarks of varying size in terms of number of trees and depth, comparing it against the performance of model counters over the same problem encoding. Experimental results show that our tool XCount achieves significant speedup over other approaches and scales well with increasing ensemble size.
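
To make the quantitative notion of sensitivity concrete, the sketch below counts, by brute force, the fraction of points on a discretized grid whose predicted label flips when one feature is perturbed by a bounded amount. It is only an illustration under stated assumptions: the three toy trees, the majority vote, the grid, and the `delta` bound are all hypothetical, and the exhaustive loop stands in for (and is far less scalable than) the ADD-based, compositional counting that the abstract describes.

```python
from itertools import product

# A toy "ensemble" (assumption): each tree maps a feature tuple to a vote in {0, 1}.
def tree_a(x):
    return 1 if x[0] > 2 else 0

def tree_b(x):
    return 1 if x[1] > 1 else 0

def tree_c(x):
    return 1 if x[0] + x[1] > 4 else 0

def ensemble(x, trees=(tree_a, tree_b, tree_c)):
    """Majority vote over the trees."""
    votes = sum(tree(x) for tree in trees)
    return 1 if 2 * votes > len(trees) else 0

def sensitive_fraction(grid, feature, delta):
    """Fraction of grid points whose label flips when `feature` moves by at most `delta`."""
    points = list(product(grid, repeat=2))
    sensitive = 0
    for x in points:
        base_label = ensemble(x)
        for value in grid:
            if value != x[feature] and abs(value - x[feature]) <= delta:
                perturbed = list(x)
                perturbed[feature] = value
                if ensemble(tuple(perturbed)) != base_label:
                    sensitive += 1
                    break  # one flipping perturbation is enough for this point
    return sensitive / len(points)

grid = [0, 1, 2, 3, 4]
frac = sensitive_fraction(grid, feature=0, delta=1)
```

The cost of this enumeration grows exponentially with the number of features, which is the scaling problem that motivates a symbolic encoding with certified error and confidence bounds.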

2605.05367 2026-06-05 cs.CV cs.AI

Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

Eyad Alghamdi, Sattam Altuuaim, Obay Ghulam, Abdulrahman Qutah, Yousef Basoodan

Affiliations: University of Jeddah; King Abdullah University of Science and Technology

AI summary: This paper proposes Tamaththul3D, which aligns hand and body estimates through geometric inverse kinematics on the forearm chain followed by 2D-supervised shoulder refinement. It achieves high-fidelity 3D avatar reconstruction for Arabic sign language and generalizes across five typologically distinct sign language datasets.

Abstract

Existing 3D sign language avatar reconstruction methods are developed and evaluated exclusively on Western sign languages, and no 3D parametric annotations exist for any Arabic Sign Language dataset, a gap that blocks the development of avatar-based accessibility applications for the Arab Deaf community. We release the first SMPL-X parametric annotations for the Ishara-500 Saudi Sign Language dataset, enabling quantitative evaluation and downstream sign language generation for Arabic Sign Language. We introduce Tamaththul3D, a reconstruction pipeline that aligns hand and body estimates through geometric inverse kinematics on the forearm chain followed by 2D-supervised shoulder refinement. The closed-form integration is decoupled from the specific choice of body and hand estimators: any SMPL-X-compatible body estimator and any MANO-compatible hand estimator can be substituted, as we demonstrate by swapping each module independently. Tamaththul3D achieves up to 32% lower hand error than prior methods, runs 32x faster than the strongest baseline, and generalizes across five typologically distinct sign languages without dataset-specific adaptation.

2605.12376 2026-06-05 cs.AI

ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

Wei Liu, Yang Gu, Xi Yan, Zihan Nan, Beicheng Xu, Keyao Ding, Bin Cui, Wentao Zhang

Affiliations: School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University; Academy for Advanced Interdisciplinary Studies, Peking University; Institute of Computing Technology, Chinese Academy of Sciences; School of Software & Microelectronics, Peking University; School of CS & Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Peking University; Center for Machine Learning Research, Peking University

AI summary: This study proposes ProfiliTable, an agentic workflow framework built on dynamic profiling that addresses ambiguous instructions, complex task structures, and the lack of structured feedback in tabular data processing. It achieves closed-loop refinement through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement, and experiments show that it outperforms existing baselines in complex multi-step scenarios.

Abstract

Table processing, including cleaning, transformation, augmentation, and matching, is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi-agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct-style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task-aware code, and (iii) an Evaluator-Summarizer loop that injects execution scores and diagnostic insights to enable closed-loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.

2504.10063 2026-06-05 cs.CL cs.AI math.AT

Hallucination Detection in LLMs with Topological Divergence on Attention Graphs

Alexandra Bazarova, Andrei Volodichev, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev

Affiliations: Applied AI Institute; SB AI Lab; HSE University; CNRS, Universite Paris Cite

AI summary: This paper proposes TOHA, which detects hallucinations in LLMs by analyzing the topological structure of attention matrices. Experiments show strong results on several benchmarks with low requirements for annotated data and compute.

Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract

Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments - including evaluation on question answering and summarization tasks - show that our approach achieves state-of-the-art or competitive results on several benchmarks while requiring minimal annotated data and computational resources. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

2605.09989 2026-06-05 cs.RO cs.CV

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

Evans Han, Yunfan Jiang, Yingke Wang, Haoyue Xiao, Huang Huang, Jianwen Xie, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

Affiliations: Stanford University; Northwestern University; Lambda, Inc

AI summary: This work proposes StereoPolicy, a framework that uses stereo vision to improve robotic manipulation policies. Synchronized stereo image pairs strengthen geometric reasoning without building explicit 3D representations, and the method outperforms RGB, RGB-D, point cloud, and other baselines across multiple simulated and real-robot tasks.

Abstract

Recent advances in robot imitation learning have produced powerful visuomotor policies that manipulate diverse objects from visual inputs. However, monocular observations lack depth information, which is critical for precise manipulation in cluttered or geometrically complex scenes. Explicit depth maps and point clouds are often noisy and fragile in real-world manipulation. We introduce StereoPolicy, a visuomotor policy learning framework that directly leverages synchronized stereo image pairs to improve geometric reasoning without constructing explicit 3D representations. StereoPolicy processes each image with pretrained 2D vision encoders and fuses left-right features through a cross-attention-based Stereo Transformer, capturing spatial correspondence and disparity cues implicitly. The framework integrates with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks and seven real-robot tabletop and bimanual mobile manipulation tasks. Our results show that stereo vision bridges 2D pretrained representations and 3D geometric understanding for robotic manipulation.
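
The left-right fusion step can be pictured with a bare scaled dot-product cross-attention, where tokens from the left image query tokens from the right image. This is a sketch only: it omits the learned query/key/value projections, multiple heads, and everything else a real Stereo Transformer would have, and the token values and the names `cross_attention`, `left_tokens`, and `right_tokens` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query token attends over all key tokens and returns a weighted mix of values."""
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Hypothetical 2-D feature tokens from the left and right encoders.
left_tokens = [[1.0, 0.0], [0.0, 1.0]]
right_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Left tokens query the right view; the output has one fused vector per left token.
fused_left = cross_attention(left_tokens, right_tokens, right_tokens)
```

Because each left token pulls most strongly from the right tokens it resembles, correspondence between the two views is encoded in the attention weights rather than in an explicit disparity map.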

2605.09192 2026-06-05 cs.AI

Evidence Over Plans: Online Trajectory Verification for Skill Distillation

Yang Zhou, Zihan Dong, Zhenting Wang, Can Jin, Shiyu Zhao, Bangwei Guo, Difei Gu, Linjun Zhang, Mu Zhou, Dimitris N. Metaxas

Affiliations: Rutgers University

AI summary: This paper proposes the trajectory-based Posterior Distillation Index (PDI) to assess how well a skill is grounded in task-environment evidence, and uses the SPARK framework to generate environment-verified trajectories, improving the efficiency and transferability of skills.

Abstract

Agent skills can remarkably improve task success rates by using human-written procedural documents, but their quality is difficult to assess without environment-grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify this as a fundamental timing bottleneck: robust skills should be posterior-based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory-level metric that quantifies how well a distilled skill is grounded in the task-environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation), which preserves task execution evidence for full trajectory-level analysis. SPARK generates environment-verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK-generated skills consistently surpass no-skill baselines and outperform human-written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI-guided distillation produces efficient and transferable skills grounded in the task-environment interaction. We release our code at https://github.com/EtaYang10th/spark-skills.

2605.08318 2026-06-05 cs.LG cs.AI cs.NA math.NA physics.comp-ph stat.ML

When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains

Brandon Yee, Pairie Koh, Jack Rodriguez, Mihir Tekal

Affiliations: Physics Lab, Yee Collins Research Group

AI summary: This paper studies architecture selection for deep learning models that solve partial differential equations (PDEs), asking when transformer architectures with learned attention outperform Fourier-domain neural operators. It introduces the Multi-Scale Attention Transformer (MSAT), which encodes spatiotemporal solution histories as token sequences and is trained end-to-end with a composite supervised objective. A comprehensive empirical evaluation against nine baselines (including physics-informed neural networks, neural operators, and state-space models) on five benchmark problems shows the best generalization on complex-geometry problems.

Comments: Substantial Revision Required

Abstract

We study the problem of architecture selection for deep learning models trained to solve partial differential equations (PDEs), asking when transformer-based architectures with learned attention outperform Fourier-domain neural operators. We introduce the Multi-Scale Attention Transformer (MSAT), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and trains end-to-end via a composite supervised objective with optional physics-informed regularization terms. We conduct a comprehensive empirical evaluation against nine baselines, including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO), across five benchmark problems from the PINNacle suite, using identical train/test splits and reference data for all methods. MSAT achieves state-of-the-art generalization on complex geometry problems ($L^2_\mathrm{rel} = 0.0101$ on Heat2D-CG, a $3.7\times$ improvement over FNO) at $34\,\mathrm{s}$ total inference vs. $120{,}812\,\mathrm{s}$ for Mamba-NO. Ablation studies over the physics regularization component reveal a precise inductive bias tradeoff: physics priors reduce test error on diffusion-dominated problems but degrade generalization on chaotic and recirculating-flow regimes, directly characterizing the prior misspecification boundary. Approximation error bounds as a function of domain boundary complexity $\kappa$ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.
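
For readers unfamiliar with the reported metric, the sketch below computes a relative $L^2$ error in its common form, $\|u_\mathrm{pred} - u_\mathrm{ref}\|_2 / \|u_\mathrm{ref}\|_2$, and works through the arithmetic implied by the quoted numbers. Two assumptions are made: that the benchmark uses this standard definition, and that "a $3.7\times$ improvement" means the FNO error is 3.7 times the MSAT error. The sample vectors are made up.

```python
import math

def relative_l2(pred, ref):
    """Relative L2 error: ||pred - ref||_2 / ||ref||_2."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den

# Made-up reference and prediction: ||ref|| = 3 and ||pred - ref|| = 0.3.
ref = [1.0, 2.0, 2.0]
pred = [1.0, 2.0, 2.3]
err = relative_l2(pred, ref)

# Arithmetic on the figures quoted in the abstract.
msat_err = 0.0101
implied_fno_err = msat_err * 3.7      # FNO error if "3.7x" is an error ratio
inference_ratio = 120812 / 34         # Mamba-NO seconds divided by MSAT seconds
```

Under these assumptions the implied FNO error is about 0.037, and the inference-time ratio against Mamba-NO is roughly 3,550.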

2605.08253 2026-06-05 cs.LG cs.AI

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

Boyang Xu, Qing Zou, Siqin Yang, Hao Yan

Affiliations: University of Science and Technology of China

AI summary: This paper proposes Path-Coupled Bellman Flows (PCBF), a continuous-time distributional reinforcement learning method that learns return distributions with flow matching to address the boundary mismatch and high-variance bootstrapping of existing methods; experiments show improved distributional fidelity and training stability.

Journal ref: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismatch at the flow source or from high-variance bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using source-consistent Bellman-coupled paths: the current path starts from the required base prior at $t=0$, reaches the Bellman target at $t=1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda=0$ recovers an unbiased sample Bellman target, while $\lambda>0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.

2605.08215 2026-06-05 cs.CV cs.LG cs.RO

Test-Time Training for Visual Foresight Vision-Language-Action Models

Sangwu Park, Wonjoong Kim, Yeonjun In, Sein Kim, Hongseok Kang, Chanyoung Park

Affiliations: KAIST

AI summary: This paper proposes a test-time training method that makes visual foresight vision-language-action models more robust to out-of-distribution data, and introduces an adaptive update filtering mechanism to reduce the practical problems caused by test-time updates.

Comments: Accepted at ICML 2026 Workshop on Continual Adaptation at Scale (CATS)

Abstract

Visual Foresight VLA (VF-VLA) has become a prominent architectural choice among recent VLAs due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because the quality of the action directly depends on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.

2605.07482 2026-06-05 cs.LG cs.AI

SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

Zizhao Hu, Ameya Godbole, Johnny Tian-Zheng Wei, Mohammad Rostami, Jesse Thomason, Robin Jia

Affiliations: University of Southern California; USC Information Sciences Institute

AI summary: This paper proposes SHRED, a retain-set-free unlearning method that uses self-distillation with logit demotion to unlearn while preserving model utility, outperforming conventional methods that require a retain set.

Abstract

Machine unlearning for large language models (LLMs) aims to selectively remove memorized content such as private data, copyrighted text, or hazardous knowledge, without costly full retraining. Most existing methods require a retain set of curated examples to prevent catastrophic degradation of general model utility, creating an extra data dependency that complicates deployment. We propose SHRED (Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion), a retain-set-free unlearning method built on a key insight: not all tokens within a forget set instance carry memorized information equally. High-information tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED operates in two stages. (1) Selection: We perform a forward pass on a forget set instance, collect per-token autoregressive probabilities, and select the bottom (lowest probability, highest Shannon information) as forget positions; the remaining positions are retained as benign anchors. (2) Training: We construct modified KL targets that demote the memorized token's logit at forget positions while preserving the original distribution at benign positions. The model is then trained via a single top KL self-distillation objective that simultaneously drives forgetting and utility preservation. We evaluate SHRED across four standard unlearning benchmarks and demonstrate that it establishes a new Pareto-optimal trade-off between forget efficacy and model utility, outperforming retain-set-dependent methods. Our analysis shows that SHRED is robust against relearning attacks and membership-inference attacks, and it maintains stable utility even after many sequential unlearning runs.
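
The selection stage can be sketched as follows: compute each token's Shannon information (surprisal, $-\log_2 p$) from its autoregressive probability and mark the highest-surprisal positions as forget positions, leaving the rest as benign anchors. The cutoff is an assumption here, since the abstract does not say how many positions are selected; `fraction`, `select_forget_positions`, and the probabilities are hypothetical.

```python
import math

def select_forget_positions(token_probs, fraction):
    """Split positions into (forget, benign) by per-token surprisal.

    `fraction` is a hypothetical knob: the share of positions, ranked by
    highest surprisal (lowest probability), treated as forget positions.
    """
    surprisal = [-math.log2(p) for p in token_probs]
    k = max(1, round(fraction * len(token_probs)))
    ranked = sorted(range(len(token_probs)), key=lambda i: surprisal[i], reverse=True)
    forget = sorted(ranked[:k])
    benign = sorted(ranked[k:])
    return forget, benign

# Made-up per-token probabilities from one forward pass over a forget-set instance.
probs = [0.9, 0.05, 0.8, 0.01, 0.6, 0.7]
forget, benign = select_forget_positions(probs, fraction=0.5)
```

In the training stage described in the abstract, the target distribution would then be modified only at the `forget` positions, while the original distribution is kept at the `benign` positions.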

URL PDF HTML ☆

赞 0 踩 0

2605.07096 2026-06-05 cs.LG cs.AI stat.ME

Query-efficient model evaluation using cached responses

Hayden Helm, Ben Johnson, Carey Priebe

Affiliations: arXiv.org; University of Maryland

AI summary: This paper proposes a method based on the Data Kernel Perspective Space (DKPS) that uses cached model responses to predict benchmark performance, reducing the number of queries needed to evaluate a new model and making model evaluation more efficient.

Abstract

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.

URL PDF HTML ☆

赞 0 踩 0

2604.10528 2026-06-05 cs.CV

BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

Aaditya Baranwal, Vishal Yadav, Abhishek Rajora

Affiliations: University of Central Florida; University of Calgary

AI summary: This paper proposes the BareBones benchmark, which strips RGB texture and keeps only silhouettes to test 26 vision-language models on zero-shot geometric shape comprehension, and finds a severe Texture Bias Cliff in these models.

Comments: Accepted at CVPR (13th FGVC Workshop) 2026

Abstract

While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce BareBones, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the Texture Bias Cliff. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding. Project Page: https://eternal-f1ame.github.io/WTP-Bench/

2604.27343 2026-06-05 cs.CV

JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification

Phan Nguyen, Dat Cao, Hien Kha, Hien Chu, Minh Le, Trang Pham, Nguyen Quoc Khanh Le

Affiliations: University of California, San Diego

AI summary: This paper proposes the JI-ADF framework, which integrates dermoscopic images, clinical photographs, and structured patient data for clinically grounded skin lesion classification. It uses multimodal representation learning and an adaptive decision fusion mechanism to improve cross-modal reasoning, and its reliability in real clinical settings is validated on the MILK10k dataset.

Abstract

Skin lesion classification is essential for early dermatological diagnosis, yet many existing computer-aided systems rely primarily on dermoscopic images and underutilize the multimodal evidence routinely available in clinical practice. To address this gap, we propose JI-ADF, a trimodal deep learning framework that integrates dermoscopic images, clinical photographs, and structured patient metadata for clinically grounded skin lesion classification. The proposed architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis. To enhance cross-modal reasoning while preserving modality-specific evidence, we further introduce a multimodal fusion attention (MMFA) module. We evaluate JI-ADF on the large-scale MILK10k benchmark, which reflects real-world clinical acquisition conditions and severe class imbalance. The proposed method demonstrates strong and well-balanced performance across lesion categories, improving sensitivity and Dice score while maintaining high specificity and good calibration. Extensive analyses, including modality ablation, calibration evaluation, and Grad-CAM visualization, further confirm the robustness and clinically meaningful behavior of the model. These results indicate that JI-ADF provides a reliable and practical foundation for multimodal skin lesion classification in real-world clinical settings.
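
One way to picture per-sample decision fusion is a weighted average of each branch's class probabilities, with weights that change from sample to sample. The sketch below assumes the weights come from a softmax over per-sample gate scores; in a trained system those scores would be produced by a learned module, whereas here they are set by hand. The function name, the three branch outputs, and the scores are all hypothetical, and the abstract does not confirm that the fusion takes this exact form.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_fusion(branch_probs, gate_scores):
    """Fuse per-branch class probabilities using weights softmax(gate_scores)."""
    weights = softmax(gate_scores)
    n_classes = len(branch_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, branch_probs))
        for c in range(n_classes)
    ]
    return fused, weights

# Made-up class probabilities for one sample from three modality branches.
dermoscopy = [0.7, 0.2, 0.1]
clinical = [0.4, 0.5, 0.1]
metadata = [0.3, 0.3, 0.4]

# Hand-set gate scores (assumption): this sample trusts dermoscopy most.
fused, weights = adaptive_fusion([dermoscopy, clinical, metadata], gate_scores=[2.0, 0.5, -1.0])
```

Since the weights sum to one and each branch outputs a probability distribution, the fused vector is again a distribution; a different sample with different gate scores would weight the modalities differently.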
