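arXivDaily daily academic digest: synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.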

2606.03110 2026-06-04 cs.CL

Coherence Maximization Improves Pluralistic Alignment

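Taslim Mahbub, Yiding Pei, Shi Feng

Affiliations: George Washington University

AI summary: Uses Internal Coherence Maximization (ICM), which infers labels by maximizing their mutual predictability, to generate persona-specific examples that align a model with a target group's values without human supervision, and shows that the coherence of the examples matters beyond individual label accuracy.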

Abstract

Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these examples effective, using Internal Coherence Maximization (ICM) -- which infers labels by maximizing their mutual predictability -- to generate persona-specific examples that steer a model toward a target group's values, without human supervision. Across four benchmarks spanning classification, preference, and open-ended generation, ICM-inferred in-context examples match the performance of gold labels. Crucially, coherence matters beyond individual label accuracy: with accuracy held constant, more coherent examples generalize substantially better than incoherent ones. For personas underrepresented in pretraining data, targeted human feedback on the questions where the model is least certain about a persona's values yields better generalization than the same number of labels on arbitrary questions. These results identify coherence as a key design principle for scalable value specification, leveraging the diverse human perspectives already encoded in pretrained language models.

2606.02914 2026-06-04 cs.AI cs.CL

Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

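Sema Helali, Lina Abu Nada, Sausan Al Kawas, Alaa Abd-Alrazaq, Faleh Tamimi, Rafat Damseh

Affiliations: University of Al Ain, UAE; Sharjah University, UAE; Cornell University, Qatar; McGill University

AI summary: Through a PRISMA-ScR review, proposes a two-dimensional classification framework and compares language-generative models, discriminative vision foundation models, and dental-specific foundation models on dental tasks; finds that integrated pipelines outperform single-model approaches and identifies barriers including data asymmetry, hallucination, and the lack of standardized benchmarks.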

Abstract

Background: Oral diseases affect nearly 3.5 billion people worldwide, yet the comparative clinical potential of large-scale AI models in dentistry remains poorly understood. Three distinct model categories have emerged: language-generative models, discriminative vision foundation models, and dental-specific foundation models, with no unified review examining their relationships and collective limitations. Methods: Following PRISMA-ScR guidelines, we systematically searched four databases (PubMed, Google Scholar, Scopus, arXiv), screened independently by two reviewers. After applying inclusion/exclusion criteria, 97 studies (2020-2026) were included. We propose a two-dimensional classification framework organizing models by architectural paradigm and dental specialization degree. Results: Language-generative models excel at text-based tasks (clinical reasoning, licensing exams, patient communication) but show inconsistent performance on image-dependent diagnostics. Adapted SAM and CLIP variants achieve strong tooth segmentation and lesion detection results. Dental-specific models (DentVFM, DentVLM, OralGPT) demonstrate strongest performance on complex multimodal tasks. Integrated pipelines consistently outperform single-model approaches. A data asymmetry is observed: dental-specific pretraining concentrates almost entirely in the vision domain, reflecting scarce large-scale dental text corpora. Conclusions: General-purpose and dental-specific models play complementary roles; the most effective systems combine both within structured pipelines. Safe autonomous deployment requires resolving three persistent barriers: hallucination in generative models, limited annotated dental datasets, and absent standardized clinical evaluation benchmarks.

2606.02894 2026-06-04 cs.CV

Tiny Collaborative Inference for Occlusion-Robust Object Detection

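Chieh-Tung Cheng, Mustafa Aslanov, Eiman Kanjo

Affiliations: Imperial College London; Nottingham Trent University

AI summary: Targeting ultra-low-end edge devices, combines an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation; evaluates decision-level fusion (WBF) against feature-level fusion, reporting gains of up to +0.2736 mAP under occlusion, and validates multi-view fusion and a Wi-Fi peer-to-peer deployment.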

Abstract

Edge AI nodes for search and rescue are increasingly expected to run computer vision locally, yet ultra-low-end hardware imposes hard constraints on memory, compute, and inter-device communication. This work addresses occlusion-robust object detection on devices with less than 1 MB SRAM by combining an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation. Two collaborative inference strategies are evaluated: feature-level fusion, concatenating intermediate feature maps, and decision-level fusion via Weighted Boxes Fusion (WBF). WBF outperforms feature-level fusion under all tested occlusion conditions, yielding gains of up to +0.2736 mAP in asymmetric scenarios. Extending fusion to three views improves accuracy further (up to +0.3827 mAP) at modest communication overhead (~1.3 KB per exchange). Hardware experiments progress from a host-assisted USB-relay baseline to a Wi-Fi peer-to-peer deployment on two Coral Dev Board Micro units, where WBF executes on-device with negligible communication energy relative to inference. In a 301.9 s autonomous session of 108 frames, fused output is produced on 61 frames versus 47 for a single board, a coverage gain of +29.8%. A decentralised federated learning feasibility note is included but not treated as a primary result, as performance remains limited under non-iid data. The results support decision-level fusion as a viable option for improving occlusion robustness in small-scale edge object detection, including host-free multi-board operation on ultra-low-end hardware.
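
To make the decision-level fusion step concrete, here is a minimal single-class sketch of Weighted Boxes Fusion. It is a simplified illustration of the general WBF idea (cluster overlapping boxes, then average their coordinates weighted by confidence), not the authors' on-device implementation. The `Box` layout, the `iou_thr` default, and all function names are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x1, y1, x2, y2, confidence).
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster: List[Box]) -> Box:
    """Confidence-weighted mean of coordinates; confidence is the cluster mean.

    Assumes every confidence is strictly positive.
    """
    total = sum(b[4] for b in cluster)
    coords = tuple(sum(b[i] * b[4] for b in cluster) / total for i in range(4))
    return coords + (total / len(cluster),)


def weighted_boxes_fusion(views: List[List[Box]], iou_thr: float = 0.55) -> List[Box]:
    """Fuse per-view detections of one class into a single list of boxes."""
    boxes = sorted((b for view in views for b in view), key=lambda b: b[4], reverse=True)
    clusters: List[List[Box]] = []
    fused: List[Box] = []
    for box in boxes:
        # Find the fused box that overlaps this detection the most.
        best, best_iou = -1, iou_thr
        for i, f in enumerate(fused):
            overlap = iou(f, box)
            if overlap > best_iou:
                best, best_iou = i, overlap
        if best < 0:
            clusters.append([box])
            fused.append(box)
        else:
            clusters[best].append(box)
            fused[best] = fuse_cluster(clusters[best])
    # Down-weight boxes that only some of the views agreed on.
    n = len(views)
    return [f[:4] + (f[4] * min(len(c), n) / n,) for f, c in zip(fused, clusters)]


# Two views: both see the first object, only the second view sees the other.
view_a = [(10.0, 10.0, 50.0, 50.0, 0.9)]
view_b = [(12.0, 12.0, 52.0, 52.0, 0.6), (100.0, 100.0, 140.0, 140.0, 0.5)]
result = weighted_boxes_fusion([view_a, view_b])
```

Only box coordinates and scores cross the link in a scheme like this, which is consistent with the small per-exchange payload the abstract reports. The coverage figure in the abstract follows from simple arithmetic: (61 - 47) / 47 ≈ 29.8%.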

2606.02886 2026-06-04 cs.LG cs.AI cs.CE math.PR physics.ao-ph

Scalable Uncertainty Quantification for Extreme Weather Forecasting via Empirical Neural Tangent Kernels

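Jose Marie Antonio Miñoza, Rex Gregor Laylo, Sebastian C. Ibañez

Affiliations: Center for AI Research; Department of Education; Makati Philippines

AI summary: Proposes Neural Tangent Kernel-based uncertainty quantification using last-layer empirical features; drawing on a variance-collapse mechanism and an analysis of decomposition performance, it produces adaptive prediction intervals for extreme weather without retraining.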

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

DOI: 10.1145/3770855.3818106

Abstract

Deep learning weather models now match numerical weather prediction accuracy while running orders of magnitude faster, but produce deterministic forecasts without uncertainty estimates, a critical gap for high-stakes decisions during extreme weather events. This paper proposes Neural Tangent Kernel-based uncertainty quantification (NTK-UQ) using last-layer empirical features. Theoretical analysis predicts that UQ quality is architecture-dependent through two mechanisms. First, a variance collapse mechanism explains when UQ fails: when the eigenvalue truncation rank approaches the effective rank of the feature space, the GP correction term consumes nearly all prior variance, destroying discrimination between tropical cyclones and routine conditions; architectures with concentrated spectra (spectral operators) require aggressive truncation (k ≤ 10), while attention-based models tolerate full-rank computation. Second, decomposition performance depends on the non-Gaussian, heavy-tailed structure of extreme weather: Independent Component Analysis exploits higher-order statistics (kurtosis, negentropy) to isolate heavy-tailed extreme-event features, achieving higher discrimination than singular value decomposition, which captures only second-order variance. A data-driven selection rule chooses ICA or SVD from the feature eigenspectrum concentration ratio, correctly prescribing the superior decomposition for all four evaluated architectures. Compared to split conformal prediction (the natural post-hoc baseline), NTK-UQ achieves 31-37% sharper prediction intervals at 90% coverage, and uniquely produces adaptive intervals that scale with extreme event severity, which conformal prediction cannot achieve by construction. The framework requires no retraining; inference-time uncertainty requires only a single matrix-vector product per sample.
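
The split conformal baseline named in the abstract can be sketched in a few lines. The version below assumes the plain absolute-residual score, which is what makes every interval the same width regardless of the input. The abstract does not state which score the paper's baseline uses, so treat this as an illustration of the standard method, with hypothetical function names and toy residuals.

```python
import math
from typing import List, Tuple


def split_conformal_halfwidth(cal_residuals: List[float], alpha: float = 0.1) -> float:
    """Half-width q so that [pred - q, pred + q] covers with probability >= 1 - alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute calibration residual.
    """
    n = len(cal_residuals)
    scores = sorted(abs(r) for r in cal_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for the requested level
    return scores[rank - 1]


def predict_interval(pred: float, q: float) -> Tuple[float, float]:
    """The same half-width is applied to every forecast."""
    return pred - q, pred + q


# Toy calibration residuals (truth minus forecast), made up for this sketch.
residuals = [0.5 * i * (-1) ** i for i in range(1, 21)]
q = split_conformal_halfwidth(residuals, alpha=0.1)
calm = predict_interval(5.0, q)
extreme = predict_interval(60.0, q)
```

Both `calm` and `extreme` get the same width here, which is the non-adaptive behaviour that the abstract contrasts with NTK-UQ.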

2606.02636 2026-06-04 cs.RO cs.AI

Too Much of a Good Thing: When sim2real Efforts Impede Policy Learning (And What to Do About It)

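Kyle Morgenstein, Bharath Masetty, Stephen Welch, Luis Sentis

Affiliations: Apptronik; University of Texas at Austin

AI summary: Argues that misaligned incentives between sim2real efforts and policy learning lead to simulator lock-in and insufficient policy exploration, and proposes a potential remedy: a sim2sim2real paradigm that uses only the robot's kinematics as a design constraint.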

2606.02576 2026-06-04 cs.CV cs.LG

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou

Affiliations: School of Artificial Intelligence, Nanjing University, China; State Key Laboratory of Novel Software Technology, Nanjing University, China

AI summary: Proposes ProtoAda, a framework that uses format-aware task prototypes and geometry-aware parameter consolidation to address task mis-routing and gradient interference in multimodal continual instruction tuning.

Abstract

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures like Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures could share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert; image-text similarity alone is insufficient for reliable task assignment. For example, an expert in a grounding task requiring coordinate prediction may be biased toward producing short textual answers after learning semantically similar VQA tasks. This format-blind task assignment integrates heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes to align task assignment and routing with both task semantics and output structure, and further consolidates format-compatible updates in a geometry-aware manner to effectively reuse and progressively refine existing parameters. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.

2606.02521 2026-06-04 cs.LG cs.CV

Drifting Preference Optimization for One-Step Generative Models

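Zhou Jiang, Yandong Wen, Zhen Liu

Affiliations: Westlake University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes DrPO, which builds a non-parametric dipole preference field and a reference drift from online sampling and ranking, enabling preference finetuning of one-step generative models without reward gradients.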

Comments 24 pages, 9 figures

Abstract

One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by 3.51× under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.

2606.02403 2026-06-04 cs.CL cs.AI

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

Massimiliano Pronesti, Angelo Miculescu, Mohsin Kapdi, Paul Flanagan, Oisín Redmond, Joao Bettencourt-Silva, Gurdeep Mannu, Spiros Denaxas, Rui Bebiano Da Providencia E Costa, Anya Belz, Yufang Hou

Affiliations: IBM Research; Dublin City University; UCL; University of Oxford; IT:U Interdisciplinary Transformation University Austria

AI summary: Introduces AutoForest, a system that generates publication-ready forest plots directly from biomedical papers through end-to-end evidence extraction and statistical synthesis, accelerating evidence synthesis and lowering the barrier to meta-analysis.

Comments Accepted to ACL2026 (System Demonstrations Track)

Abstract

Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragmented and labour-intensive process. Researchers must interpret complex clinical texts, manually extract outcome data from trials, define appropriate interventions and comparators, harmonise inconsistent study designs, and carry out meta-analytic computations, typically using specialised software that demands structured inputs and domain expertise. While recent work has demonstrated that large language models can extract study-level data from unstructured text, no existing system automates the complete pipeline from raw documents to synthesised forest plots. To address this gap, we introduce AutoForest, the first end-to-end system that generates publication-ready forest plots directly from biomedical papers. Given one or more study papers, AutoForest automatically suggests ICO (Intervention, Comparator, Outcome) elements, extracts outcome data, performs statistical synthesis, and renders the final forest plot. We describe the system architecture and user interface, and demonstrate its effectiveness on real-world examples through a user study involving clinicians, showing how AutoForest can accelerate evidence synthesis and substantially lower the barrier to conducting meta-analyses.
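
The "statistical synthesis" step behind a forest plot can be illustrated with the standard inverse-variance fixed-effect model. The abstract does not say which pooling model AutoForest applies, so the choice of model, the function name, and the toy inputs below are assumptions for illustration only.

```python
import math
from typing import List, Tuple


def fixed_effect_pool(
    effects: List[float], std_errors: List[float]
) -> Tuple[float, float, Tuple[float, float]]:
    """Inverse-variance fixed-effect pooling of per-study effect estimates.

    Each study is weighted by 1 / SE^2; the pooled SE is sqrt(1 / sum of weights).
    Returns (pooled effect, pooled SE, 95% confidence interval).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = 1.959964  # two-sided 95% quantile of the standard normal
    return pooled, pooled_se, (pooled - z * pooled_se, pooled + z * pooled_se)


# Hypothetical per-study effects (for example, log odds ratios) and standard errors.
pooled, pooled_se, ci = fixed_effect_pool([0.2, 0.4], [0.1, 0.1])
```

In a forest plot, each study's effect and interval is drawn as one row, and the pooled estimate with its interval is drawn as the summary marker at the bottom.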

2606.02166 2026-06-04 cs.LG

EEG-FuseFormer: A Transformer-Driven Feature Fusion Framework for Seizure Onset Prediction

Vigneshwar Hariharan, Chithra Reghuvaran, Arlene John, Nhat Pham, Omer Rana, Deepu John, Ganesh Neelakanta Iyer

Affiliations: National University of Singapore; University College Dublin; University of Twente; Cardiff University

AI summary: Proposes EEG-FuseFormer, which fuses spatiotemporal features from a CNN-LSTM with spectral features from a ResNet-18 using a transformer encoder, achieving a mean recall of 98.85% on the CHB-MIT dataset and outperforming most existing methods.

Comments IEEE International Instrumentation and Measurement Technology Conference (I2MTC) 2026

Abstract

Epilepsy is one of the most common neurological disorders globally, characterized by recurring seizures and significantly impacting the quality of life. Despite advancements in diagnostic techniques, the mitigation of risks faced by epilepsy patients remains challenging due to the unpredictability of seizure events. An accurate forecast of seizure onset helps to reduce risks in epilepsy patients. In this paper, we propose EEG-FuseFormer, a transformer-based feature fusion framework for seizure-onset prediction that combines intermediate features extracted from Convolutional Neural Networks-Long Short-Term Memory (CNN-LSTM) and ResNet-18 networks. The CNN-LSTM architecture captures both spatial and temporal features directly from the raw signal, whereas the ResNet-18 extracts features from the Short-Time Fourier Transform (STFT) representation of the EEG signals. Fusion is carried out using a transformer encoder, and the final prediction is generated using fully connected dense layers. The CHB-MIT dataset was used to validate the proposed model. The results show that the proposed model achieves a mean recall of 98.85% and outperforms most of the state-of-the-art methods. This study evaluates the ability of the proposed feature fusion model to generalize in cross-patient testing scenarios. Fine-tuning pre-trained models on limited target patient data (target adaptation) within the cross-patient validation framework results in higher recall, precision, and F1-score metrics in comparison to the conventional cross-patient validation approach. Finally, the runtime-based computational complexity of the model is assessed across diverse hardware platforms to highlight the performance-complexity trade-off.

2606.01961 2026-06-04 cs.AI

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Junqi Liu, Selena Song, Yuhan Wang, Jiawei Mao, Hardy Chen, Xiaoke Huang, Tianhao Qi, Pengfei Guo, Yucheng Tang, Yufan He, Can Zhao, Andriy Myronenko, Dong Yang, Daguang Xu, Yuyin Zhou

Affiliations: University of California, Santa Cruz; NVIDIA

AI summary: Presents AutoMedBench, a workflow-aware benchmark that evaluates autonomous agents in medical-AI research through a five-stage workflow (Plan, Setup, Validate, Inference, Submit); Validate is the weakest stage and Setup the strongest, and verification and submission failures dominate the tagged errors.

Abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.

2606.01770 2026-06-04 cs.LG cs.AI

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Zewen Liu, Zhan Shi, Yisi Sang, Bing He, Minhua Lin, Tianxin Wei, Dakuo Wang, Benoit Dumoulin, Wei Jin, Hanqing Lu

Affiliations: Emory University; Amazon; The Pennsylvania State University; UIUC; Northeastern University

AI summary: Proposes Adaptive Auto-Harness, which uses a stateful multi-agent evolver, a harness tree with solve-time routing, and human-steering hooks to counter the performance degradation of auto-harness systems on open-ended task streams, outperforming existing baselines across several streams.

Abstract

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open-ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task-wise adaptation. We introduce Adaptive Auto-Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi-agent evolver, a harness tree with solve-time routing, and human-steering hooks for cases where history lacks the needed signal. Across prediction-market, security-competition, and event-forecasting streams, Adaptive Auto-Harness outperforms five existing auto-harness baselines, and ablations attribute gains to better construction, routing, or targeted human steering. Code is available at https://github.com/A-EVO-Lab/a-evolve/tree/release/adaptive-auto-harness.

2606.01649 2026-06-04 cs.CV

PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation

Weixing Chen, Zhuoqian Feng, Yang Liu, Yexin Zhang, Yifan Wen, Yinghong Liao, Weichao Qiu, Guanbin Li, Liang Lin

Affiliations: Sun Yat-sen University, China; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing; Huawei

AI summary: Proposes PhyScene3D, a framework that uses a Cognitive Topological Reasoning Chain and Physics-Aware Denoising Alignment to address physical consistency in 3D tabletop scene generation, substantially reducing the collision rate.

Comments 23 pages, 5 figures, accepted by ICML 2026

Abstract

Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Here, an interactive scene denotes a physically valid, collision-free environment directly loadable into physics simulators. Existing methods, ranging from decoupled symbolic solvers to end-to-end regression models, often suffer from error propagation or overfitting to noisy supervision containing widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process. It employs a 3D AABB-based placement scheme that imposes a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA). It integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in scene-wise collision rate relative to the human-annotated training data.
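
A scene built from 3D axis-aligned bounding boxes (AABBs) admits a very cheap collision check, sketched below. This is only a coarse box-level test written for illustration; the paper's alignment step uses a differentiable signed distance field, and the abstract does not define how its scene-wise collision rate is computed. The `AABB` class, the tolerance `eps`, and the toy scene are all assumptions.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class AABB:
    """Axis-aligned box given by its minimum and maximum corners (hypothetical layout)."""
    name: str
    lo: Tuple[float, float, float]
    hi: Tuple[float, float, float]


def overlaps(a: AABB, b: AABB, eps: float = 1e-9) -> bool:
    """True when the boxes interpenetrate on all three axes.

    Boxes that merely touch (a cup resting on a table) do not count as colliding.
    """
    return all(a.lo[i] < b.hi[i] - eps and b.lo[i] < a.hi[i] - eps for i in range(3))


def colliding_pairs(scene: List[AABB]) -> List[Tuple[str, str]]:
    """Names of every pair of objects whose boxes interpenetrate."""
    return [(a.name, b.name) for a, b in combinations(scene, 2) if overlaps(a, b)]


# Toy tabletop: the plate rests on the table, but the cup is sunk into the plate.
scene = [
    AABB("table", (0.0, 0.0, 0.0), (1.0, 1.0, 0.05)),
    AABB("plate", (0.1, 0.1, 0.05), (0.3, 0.3, 0.07)),
    AABB("cup", (0.2, 0.2, 0.05), (0.3, 0.3, 0.15)),
]
pairs = colliding_pairs(scene)
```

A scene with an empty `pairs` list passes this coarse test; objects with non-box shapes would still need a finer check, such as the SDF-based one the abstract describes.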

2606.01573 2026-06-04 cs.CV

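VG²GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

Yibin Zhao, Yihan Pan, Jun Nan, Wenli Yang, Liwei Chen, Jianjun Yi

Affiliations: East China University of Science and Technology; Shanghai Open University; Shanghai Xiaoyuan Innovation Center

AI summary: Proposes VG²GT, which uses a frozen visual foundation model and a multi-scale differentiable voxel module, regresses Gaussian primitive parameters directly from voxel features, and supervises depth maps through stochastic solid volume rendering to achieve geometrically accurate Gaussian scene reconstruction, outperforming current state-of-the-art methods on several datasets.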

Abstract

Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose VG²GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. VG²GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables VG²GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. VG²GT outperforms current state-of-the-art methods on the widely used DTU, Replica, TAT, and ScanNet datasets.

2606.01537 2026-06-04 cs.CV cs.LG

PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder

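Yancheng Liu, Kenichi Maeda, Manan Pancholy

Affiliations: University of California, Berkeley; University of Tokyo; University of Michigan

AI summary: Proposes PaCX-MAE, a cross-modal distillation framework that injects physiological priors into chest X-ray encoders through a dual contrastive-predictive objective, improving performance on physiology-dependent tasks while remaining unimodal at inference.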

Comments Accepted at the ICML 2026 3rd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS)

Abstract

Clinical diagnosis often requires combining imaging with physiological measurements, yet deployed models typically operate on unimodal data. We present PaCX-MAE, a cross-modal distillation framework that injects physiological priors into chest X-ray (CXR) encoders while remaining strictly unimodal at inference. PaCX-MAE augments in-domain masked autoencoding with a dual contrastive-predictive objective, aligning CXR representations with paired ECG and laboratory embeddings. Extensive evaluation across nine benchmarks demonstrates consistent improvements over domain-specific MAE, particularly on physiology-dependent tasks (e.g., +2.7 AUROC on MedMod; +6.5 F1 on VinDr). The method proves highly label-efficient in the 1% regime and preserves anatomical fidelity, achieving parity with MAE on segmentation tasks. Zero-shot and attention analyses confirm that PaCX-MAE successfully learns to attend to physiological indicators, such as the cardiac silhouette, absent in standard visual pretraining.


2606.01495 2026-06-04 cs.LG cs.CL

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

CART: 上下文锚定循环Transformer——一种具有学习稳定性的参数高效架构

Chad A. Capps

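Affiliations: Independent Researcher

AI summary: Proposes CART, a language model that achieves parameter efficiency by looping a shared core block over frozen key-value tensors, with a linear time-invariant gate to keep the recurrence stable; experiments show it falls slightly short of a dense baseline at parameter parity.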

Comments 31 pages, 4 figures. Code, training scripts, and the full experiment database (results.db) are available at https://github.com/ccapps42/CART


AI中文摘要

我们提出CART（上下文锚定循环Transformer），一种参数高效的语言模型，它在深度上重复使用单个共享核心块R次。与先前每次迭代重新计算键值张量的循环Transformer不同，CART从多层前奏中一次性计算K和V，并通过多头潜在注意力让循环核心交叉关注这些冻结的张量。一个学习得到的线性时不变（LTI）门控保持循环稳定性：其谱半径在所有36个完全训练配置中稳定在窄带内（rho在[0.79, 0.83]之间）。我们在单个消费级GPU上分两个阶段评估CART：首先在3000步进行64配置筛选，然后对36个配置（P=6，R∈{6,8,10}，三个种子）训练30500步（约10亿token）。在宽度d∈{256,512,768,1024}上，两个模式成立：前奏深度P主导循环次数R，并且R的第一阶段排名在完全训练时反转（在d≥512时R=6变为最佳）。在绑定d=1024的参数对比测试中，CART未能击败参数匹配的密集基线，在存储参数对比中损失1-2%，在有效参数对比中损失约10%。诊断消融将有效参数差距分为约5%来自权重共享和约5%来自异质的前奏/锚点/核心/尾声框架；循环核心机制（超连接、LTI门控、循环索引嵌入）单独来看是退化的。变R推理在训练R的两侧性能下降，这是该方案下测试时深度扩展的一个负面结果。

英文摘要

We present CART (Context-Anchored Recurrent Transformer), a parameter-efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key-value tensors at every iteration, CART computes K and V once from a multi-layer prelude and has the recurrent core cross-attend to those frozen tensors via multi-head latent attention. A learned Linear Time-Invariant (LTI) gate keeps the recurrence stable: its spectral radius settles in a narrow band (rho in [0.79, 0.83]) across all 36 fully-trained configurations. We evaluate CART on single consumer GPUs in two stages: a 64-configuration screen at 3,000 steps, then 36 configurations (P=6, R in {6,8,10}, three seeds) trained for 30,500 steps (~1B tokens). Two patterns hold across widths d in {256,512,768,1024}: prelude depth P dominates loop count R, and the Stage-1 ranking of R reverses at full training (R=6 becomes best at d>=512). At the binding d=1024 parameter-parity test, CART does not beat a parameter-matched dense baseline, losing by 1-2% at stored-parameter parity and by ~10% at effective-parameter parity. Diagnostic ablations split the effective-parameter gap into ~5% from weight sharing and a residual ~5% from the heterogeneous prelude/anchor/core/coda framing; the recurrent-core machinery (hyper-connections, LTI gate, loop-index embedding) is individually vestigial. Variable-R inference degrades on both sides of the trained R, a negative result for test-time depth scaling under this recipe.
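The stability claim in this abstract rests on a standard fact about linear recurrences: iterating h <- A h + x stays bounded when the spectral radius of A is below 1. The sketch below assumes a diagonal gate, which is an illustrative simplification and not necessarily the parameterization used in CART (the abstract does not specify it). The gate values, the input, and the function names are all made up for the example.

```python
# Illustrative only: a diagonal linear time-invariant recurrence h <- a * h + x.
# The diagonal form and every number here are assumptions, not taken from CART.

def spectral_radius_diagonal(gate):
    """For a diagonal matrix, the spectral radius is the largest |entry|."""
    return max(abs(a) for a in gate)


def run_recurrence(gate, x, steps):
    """Apply h <- a * h + x elementwise for `steps` iterations, starting at 0."""
    h = [0.0] * len(gate)
    for _ in range(steps):
        h = [a * hi + xi for a, hi, xi in zip(gate, h, x)]
    return h


gate = [0.79, 0.81, 0.83]  # hypothetical values inside the reported band
x = [1.0, -2.0, 0.5]       # hypothetical constant input

rho = spectral_radius_diagonal(gate)
h = run_recurrence(gate, x, steps=200)
# Because rho < 1, the state converges to the fixed point x / (1 - a).
fixed_point = [xi / (1.0 - a) for a, xi in zip(gate, x)]

print(rho)
print(h)
print(fixed_point)
```

The 200 iterations are far more than the loop counts R in {6, 8, 10} reported above; they are used here only to make the convergence visible.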


2606.01212 2026-06-04 cs.CL cs.AI cs.CR cs.IR

DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation

DiscourseFlip: 面向黑盒检索增强生成的非直述式语篇级观点操纵攻击

Yuyang Gong, Miaokun Chen, Jiawei Liu, Zhuo Chen, Guoxiu He, Wei Lu, XiaoFeng Wang, Xiaozhong Liu

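Affiliations: Wuhan University; East China Normal University; Nanyang Technological University; Worcester Polytechnic Institute

AI summary: Proposes DiscourseFlip, an agentic, graph-guided attack that uses coordinated influence across a semantic query network to maximize discourse-level opinion deviation under a limited budget; experiments demonstrate its effectiveness and stealth and expose the shortcomings of existing defenses.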


AI中文摘要

检索增强生成（RAG）系统被广泛部署且影响力日益增强，但其对外部语料库的依赖暴露了来自中毒检索内容的新安全风险。现有的RAG攻击主要关注单个查询或狭窄主题局部查询集，这限制了其实际影响范围，并在现实场景中提供有限的伪装。在本文中，我们引入了语篇级观点操纵，这是一种新的威胁模型，其中跨语义查询网络的协同影响会在整体、多主题查询空间上诱导观点转变。我们在黑盒设置中形式化了这种威胁，并提出了DiscourseFlip，一种基于代理的、图引导的攻击，动态分配有限的中毒预算以最大化语篇级观点偏差。大量实验表明，DiscourseFlip在上下文化查询网络上持续诱导目标观点转变，并在覆盖范围和有效性方面显著优于现有基线。用户研究进一步证实，DiscourseFlip有效且能很好地伪装以躲避用户检测。此外，系统分析表明，现有的缓解策略对语篇级操纵无效，这凸显了迫切需要更鲁棒和自适应的防御措施来应对语篇级漏洞。

英文摘要

Retrieval-Augmented Generation (RAG) systems are widely deployed and increasingly influential, but their reliance on external corpora exposes new security risks from poisoned retrieval content. Existing RAG attacks largely focus on individual queries or narrow topic-local query sets, which limits their practical reach and offers limited camouflage in real-world settings. In this paper, we introduce discourse-level opinion manipulation, a new threat model in which coordinated influence across a semantic query network induces opinion shifts over a holistic, multi-topic query space. We formalize this threat in a black-box setting and propose DiscourseFlip, an agentic, graph-guided attack that dynamically allocates a limited poisoning budget to maximize discourse-level opinion deviation. Extensive experiments demonstrate that DiscourseFlip consistently induces targeted opinion shifts across the contextualized query network and significantly outperforms existing baselines in terms of coverage and effectiveness. User studies further confirm that DiscourseFlip is effective while remaining well camouflaged from user detection. Moreover, systematic analyses show that existing mitigation strategies are ineffective against discourse-level manipulation, underscoring the urgent need for more robust and adaptive defenses to address discourse-level vulnerabilities.


2606.01023 2026-06-04 cs.CV cs.AI

Data Collection for Training Quality-Control AI in Carpet Manufacturing

地毯制造中用于训练质量控制AI的数据收集

Akbar Erkinov

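Affiliations: Independent Researcher

AI summary: To address slow, subjective, and inconsistent visual inspection in carpet production, proposes an in-line machine-vision system design that detects defects in real time with synchronized line-scan cameras and combined illumination, systematically collects labeled data to keep training quality-control models, and quantifies quality improvement through the DMAIC method.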

Comments 10 pages, 3 figures


AI中文摘要

视觉检测仍然是机织和簇绒地毯生产中主要的质量控制实践，但在现代织机的线速度和宽度下，它缓慢、主观且不一致。我们提出了一种在线机器视觉系统的设计方案，其主要目的有两个：实时检测地毯幅面，以及同样重要的是，系统地收集和标注缺陷图案的图像，以便在设备使用寿命内训练日益强大的质量控制模型。该方案基于一个具体的工业环境：在一个机织地毯生产设施中进行的六西格玛（DMAIC）项目，该项目预计在增加织机后会出现生产瓶颈，且基线缺陷率较高，质量故障带来的财务风险显著。我们描述了一个基于同步线扫描相机并组合明场和掠射照明的成像子系统，推导了在多米宽幅面上分辨细微结构缺陷所需的分辨率和吞吐量要求，并定义了地毯特定的缺陷分类。然后，我们提出了一种分阶段建模策略，从基于无缺陷材料的无监督异常检测开始，遵循MVTec异常检测基准中地毯类别的范例，并通过人在环的标注飞轮成熟为有监督的检测和分割模型。最后，我们将检测性能与DMAIC目标联系起来，展示逃逸缺陷的减少如何转化为过程质量和过程西格玛水平的提升。贡献在于提供了一个端到端、可部署的蓝图，将数据收集视为首要工程目标而非事后考虑。

英文摘要

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inconsistent at the line speeds and widths of modern looms. We present a design proposal for an in-line machine-vision system whose primary purpose is twofold: to inspect the carpet web in real time and, equally importantly, to systematically collect and label images of defect patterns so that increasingly capable quality-control models can be trained over the life of the installation. The proposal is grounded in a concrete industrial setting: a Six Sigma (DMAIC) project at a woven-carpet production facility that anticipated a production bottleneck following the installation of additional weaving machines, with a substantial baseline defect rate and significant financial exposure associated with quality failures. We describe an imaging subsystem based on synchronized line-scan cameras with combined bright-field and grazing illumination, derive the resolution and throughput requirements needed to resolve fine structural defects across a multi-metre web, and define a carpet-specific defect taxonomy. We then lay out a staged modelling strategy that begins with unsupervised anomaly detection trained on defect-free material, following the paradigm exemplified by the carpet category of the MVTec Anomaly Detection benchmark, and matures through a human-in-the-loop annotation flywheel into supervised detection and segmentation models. Finally, we connect detection performance to the DMAIC objectives, showing how reductions in escaped defects translate into improved process quality and process sigma levels. The contribution is an end-to-end, deployable blueprint that treats data collection as a first-class engineering objective rather than an afterthought.
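The link between escaped defects and process sigma level mentioned at the end of this abstract can be illustrated with the conventional Six Sigma calculation. This is a minimal sketch, assuming the common defects-per-million-opportunities (DPMO) definition and the customary 1.5-sigma long-term shift; the abstract does not say which convention the paper uses, and the defect counts below are invented.

```python
from statistics import NormalDist


def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    return defects / (units * opportunities_per_unit) * 1_000_000


def sigma_level(dpmo_value, shift=1.5):
    """Process sigma level from DPMO, using the conventional 1.5-sigma shift."""
    if not 0 < dpmo_value < 1_000_000:
        raise ValueError("DPMO must lie strictly between 0 and 1,000,000")
    yield_fraction = 1.0 - dpmo_value / 1_000_000
    return NormalDist().inv_cdf(yield_fraction) + shift


# Hypothetical counts: escaped defects before and after adding in-line inspection.
before = dpmo(defects=120, units=1000, opportunities_per_unit=4)
after = dpmo(defects=30, units=1000, opportunities_per_unit=4)

print(round(sigma_level(before), 2))
print(round(sigma_level(after), 2))
```

As a sanity check on the convention, 3.4 DPMO maps to a sigma level of about 6, which is where the name "Six Sigma" comes from.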


2606.00747 2026-06-04 cs.CV cs.AI

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

SkyShield：占用作为低空无人机自主飞行的安全接口

Jie Gao, Jie Ma, Kaihui Lin, Kai Ye, Miaohui Zhang, Pingyang Dai, Liujuan Cao

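Affiliations: Xiamen University; Jiangxi Academy of Sciences

AI summary: For 3D spatial understanding in low-altitude UAV autonomy, proposes SkyShield, the first front-view monocular semantic occupancy benchmark, together with the dynamics-aware metric KAR-mIoU and the geometry-first baseline SkyOcc, treating occupancy as a safety interface.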


AI中文摘要

对于低空无人机自主飞行，三维空间理解不仅仅是感知目标，更是人类指令与物理飞行之间的安全接口。在20米以下的人尺度城市空域中，薄几何结构、遮挡、植被和城市杂乱决定了飞行器能否安全进入前方空间。然而，现有的无人机数据集主要提供2D标注或3D框，而面向驾驶的占用基准假设稳定的地面级传感器装置。两者都缺少低空飞行的定义性场景：一个前视单目相机从移动的飞行器上观察占据和自由空间，具有逐帧变化的6自由度姿态和相机外参。为填补这一空白，我们提出了SkyShield，据我们所知，这是首个面向20米以下城市无人机飞行的前视单目语义占用基准。基于CARLA构建，SkyShield包含36K个前视无人机样本，涵盖多种城市场景和天气条件，每张图像配以逐帧6自由度无人机姿态、逐帧动态相机几何、无人机状态和前视截锥体语义占用标签。我们进一步提出了KAR-mIoU，一种以无人机为中心且动态感知的度量，通过运动可达性和碰撞时间重新加权体素级评估，揭示传统mIoU隐藏的安全关键风险。为应对这一具有挑战性的新场景，我们提供了SkyOcc，一种几何优先的单目基线，将逐帧无人机姿态集成到投影中，融合时序占用特征，并应用安全先验优化以保留稀疏的碰撞关键结构。SkyShield、KAR-mIoU和SkyOcc共同将占用确立为低空空中自主飞行的安全接口。代码和数据集将公开发布。

英文摘要

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety interface between human instructions and physical flight. In human-scale urban airspace below 20 meters, thin geometry, occlusions, vegetation, and urban clutter define whether an aerial agent can safely enter the space ahead. However, existing UAV datasets mainly provide 2D annotations or 3D boxes, while driving-oriented occupancy benchmarks assume stable ground-level sensor rigs. Both miss the defining regime of low-altitude flight: a front-facing monocular camera observing occupied and free space from a moving aerial body with frame-wise changing 6-DoF pose and camera extrinsics. To bridge this gap, we introduce SkyShield, to the best of our knowledge the first front-view monocular semantic occupancy benchmark for urban UAV flight below 20 meters. Built on CARLA, SkyShield contains 36K front-view UAV samples across diverse urban scenes and weather conditions, pairing each image with frame-wise 6-DoF UAV pose, frame-wise dynamic camera geometry, UAV states, and front-frustum semantic occupancy labels. We further propose KAR-mIoU, a UAV-centric and dynamics-aware metric that re-weights voxel-level evaluation by kinematic reachability and time-to-collision, revealing safety-critical risks hidden by conventional mIoU. To tackle this challenging new setting, we provide SkyOcc, a geometry-first monocular baseline that integrates frame-wise UAV attitude into projection, fuses temporal occupancy features, and applies safety-prior optimization to preserve sparse collision-critical structures. Together, SkyShield, KAR-mIoU, and SkyOcc establish occupancy as a safety interface for low-altitude aerial autonomy. Code and dataset will be released publicly.


2606.00732 2026-06-04 cs.AI cs.LG

SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition

SHARP: 基于睡眠的分层加速重放用于长程非平稳时间模式识别

Jayanta Dey, Shikhar Srivastava, Itamar Lerner, Christopher Kanan, Dhireesha Kudithipudi

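Affiliations: Department of Computer Engineering, University of Texas at San Antonio, USA; Department of Computer Science, University of Rochester, USA; Department of Psychology, University of Texas at San Antonio, USA

AI summary: Proposes SHARP, a framework that decomposes temporal learning into a memory module and a pattern-recognition module and adds offline sleep phases that replay temporally structured memories in accelerated form, enabling efficient learning of long-range non-stationary sequence patterns.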


AI中文摘要

学习长程非平稳时间模式仍然是现代序列模型的核心挑战，特别是在严格的流式设置中。在这些设置中，数据按顺序到达，必须单次处理，不能同时回顾过去的观测。标准架构，包括循环神经网络和变换器，受到截断时间反向传播或显式输入窗口长度的限制，无法进行长程信用分配。为了解决这些限制，我们提出了SHARP（基于睡眠的分层加速重放），一个将时间学习分解为两个互补组件的框架：一个累积过去输入的结构化历史的记忆模块，以及一个在该记忆上操作的模式识别模块。这种分离通过消除跨多步时间反向传播进行长程信用分配的需求，实现了对非平稳动态的资源高效和计算高效适应。受啮齿动物在慢波睡眠期间观察到的加速重放启发，SHARP引入了离线（睡眠）阶段，其中时间结构的记忆痕迹以加速形式重放并整合到更高层次的记忆表示中，从而改善长程上下文保留。通过受控模拟和消融研究，我们表征了所提出框架的关键属性。在text8和PG-19等基准数据集上，我们证明SHARP通过保留先前见过数据的下一个令牌预测性能，同时继续从当前流中学习并泛化到未来未见数据，改进了循环基线。这些增益得益于其分层结构，该结构以线性时间计算成本实现了指数级增长的有效时间上下文。

英文摘要

Learning long-range non-stationary temporal patterns remains a core challenge for modern sequence models, particularly in strict streaming settings. In these settings, data arrive sequentially and must be processed in a single pass without simultaneously revisiting past observations. Standard architectures, including recurrent neural networks and transformers, are constrained in long-range credit assignment by either the truncated backpropagation-through-time horizon or the explicit input window length. To address these limitations, we propose SHARP (Sleep-based Hierarchical Accelerated Replay), a framework that decomposes temporal learning into two complementary components: a memory module that accumulates a structured history of past inputs, and a pattern-recognition module that operates over this memory. This separation enables resource- and compute-efficient adaptation to non-stationary dynamics by eliminating the need for backpropagation through time across many steps for long-range credit assignment. Inspired by the accelerated replay observed in rodents during slow-wave sleep, SHARP incorporates offline (sleep) phases in which temporally structured memory traces are replayed in an accelerated form and integrated into higher-level memory representations, improving long-range context retention. Through controlled simulations and ablation studies, we characterize the key properties of the proposed framework. On benchmark datasets such as text8 and PG-19, we demonstrate that SHARP improves over recurrent baselines by retaining next-token predictive performance on previously seen data while continuing to learn from the current stream and generalizing to future unseen data. These gains are enabled by its hierarchical structure, which yields an exponentially increasing effective temporal context with only linear-time computational cost.
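The closing claim, exponentially growing temporal context at linear cost, is a general property of hierarchies in which each level consolidates a fixed number of slots from the level below. The sketch below counts updates in such a generic hierarchy. It is not SHARP's actual mechanism: the `fanout` parameter, the consolidation schedule, and the function name are assumptions made for illustration.

```python
def hierarchy_cost(num_steps, levels, fanout):
    """Count updates per level when level k consolidates every fanout**k steps."""
    updates = [0] * levels
    for t in range(1, num_steps + 1):
        for k in range(levels):
            if t % (fanout ** k) == 0:
                updates[k] += 1
    # One slot at level k summarizes fanout**k raw steps.
    spans = [fanout ** k for k in range(levels)]
    return updates, spans


updates, spans = hierarchy_cost(num_steps=64, levels=4, fanout=2)
print(updates)       # updates performed at each level
print(spans)         # raw steps covered by one slot at each level
print(sum(updates))  # total work across all levels
```

The per-slot span grows geometrically with depth, while total work is bounded by num_steps * fanout / (fanout - 1), which is linear in the stream length.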


2606.00356 2026-06-04 cs.CL

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

自动解释标签的泛化程度：跨语言、文字和改写的受控研究

Sripad Karne

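Affiliations: Columbia University

AI summary: Using Serbian digraphia as a controlled experiment, studies whether natural-language labels for sparse autoencoder features truly generalize across languages and scripts, and finds substantial gaps in how well the labels match semantic content, which widen with network depth.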


AI中文摘要

稀疏自编码器（SAE）特征越来越多地用于解释语言模型，自动生成的自然语言标签是理解每个特征含义的主要接口。我们询问这些标签是否泛化：标记为某个概念的特征是否真的跨语言和文字追踪该概念？使用塞尔维亚双文字系统作为受控测试平台——通过确定性音译将同一语言以拉丁字母和西里尔字母书写——我们首先发现，由不同语言、文字和措辞中的相同内容激活的SAE特征集具有显著重叠（峰值Jaccard相似度0.57，随机基线0.13），表明存在真正的跨语言语义特征。然后我们测试自动解释标签是否跟上步伐。它们通常没有：标签描述语义内容的特征在塞尔维亚语中错过相同含义的频率比英语中高出多达4倍，并且错过塞尔维亚西里尔字母比塞尔维亚拉丁字母更多——这两种文字是彼此的确定性音译——表明失败追踪了每种形式在训练中的表现程度。差距随着网络深度增加而扩大，但标签没有给出任何失败指示。这些结果表明，自动解释标签可能反映特征在良好表示输入上的行为，而不是概念本身。

英文摘要

Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these labels generalize: does a feature labeled for a concept actually track that concept across languages and scripts? Using Serbian digraphia as a controlled testbed--the same language written in both Latin and Cyrillic via deterministic transliteration--we first find that SAE feature sets activated by the same content in different languages, scripts, and wordings share substantial overlap (mean Jaccard 0.39 vs. 0.13 random baseline, peaking at 0.57), suggesting genuine cross-lingual semantic features. We then test whether auto-interpretation labels keep pace. They often do not: features whose labels describe semantic content miss the same meaning in Serbian up to 4x more often than within English, and miss Serbian Cyrillic more than Serbian Latin--two scripts that are deterministic transliterations of each other--suggesting the failures align with how well each form is represented in training. The gap grows with network depth, yet the labels give no indication that they fail. These results suggest that auto-interpretation labels may reflect a feature's behavior on well-represented inputs rather than the concept itself.
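The overlap figures quoted in this abstract are Jaccard similarities between sets of active features. A minimal sketch of that measure follows; the feature IDs are invented, and how the paper selects "active" features for each input is not stated in the abstract, so it is left out here.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections of IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical IDs of features active on the same sentence in two scripts.
latin_features = {101, 205, 318, 442, 907}
cyrillic_features = {101, 205, 318, 560, 733}

print(jaccard(latin_features, cyrillic_features))  # 3 shared IDs out of 7 distinct
```

A score near the 0.13 random baseline would indicate that the two inputs activate essentially unrelated features, while a score of 1.0 would mean identical feature sets.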


2606.00260 2026-06-04 cs.CV cs.LG

LastAct: Trajectory-Guided Latest-Activity Localization for Real-Time Smart-Home Activity Recognition

LastAct: 轨迹引导的最新活动定位用于实时智能家居活动识别

Zishuai Liu, Ruili Fang, Jin Lu, Fei Dou

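Affiliations: School of Computing, University of Georgia

AI summary: Proposes LastAct, a framework that uses trajectory image sequences and a boundary localizer to address boundary contamination in sliding windows, enabling real-time smart-home activity recognition.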


AI中文摘要

基于环境传感器的人类活动识别（HAR）支持健康监测和辅助生活等智能家居应用。然而，在实际部署中，传感器事件以连续流的形式到达，活动边界未知。因此，滑动窗口推理会产生许多跨越转换并包含混合活动的窗口，造成边界污染，违反了大多数基准和模型使用的预分割实例假设。此外，许多管道通过将传感器ID视为独立标记来未充分利用空间上下文。我们提出了LastAct，一个面向轨迹的流式智能家居HAR框架，旨在处理混合窗口下的最新活动，同时显式建模空间结构。LastAct将传感器事件投影到家庭平面图上，形成保持空间连续性的布局对齐轨迹图像序列。一个轻量级门控识别受污染的窗口，边界定位器估计最后一个转换，从而实现边界引导的掩码，强调边界后的证据并抑制过时的上下文。为了提高效率，我们重用预计算的布局对齐模板缓存以避免重复渲染。实验表明，在四个公开的智能家居数据集上，采用接近真实的混合活动协议，LastAct在纯窗口上达到竞争性或更优的性能，并在交叉/混合窗口上获得显著的Macro-F1增益，展示了在接近真实的滑动窗口机制下更强的鲁棒性。

英文摘要

Human Activity Recognition (HAR) from ambient sensors enables smart-home applications such as health monitoring and assisted living. In realistic deployments, however, sensor events arrive as a continuous stream and activity boundaries are unknown. Sliding-window inference therefore produces many windows that straddle transitions and contain mixed activities, creating boundary contamination that violates the pre-segmented instance assumption used by most benchmarks and models. Moreover, many pipelines under-use spatial context by treating sensor IDs as independent tokens. We present LastAct, a trajectory-centric framework for streaming smart-home HAR that targets the most recent activity under mixed windows while explicitly modeling spatial structure. LastAct projects sensor events onto the home floorplan to form a layout-aligned trajectory image sequence that preserves spatial continuity. A lightweight gate identifies contaminated windows, and a boundary localizer estimates the last transition to enable boundary-guided masking that emphasizes post-boundary evidence and suppresses stale context. For efficiency, we reuse a precomputed layout-aligned template cache to avoid repeated rendering. Empirically, across four public smart-home datasets under near-realistic mixed-activity protocols, LastAct achieves competitive or superior performance on pure windows and yields substantial Macro-F1 gains on cross/mixed windows, demonstrating improved robustness under near-realistic sliding-window regimes.
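The idea of boundary-guided masking can be shown on a toy window. This sketch assumes the last transition index is already known and uses per-event activity labels as a stand-in for model evidence; LastAct itself operates on trajectory images with a learned gate and localizer, so the function names, the `stale_weight` parameter, and the voting rule are all hypothetical.

```python
from collections import defaultdict


def boundary_mask(window_len, boundary_idx, stale_weight=0.1):
    """Down-weight events before the estimated last transition."""
    return [stale_weight if i < boundary_idx else 1.0 for i in range(window_len)]


def weighted_vote(labels, weights):
    """Return the label with the largest total weight."""
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# A mixed window: six stale events followed by four from the latest activity.
window = ["cooking"] * 6 + ["sleeping"] * 4
uniform = [1.0] * len(window)
masked = boundary_mask(len(window), boundary_idx=6)

print(weighted_vote(window, uniform))  # stale context dominates
print(weighted_vote(window, masked))   # post-boundary evidence dominates
```

Without the mask the older activity wins by count; with it, the most recent activity is recovered, which is the behavior the abstract describes for contaminated windows.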


2606.00012 2026-06-04 cs.CL cs.AI

DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset

DraDDP：多模态多方对话话语解析数据集

Shannan Liu, Peifeng Li, Yaxin Fan, Qiaoming Zhu

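Affiliations: School of Computer Science and Technology, Soochow University

AI summary: To address prior work being limited to text or two-party dialogue, builds DraDDP, the first publicly available English multimodal multi-party dialogue discourse parsing dataset, based on American TV dramas, and verifies the value of multimodal information for capturing dialogue structures and relation types.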

Journal ref Findings of the Association for Computational Linguistics (ACL 2026)


AI中文摘要

多方对话话语解析旨在识别对话中话语之间的依赖结构和关系类型。以往的研究大多局限于文本模态或双方对话，无法满足多模态和多方对话场景。本文基于美国电视剧，构建了首个公开的英文多模态多方对话话语解析数据集DraDDP。该数据集包含495个对话片段，共6,374条话语和9.1小时的并行视频内容，涵盖了丰富的多方交互场景。此外，我们在DraDDP上评估了该任务，并深入分析了不同模态的影响，建立了全面的基准。实验结果表明，多模态信息在捕捉对话结构和关系类型方面具有重要价值。我们将公开发布数据集、标注指南和代码，以促进多模态对话理解的未来研究。

英文摘要

Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previous studies are mostly limited to the textual modality or to two-party dialogue, and so do not cover multimodal, multi-party settings. In this paper, we construct the first publicly available English multimodal dataset DraDDP for multi-party dialogue discourse parsing, based on American TV dramas. DraDDP contains 495 dialogue segments with 6,374 utterances and 9.1 hours of parallel video content, covering rich multi-party interaction scenarios. Moreover, we establish comprehensive benchmarks by evaluating this task on DraDDP and conducting an in-depth analysis of the impact of different modalities. Experimental results demonstrate the value of multimodal information in capturing dialogue structures and relation types. We will publicly release the dataset, annotation guidelines, and code to promote future research in multimodal dialogue understanding.


2605.31483 2026-06-04 cs.CL cs.AI

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

BenHalluEval：孟加拉语大语言模型的多任务幻觉评估框架

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Ishmam Tashdeed, Md Taukir Azam Chowdhury

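Affiliations: Department of Computer Science and Engineering, Islamic University of Technology; Department of Computer Science and Engineering, University of California

AI summary: To fill the gap in hallucination evaluation of large language models for Bengali, proposes the BenHalluEval framework covering four tasks, constructs 12,000 hallucinated candidates, and introduces the dual-track calibration metric BenHalluScore, revealing substantial differences in hallucination calibration across models.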

Comments Preprint. Under review


AI中文摘要

尽管孟加拉语是世界上使用人数第六多的语言，但此前尚无工作系统评估大语言模型（LLMs）在孟加拉语上的幻觉。我们提出了BenHalluEval，一个针对孟加拉语的细粒度幻觉评估框架，涵盖四项任务：生成式问答（GQA）、孟加拉语-英语混合问答、摘要和推理。我们利用GPT-5.4从三个现有孟加拉语数据集中构建了12,000个幻觉候选，涵盖十二种任务特定的幻觉类型，并在双轨协议下评估了七个LLM，涵盖推理导向、多语言和孟加拉语中心类别，该协议独立测量真实实例上的假阳性率（轨道A）和幻觉候选上的幻觉检测率（轨道B）。为了同时惩罚两种失败模式并防止均匀响应偏差导致的分数膨胀，我们提出了BenHalluScore，一种双轨校准指标，在模型和任务上范围从7.72%到55.42%，揭示了幻觉校准的显著差异。链式思维提示作为一种缓解策略应用，会改变响应分布，但未能一致改善幻觉判别。BenHalluEval建立了首个针对孟加拉语的专用幻觉基准，并突显了单轨和仅提示评估方法在低资源语言环境中的不足。数据集和代码可在https://anonymous.4open.science/r/BanglaHalluEval-EB77获取。

英文摘要

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.
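The two tracks described in this abstract can be computed independently, which shows why a single track is misleading. The sketch below computes only the Track A false-positive rate and the Track B detection rate. It deliberately does not attempt BenHalluScore, whose formula is not given in the abstract; the function names and the judgment lists are hypothetical.

```python
def rate(flags):
    """Fraction of items that the model flagged as hallucinated."""
    return sum(flags) / len(flags)


def dual_track(flags_on_ground_truth, flags_on_hallucinated):
    """Track A: flags on correct answers. Track B: flags on hallucinated ones."""
    return {
        "track_a_false_positive_rate": rate(flags_on_ground_truth),
        "track_b_detection_rate": rate(flags_on_hallucinated),
    }


# A model that flags everything looks perfect on Track B alone.
always_yes = dual_track([True] * 50, [True] * 50)
# A more discriminating model: few false alarms, most hallucinations caught.
careful = dual_track([True] * 5 + [False] * 45, [True] * 40 + [False] * 10)

print(always_yes)
print(careful)
```

The always-flagging model reaches a full detection rate only because it also flags every correct answer, which is the uniform response bias a joint metric needs to penalize.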


2605.31039 2026-06-04 cs.CV

GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

GGT-100K：面向泛化真实世界图像恢复的生成式真实标签

Xiangtao Kong, Jixin Zhao, Lingchen Sun, Rongyuan Wu, Lei Zhang

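Affiliations: VISUAL COMPUTING LAB POLYU; The Hong Kong Polytechnic University; OPPO Research Institute

AI summary: Proposes using generative multimodal foundation models to synthesize high-quality targets from real low-quality images as ground truth, and builds GGT-100K, a dataset of about 100K pairs that markedly improves the real-world generalization of a range of image restoration models.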


AI中文摘要

真实世界图像恢复（IR）受限于高质量配对训练数据的稀缺。合成数据集丰富但常无法模拟真实退化，而真实配对数据集昂贵且难以获取。因此，在这些数据集上训练的IR模型在真实场景中泛化能力有限。本文提出利用生成式多模态基础模型（MFMs）从真实低质量（LQ）图像生成高质量（HQ）目标，即生成式真实标签（GGT）。我们首先对包括Nano-Banana-2和GPT-Image-2在内的九种最先进MFMs，在多种场景和退化类型的图像上进行了系统评估。结果表明，采用基于VLM自适应提示的Nano-Banana-2在合成感知真实且内容忠实的高质量目标方面能力最强，可作为LQ输入的GGT。随后，我们使用Nano-Banana-2构建GGT合成流水线，包括多阶段质量控制以确保数据可靠性，并构建了GGT-100K，一个包含103,707个训练对的LQ-HQ配对数据集，覆盖多样场景和复杂真实退化。还建立了500个图像对的测试集。大量实验表明，GGT-100K持续提升多种IR模型的真实世界泛化能力，尤其对微调生成模型进行IR任务有显著益处。我们的结果表明，MFMs可作为面向恢复的数据生成的实用工具，GGT-100K是扩展真实世界IR模型泛化边界的有用资源。

英文摘要

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline, which involves multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and GGT-100K is a useful resource to expand the generalization boundaries of real-world IR models.


2605.30947 2026-06-04 cs.CL

Extending AI for Research to the Humanities: A Multi-Agent Framework for Evidence-Grounded Scholarship

将人工智能研究扩展到人文学科：一个用于证据基础学术的多智能体框架

Yating Pan, Jiajun Zhang, Jun Wang, Qi Su

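Affiliations: Department of Information Management; Research Center for Digital Humanities; School of Foreign Languages; Institute for Artificial Intelligence

AI summary: Proposes SPIRE, a multi-agent framework that models recurring humanities operations as cooperating agent roles and combines them with multi-scale close-reading retrieval to produce evidence-grounded argument, outperforming existing methods on a classical-texts benchmark.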

Comments 28 pages, 3 figures. Code, data catalogues, and reproduction scripts: https://github.com/YatingPan/SPIRE. Lead corresponding author: Jun Wang; corresponding author: Qi Su


AI中文摘要

基于LLM的研究智能体在科学和工程领域取得了快速进展，这些领域的研究围绕可执行的实验、代码和定量信号组织。然而，人文学科学术需要一种不同的推理模式：对原始资料进行解释性、基于证据的论证，其中学术价值取决于忠实引用、可验证来源和细读。现有的研究智能体仍然主要针对执行和检索进行优化，而非基于证据的解释性推理。为了解决这一差距，我们引入了SPIRE（学术原语启发的研究引擎），一个用于基于证据的人文学科学术的多智能体框架。借鉴学术原语理论，SPIRE将人文学科中反复出现的操作转化为协作的智能体角色（来源发现、证据注释、比较、来源检查、抽样、引用绑定和论证综合），并基于多尺度细读基础，包括段落、上下文内图社区和跨上下文语义聚类。在一个针对古典中文和希腊罗马拉丁学术的同行评审论文基准上，SPIRE比Naive LLM、Text RAG和GraphRAG更可靠地恢复引用的原始来源证据，并在答案准确性、深度、覆盖范围和证据质量方面获得更高的盲审评分。消融实验表明，学术操作智能体和细读检索都对基于证据的论文有所贡献。代码、数据目录和复现脚本已在https://github.com/YatingPan/SPIRE发布。

英文摘要

LLM-based research agents have advanced rapidly in science and engineering, where research is organized around executable experiments, code, and quantitative signals. Humanities scholarship, however, requires a different mode of reasoning: interpretive, evidence-grounded argument over primary sources, where scholarly value depends on faithful quotation, verifiable provenance, and close reading. Existing research agents remain largely optimized for execution and retrieval, not evidence-grounded interpretive reasoning. To address this gap, we introduce SPIRE (Scholarly-Primitives-Inspired Research Engine), a multi-agent framework for evidence-grounded humanities scholarship. Drawing on Scholarly Primitives theory, SPIRE casts recurring humanities operations as cooperating agent roles (source discovery, evidence annotation, comparison, provenance checking, sampling, citation binding, and argumentative synthesis) over a multi-scale close-reading substrate of passages, intra-context graph communities, and cross-context semantic clusters. On a peer-reviewed-paper benchmark over classical Chinese and Greco-Roman Latin scholarship, SPIRE recovers cited primary-source evidence more reliably than Naive LLM, Text RAG, and GraphRAG, and receives higher blind-judge scores on answer accuracy, depth, coverage, and evidence quality. Ablations show that both the scholarly-operation agents and close-reading retrieval contribute to evidence-grounded essays. Code, data catalogues, and reproduction scripts are released at https://github.com/YatingPan/SPIRE.


2605.30705 2026-06-04 cs.CV cs.LG

Equivariant Latent Alignment via Flow Matching under Group Symmetries

群对称下通过流匹配的等变潜在对齐

Sunghyun Kim, Jaehoon Hahm, Jeongwoo Shin, Joonseok Lee

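Affiliations: University of Illinois Urbana-Champaign, Illinois, USA; Seoul National University, Seoul, Korea

AI summary: To address the equivariance misalignment that existing methods exhibit in latent space, proposes Residual Latent Flow, a flow-based framework that corrects misaligned latents to strengthen equivariance consistency under the rotation groups SO(n) and improve novel view synthesis quality.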


AI中文摘要

几何感知生成模型和新视角合成方法在视觉保真度和一致性方面展现出强大潜力。同时，等变表示学习已成为构建潜在空间的有力框架，其中分析已知的群变换可以直接作用，捕捉数据中的几何结构，并增强新视角合成的可解释性和泛化性。然而，我们发现现有方法常遭受潜在错位问题，即潜在空间中预期的群作用与实际所需的变换之间存在差异。因此，学习到的潜在表示往往无法一致地保持底层群对称性所施加的等变关系。为解决此问题，我们提出残差潜在流，一种基于流的框架，用于纠正错位的潜在表示，从而提高对底层等变关系的遵从性。我们的综合实验表明，在旋转群SO(n)下，我们的方法显著减少了潜在错位，并提高了新视角合成的质量。

英文摘要

Geometry-aware generative models and novel view synthesis approaches have shown strong potential in visual fidelity and consistency. In parallel, equivariant representation learning has emerged as a powerful framework for constructing latent spaces where analytically known group transformations can act directly, capturing geometric structure in data and enhancing both interpretability and generalization in novel view synthesis. However, we identify that existing approaches often suffer from latent misalignment, a discrepancy between the intended group action and the actually required transformations in the latent space. Consequently, the learned latents often fail to consistently preserve the equivariant relations imposed by the underlying group symmetry. To address this, we propose Residual Latent Flow, a flow-based framework that corrects the misaligned latents, thereby improving compliance with the underlying equivariance relation. Our comprehensive experiments show that our method significantly reduces latent misalignment and improves novel view synthesis quality under the rotation groups SO(n).
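Latent misalignment, as defined in this abstract, can be measured as the gap between encoding a transformed input and transforming the encoded input. The sketch below does this for planar rotations, SO(2), under the simplifying assumption that the group acts on the latent by the same rotation as on the input. Both encoders are toy functions invented for the example and are unrelated to the paper's models.

```python
import math


def rotate(v, theta):
    """Rotate a 2D vector by angle theta (an element of SO(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def misalignment(encoder, x, theta):
    """Norm of encoder(g . x) - g . encoder(x) for the rotation g by theta."""
    a = encoder(rotate(x, theta))
    b = rotate(encoder(x), theta)
    return math.hypot(a[0] - b[0], a[1] - b[1])


def aligned_encoder(v):
    # Uniform scaling commutes with rotation, so this map is equivariant.
    return (2.0 * v[0], 2.0 * v[1])


def misaligned_encoder(v):
    # A constant offset does not rotate with the input, breaking equivariance.
    return (2.0 * v[0] + 0.3, 2.0 * v[1])


x = (1.0, 0.5)
theta = math.pi / 3
print(misalignment(aligned_encoder, x, theta))     # zero up to rounding
print(misalignment(misaligned_encoder, x, theta))  # nonzero residual
```

A correction method in the spirit of the abstract would aim to drive the second quantity toward the first; how Residual Latent Flow does so is not described at a level that this sketch could reproduce.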


2605.28210 2026-06-04 cs.AI cs.CY cs.HC q-bio.NC

The Illusion of Opting in AI-Mediated Consequential Decisions

AI中介的后果性决策中的选择错觉

Eugene Yu Ji

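Affiliations: GitHub

AI summary: Building on Ullmann-Margalit's concept of opting, shows that current AI systems create an "illusion of opting", in which seemingly meaningful consequential choices in fact weaken the agent's genuine capacity to choose, and proposes three normative imperatives (existential honesty, ecological rationality, and counterfactual reparation) to protect and cultivate meta-capacity.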

Comments 11 pages, 1 figure, 2 tables


AI中文摘要

借鉴Ullmann-Margalit的选择概念（变革性、不可逆性、被排除替代方案的阴影），我们表明当前AI系统引发了一个深刻的伦理问题，而现有AI伦理尚未充分捕捉：选择错觉，即个人和群体遭遇看似有意义的后果性选择的欺骗性外观，而成为真正能够选择所需的主体性却被削弱。针对将AI主要视为给定目标优化器的进路，我们认为应通过AI系统是否保护和发展对抗选择错觉的元能力来评估：这种元能力是社会和制度支撑的主体能力，通过它手段和目的得以形成、争论、修订和拥有。这种重新框架对于弱势群体尤为紧迫，当AI中介的路径误导行为和行动时，他们最无力承担选择错觉的成本。我们为AI中介的后果性决策提出三个规范要义：存在诚实，承认预测的局限性；生态理性，将指导置于异质的生活生态中；以及反事实修复，当AI中介的决策路径失败时，承认并修复被排除的替代方案。

英文摘要

Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, we argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting: the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. We propose three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.


2605.24358 2026-06-04 cs.LG cs.AI

Treatment Effect Estimation with Differentiated Networked Effect on Graph Data

图数据上具有差异化网络效应的处理效应估计

Xiaofeng Lin, Han Bao, Hisashi Kashima

发表机构 * Kyoto University（京都大学）； The Institute of Statistical Mathematics（统计数学研究所）； Tohoku University（东北大学）； RIKEN AIP（理化学研究所AIP）

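AI summary: Individual treatment effect estimation on graph data is affected by interference from neighbors and by differentiated networked effects. To address this, the paper proposes an interference modeling method that combines partial attention mechanisms with a message amplifier to capture differences in neighbor importance and scale, improving estimation accuracy.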

Comments: Accepted by the research track of the KDD 2026 conference

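AI abstract (translated from Chinese)

Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce and medicine. The task is challenging because of interference: an individual's outcome can be influenced by the treatments and covariates of its neighbors. Existing methods attempt to model this interference for accurate ITE estimation. However, a key issue is often overlooked: differentiated networked effect (DNE), the effect produced by local networks made up of neighbors with varying importance and scales. Capturing DNE is essential; otherwise, mischaracterized interference yields imprecise ITE estimates, which in turn lead to misguided decisions. To address this challenge, we propose a novel interference modeling mechanism that combines two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate the importance of different neighbors to the interference, while the message amplifier adjusts the output of the interference modeling mechanism according to the scale of the neighbors; together these enable the model to capture DNE. Experiments on three real-world graphs show that our method outperforms existing methods for ITE estimation from graph data, confirming the importance of explicitly capturing DNE.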

English abstract

Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce and medicine. This task is challenging due to interference, where individual outcomes can be influenced by the treatments and covariates of their neighbors. Existing methods attempt to model such interference for accurate ITE estimation. However, a critical issue is often overlooked: differentiated networked effect (DNE), an effect caused by local networks consisting of neighbors with varying importance and scales. Capturing DNE is vital; otherwise, we will end up with imprecise ITE estimation due to an erroneous characterization of interference, which can result in misguided decisions. To address this challenge, we propose a novel interference modeling mechanism that incorporates two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate the importance of different neighbors in contributing to interference, while the message amplifier adjusts the results of the interference modeling mechanism based on the scale of neighbors, all of which enables the model to capture DNE. Experiments on three real-world graphs demonstrate that our methods outperform existing approaches for ITE estimation from graph data, which corroborates the importance of explicitly capturing DNE.

2605.30021 2026-06-04 cs.CL

Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

Vinay Samuel, Yapei Chang, Mohit Iyyer

Affiliations: University of Maryland, College Park

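AI summary: Proposes REDIPO, a data-construction pipeline that uses offline DPO to recover diverse answers from base-model generations while preserving the instruct model's alignment performance.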

Comments: Under Review. 26 pages, 3 figures, 16 tables

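AI abstract (translated from Chinese)

Many open-ended instructions have multiple valid answers that users can benefit from seeing, but post-training often narrows an LLM's output space to a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from the base and instruct models, rewrites the base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. On Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44%, respectively, relative to the instruct checkpoints, whereas DivPO changes diversity by 0%, -6%, and -4% on the same models. These gains largely preserve MTBench, IFEval, and Arena-Hard performance and reduce the direct-category HarmBench attack success rate. Ablations show that marginal-diversity pair selection and base-response rewriting drive the diversity gains, while filtering and quality-bounded pairing help maintain alignment. Overall, our results show that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post-training. We release our code and data at https://github.com/vsamuel2003/ReDiPO.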

English abstract

Many open-ended instructions have multiple valid answers that users can benefit from seeing, but post-training often narrows an LLM's output space toward a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, while DivPO changes diversity by 0%, -6%, and -4% on the same models. These gains largely maintain MTBench, IFEval, and Arena-Hard performance, and reduce direct-category HarmBench attack success rate. Ablations show that marginal-diversity pair selection and base-response rewriting drive the diversity gains, while filtering and quality-bounded pairing help maintain alignment. Overall, our results show that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post-training. We release our code and data at https://github.com/vsamuel2003/ReDiPO.

2605.29861 2026-06-04 cs.CL cs.AI

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Xiaoxi Li, Zhicheng Dou

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China

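AI summary: Proposes Ptah, a multi-agent framework that generates multimodal reports interleaving text and visual evidence through planning, research, and writing stages, and introduces a verifier to ensure factual accuracy and cross-modal consistency.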

Comments: In progress

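AI abstract (translated from Chinese)

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging because open-ended synthesis lacks deterministic ground truth and textual arguments must be interleaved with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the full lifecycle from user query to rendered web report through planning, research, and writing stages, in which specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces human-facing multimodal reports that are more reliable, more visually informative, and more usable than those of strong baselines.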

English abstract

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines. Our code is released at https://github.com/SnowNation101/Ptah
