2606.16110 2026-06-16 cs.LG New submission

Auditing Machine Unlearning: A Systematic Research on Whether Models Truly Forget

Dayong Ye, Tianqing Zhu, Ruiding Huang, Xinbo Fu, Jiayang Li, Bo Liu, Huan Huo, Wanlei Zhou

Affiliations: University of Technology Sydney; Deakin University; Macquarie University

AI summary: Motivated by privacy regulation requirements, the paper proposes the first practical, general-purpose auditing framework, using the concept of proof of ignorance to verify whether existing unlearning algorithms truly erase the influence of designated data. Experiments show that retraining-based and fine-tuning-based methods are effective, while de-optimization-based and Fisher/Hessian-based methods fail.

Abstract

Machine unlearning has been extensively studied in response to growing privacy concerns and regulatory requirements. However, auditing whether unlearning algorithms have truly erased the influence of specific data remains an open challenge. The lack of reliable and practical auditing mechanisms can lead to critical privacy risks, such as residual information leakage. This paper initiates a systematic investigation into whether existing unlearning algorithms can truly forget the designated data. We propose the first practical and general-purpose auditing framework for machine unlearning, inspired by the concept of proof of ignorance. Our framework addresses the key practicality limitations of existing methods by eliminating the need for retraining-from-scratch baselines, avoiding the training of large numbers of shadow models, and requiring no intrusive intervention in the original training process. To evaluate the effectiveness of our framework, we first conduct validation experiments to verify its soundness and completeness. We then perform comprehensive experiments across six datasets and ten representative unlearning methods. The results demonstrate that our framework reliably distinguishes between successful and failed unlearning. In particular, we observe that retraining-based and fine-tuning-based methods can achieve effective unlearning, even when the target data remain in the original dataset. In contrast, de-optimization-based methods fail to achieve true unlearning and instead degrade the model's performance. Fisher/Hessian-based methods also fail to unlearn requested data, even when formal certification is provided. Moreover, we show that our framework is robust against fake unlearning attempts and generalizes well to large language models.

2606.16103 2026-06-16 cs.CV New submission

Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Guanyi Qin, Junjie Zhang, Chunming He, Yibing Fu, Jie Liang, Tianhe Wu, Lei Zhang

Affiliations: National University of Singapore; OPPO Research Institute; Nanyang Technological University; Duke University; City University of Hong Kong; The Hong Kong Polytechnic University

AI summary: Proposes Tool-IQA, which equips vision-language models with simple tools such as a magnifier and a gamma corrector, turning passive scoring into a tool-augmented workflow and significantly improving image quality assessment performance.

Abstract

Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may overwhelm the visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to uncover visibility and hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification for a calibrated quality score. Furthermore, to ensure efficient and purposeful tool calls, we introduce a batch-aware training strategy to reward tool interactions that can yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.
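
As a rough illustration of the two view tools named in the abstract, the following Python sketch implements a magnifier (crop plus nearest-neighbour upscaling) and a gamma corrector on a grayscale image stored as nested lists of floats in [0, 1]. The function names, the box convention, and the nearest-neighbour upscaling are assumptions made here for illustration; the abstract does not say how Tool-IQA implements its tools.

```python
def magnify(image, top, left, height, width, scale=2):
    """Crop a region and upscale it by an integer factor (nearest neighbour)."""
    crop = [row[left:left + width] for row in image[top:top + height]]
    out = []
    for row in crop:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [px for px in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def gamma_correct(image, gamma):
    """Apply out = in ** gamma to intensities in [0, 1]; gamma < 1 brightens shadows."""
    return [[px ** gamma for px in row] for row in image]


# Hypothetical 3x3 grayscale image.
image = [[0.0, 0.04, 0.16], [0.25, 0.36, 0.49], [0.64, 0.81, 1.0]]
zoomed = magnify(image, top=0, left=1, height=2, width=2, scale=2)
brightened = gamma_correct(image, gamma=0.5)
```

A gamma below 1 lifts dark regions, which is one plausible way a tool could reveal artifacts hidden in shadows; a gamma above 1 does the opposite for overexposed areas.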

2606.16078 2026-06-16 cs.RO New submission

A Deployment Case Study in Robotic Apparel Automation: Digital Twin Integration, Interoperability, and Workforce Enablement

Gokul Narayanan, Abhiroop Ajith, Jonathan Zornow, Carlos Calle, Auralis Herrero Lugo, Jose Luis Susa Rincon, Chengtao Wen, Eugen Solowjow

Affiliations: Siemens Corporation; Sewbo; Levi's; Bluewater Defense

AI summary: To address the difficulty of robotic manipulation caused by fabric deformability, this paper uses a denim manufacturing case study to present a robotic sewing system that integrates a digital thread, a digital twin, an interoperability layer, and runtime monitoring, enabling rapid deployment and improved robustness.

Comments: 4 pages, 3 figures, IEEE ICRA 2026 Workshop Paper

Abstract

Despite steady advances in flexible automation in sectors such as electronics and automotive manufacturing, apparel automation remains challenging because fabrics are deformable and difficult to manipulate with robots. This paper presents a deployment-oriented case study of a robotic sewing system for denim manufacturing, emphasizing the system-level integration required for practical adoption. At the engineering level, a digital thread module parses DXF production drawings into process parameters and executable robot trajectories, reducing manual programming effort and enabling rapid re-targeting across sewing operations. In parallel, a digital twin of the workcell is used during pre-deployment to validate reach and clearance, refine layout and sequencing, evaluate operator access, and assess cycle-time compatibility with upstream and downstream tasks, thereby reducing commissioning risk. At deployment, the system integrates a collaborative robot with conventional sewing equipment, welding, suction fixtures, and machine-level controllers through an interoperability layer. Runtime monitoring and verification, including seam monitoring, collision checking, and trajectory-level validation, improve robustness under environmental variability, while operator-facing training and guidance tools support setup, troubleshooting, and technology adoption. Two staged factory deployments on denim shorts, covering 2D pocket operations and 3D garment-shaping seams, show that digital-twin-based validation, digital-thread-driven task generation, interoperability, runtime verification, and operator training are important for scaling robotic apparel automation.

2606.16076 2026-06-16 cs.LG cs.AI cs.GT New submission

Phys-JEPA: Physics-Informed Latent World Models for Multivariate Time-Series Forecasting

Weizhi Nie, Weichao Liu, Honglin Guo, Yuting Su

Affiliations: Tianjin University

AI summary: Proposes the Phys-JEPA architecture, which imposes physical-consistency constraints on latent states and state transitions and decomposes predictive states into physical and residual components, improving forecasting accuracy on climate, traffic, and electricity datasets.

Comments: Submitted to arXiv as a preliminary manuscript. 10 figures

Abstract

Multivariate forecasting in physical systems requires models that predict coupled temporal variables while preserving meaningful state evolution. Deep forecasters can fit temporal correlations, and physics-informed models can regularize predictions with scientific constraints, but these directions are often connected only at the decoded-output level. As a result, the hidden predictive state that generates future trajectories may remain statistically useful but physically unstructured. We introduce Phys-JEPA, a physics-informed joint-embedding predictive architecture for multivariate time-series forecasting. Phys-JEPA learns a latent world model in which predictive states are decomposed into physical and residual components, and physical consistency is imposed directly on latent states and latent transitions rather than only on decoded forecasts. This formulation uses known physical variables to organize the representation space while retaining residual capacity for unresolved dynamics. On Jena Climate 2009–2016, Phys-JEPA reduces aggregate MSE from 0.12482 to 0.12273 and temperature MSE from 0.01892 to 0.01831 at H=24. On Traffic, full Phys-JEPA improves aggregate MSE over the supervised baseline across all tested horizons, reducing H=192 MSE from 0.800784 to 0.773873. On Electricity, the best variant depends on horizon: static latent consistency is strongest at H=24 and H=48, while full Phys-JEPA gives the best aggregate and target-variable MSE at H=192. These initial results suggest that moving physics-informed learning from output space to latent predictive state space is a promising direction for interpretable temporal world models.
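
To put the reported MSE pairs on a common scale, the short Python sketch below converts each baseline/Phys-JEPA pair from the abstract into a relative reduction. Only the six MSE values come from the abstract; the helper name and dictionary keys are illustrative.

```python
def relative_reduction(baseline, improved):
    """Fractional MSE reduction relative to the baseline."""
    return (baseline - improved) / baseline


# (baseline MSE, Phys-JEPA MSE) pairs as reported in the abstract.
reported = {
    "Jena aggregate, H=24": (0.12482, 0.12273),
    "Jena temperature, H=24": (0.01892, 0.01831),
    "Traffic, H=192": (0.800784, 0.773873),
}
reductions = {name: relative_reduction(b, m) for name, (b, m) in reported.items()}

for name, r in reductions.items():
    print(f"{name}: {100 * r:.2f}% lower MSE")
```

The relative reductions come out at roughly 1.7%, 3.2%, and 3.4%, which is consistent with the abstract's description of these as initial results.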

2606.16075 2026-06-16 cs.LG cs.CV New submission

AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

Yang Shi, Songwen Pei, Yang Gao, Bingxue Zhang

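Affiliations: University of Shanghai for Science and Technology; Fudan University

AI summary: To address value allocation in multi-stage collaboration in generative AI, proposes the AME framework, which integrates heterogeneous data contribution assessment, data rights mapping, and trusted execution to achieve low-cost value allocation consistent with human judgment.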

2606.16074 2026-06-16 cs.CL cs.AI New submission

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Affiliations: Yale School of Medicine; Yale School of Public Health; Texas State University

AI summary: Proposes PVminerLLM2, which uses preference optimization together with a token-level gated stabilization term and confusion-aware preference pair construction to address fine-grained errors that supervised fine-tuning struggles with, outperforming baseline models on structured patient voice extraction.

Abstract

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2
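
The abstract mentions inverse-frequency reweighting to counter class skew but gives no formula. The Python sketch below shows one common form, weight = N / (K × count), where N is the number of examples and K the number of classes. This particular normalization and the toy labels are assumptions for illustration, not necessarily what PVminerLLM2 uses.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count) so that rare classes count more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical skewed label set: 6 x "A", 3 x "B", 1 x "C".
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = inverse_frequency_weights(labels)

# With this normalization the per-example weights still sum to N.
total_weight = sum(weights[label] for label in labels)
```

Under this scheme each class contributes the same total weight to the loss, so the rarest class is no longer drowned out by the most frequent one.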

2606.16073 2026-06-16 cs.LG stat.ML New submission

Stop the Sampler! Classifier-Based Adaptive Stopping for Sampling Kernels

Kirill Korolev, Nikita Morozov, Stepan Pavlenko, Esmeralda S. Whitammer, Sergey Samsonov

Affiliations: Stanford University

AI summary: Proposes treating MCMC trajectory termination as a learnable component, using non-acyclic generative flow networks to train state-dependent classifiers that adaptively stop sampling while satisfying detailed balance conditions, significantly shortening trajectories and improving mode coverage and mixing.

Comments: ICML 2026 SPIGM Workshop

Abstract

Sampling from complex, unnormalized probability densities is a fundamental challenge in Bayesian inference and probabilistic modeling. While Markov chain Monte Carlo (MCMC) methods provide asymptotic guarantees, they often suffer from slow mixing and high computational costs due to fixed or manually tuned trajectory lengths. In this work, we propose a novel framework that treats trajectory termination as a learnable component of the sampling dynamics. By framing MCMC within the theory of non-acyclic generative flow networks (GFlowNets), we train state-dependent neural classifiers to decide when a trajectory has reached a high-density region and should terminate. We theoretically establish the connection between optimal classifiers and the target density via detailed balance conditions and introduce a multilevel training scheme to facilitate exploration in complex geometries. Experimental results across various benchmark densities demonstrate that our approach significantly reduces average trajectory lengths while improving mode coverage and mixing compared to standard MCMC baselines.

2606.16067 2026-06-16 cs.CV New submission

Stepwise Token Selection for Efficient Multimodal Large Language Models

Landi He, Shawn Young, Lijian Xu

Affiliations: Shenzhen University of Advanced Technology

AI summary: Proposes a pointer-based stepwise visual token selection method that is trained end-to-end via a differentiable relaxation and dynamically decides how many tokens to keep, retaining 94.6% of the original accuracy with a 1.88x prefill speed-up when 88.9% of visual tokens are removed.

Abstract

In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.

2606.16062 2026-06-16 cs.AI cs.LG New submission

Auditing Reward Hackability in Code RL Training Environments

Shreshth Rajan

Affiliations: GitHub

AI summary: Measures the rate at which code RL environments accept incorrect solutions, finds that 28.5% of tasks in a 49-task sample of SWE-bench Verified have weak test suites, and proposes hardening vulnerable tasks with an LLM judge and a Docker gold-sanity gate.

Abstract

We measure the rate at which code RL environments accept incorrect solutions as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that a Docker-verified incorrect patch passes them. On 20 R2E-Gym tasks across 6 repositories, the same pipeline at single-shot exploit generation yields 25.0%. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified finds, within the same human-rated difficulty stratum, model Pass@1 is +14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models positive). We then describe a procedure for hardening the broken tasks. An inline LLM judge with a Docker gold-sanity gate runs each generated test against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, the gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate the LLM judge alone misses. With diversity-biased retry, the loop converges 9 of 11 tasks to a gated upgrade.
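
The gold-sanity gate described above can be sketched in a few lines of Python: every generated test is first run against the gold solution, and any test the gold solution fails is rejected before a judge ever sees it. This is a minimal in-process sketch under stated assumptions; the paper runs tests in Docker, and the function names, toy tests, and `gold_abs` solution here are hypothetical.

```python
def gold_sanity_gate(generated_tests, gold_solution):
    """Split tests into those the gold solution passes and those it fails."""
    kept, rejected = [], []
    for test in generated_tests:
        try:
            ok = bool(test(gold_solution))
        except Exception:
            ok = False  # a test that crashes on the gold solution is also rejected
        (kept if ok else rejected).append(test)
    return kept, rejected


def gold_abs(x):
    """Hypothetical gold solution: absolute value."""
    return x if x >= 0 else -x


tests = [
    lambda f: f(-3) == 3,     # valid test
    lambda f: f(0) == 0,      # valid test
    lambda f: f(-2) == -2,    # defective: contradicts the gold solution
    lambda f: f("a") == "a",  # defective: raises TypeError on the gold solution
]
kept, rejected = gold_sanity_gate(tests, gold_abs)
defect_rate = len(rejected) / len(tests)

# The defect rate reported in the abstract: 65 of 105 tests failed on the gold patch.
reported_rate = 65 / 105
```

The gate is cheap because it needs only the gold patch, and it catches a failure mode a judge reading the test text can miss: a test that looks reasonable but is simply wrong.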

2606.16059 2026-06-16 cs.LG cs.AI New submission

Mojo: A Promising Tool for Scalable Financial AI Efficiency

Henry Han

Affiliations: Data Science and Artificial Intelligence Innovation Laboratory, School of Engineering and Computer Science, Baylor University

AI summary: Introduces the Mojo language, which uses MLIR compilation and deterministic kernel design to address the Python-to-C++ performance gap and numerical inconsistencies in quantitative finance, achieving 20x to 180x speedups on financial AI workloads.

Comments: 15, 3 figures

Abstract

For thirty years, quantitative finance has paid a costly two-language tax: models researched in Python are rewritten in C++ for production, often introducing numerical discrepancies. GPU-accelerated deep learning exacerbates this problem, as nondeterministic floating-point reductions can produce drift in long backtests, challenging regulatory reproducibility and auditability expectations. This article surveys Mojo, Modular's 2026 Python-like systems language, as a structural response for capital markets engineering. While closing the Python-to-C++ performance gap, Mojo uniquely combines native interoperability with the low-level systems control required to construct bit-exact deterministic kernels. Its MLIR compilation infrastructure further allows a single codebase to target scalar, SIMD, multicore, and GPU execution, reducing the translation bottleneck between research and production. We benchmark four core financial AI workloads: Monte Carlo option pricing, LLM sentiment inference, multi-asset backtesting, and portfolio Value at Risk. On Apple Silicon, Mojo demonstrates 20x to 180x speedups over pure Python on directly measured kernels; larger-scale GPU workload results are projections calibrated from published benchmarks. Alongside transparent performance data, we introduce mojo-deterministic, an open-source library of reproducible reduction kernels, and provide a candid assessment of the problems Mojo does and does not yet solve.
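
The reproducibility concern raised above comes from floating-point addition not being associative, so a reduction whose order varies between runs can change the low-order bits of the result. The sketch below illustrates the idea in Python rather than Mojo, and it is not the mojo-deterministic API: `pairwise_sum` is a hypothetical helper that fixes the reduction tree so the result depends only on the input values and their order.

```python
import random


def pairwise_sum(values):
    """Sum with a fixed binary-tree order that depends only on len(values)."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# Floating-point addition is not associative, so reduction order matters.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left

# Hypothetical return series; a fixed reduction tree reproduces the same bits every time.
rng = random.Random(0)
returns = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
same_bits = pairwise_sum(returns) == pairwise_sum(list(returns))
```

A parallel reduction becomes bit-reproducible under the same principle: the combination order must be a function of the data layout alone, not of thread scheduling.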

2606.16056 2026-06-16 cs.LG cs.HC New submission

Beyond the Blood Draw: Explainable Machine Learning for Non-Invasive Dysglycemia Risk Screening

Black Sun, Chenyi Zhang, Kaiyi Ji, Xi Lu

Affiliations: Department of Computer Science, Aarhus University; University at Buffalo, SUNY

AI summary: Trains six machine learning models, including LightGBM, on NHANES data for dysglycemia risk screening without laboratory tests, reaching an AUC of 0.820, outperforming traditional risk scores, and identifying age, race/ethnicity, and waist-to-height ratio as key predictors.

Abstract

Dysglycemia, encompassing both prediabetes and diabetes, affects large numbers of adults worldwide, yet many of them remain undiagnosed. We developed and validated machine-learning (ML) models for non-invasive screening of dysglycemia risk that require no laboratory tests. Pooling data from the National Health and Nutrition Examination Survey (NHANES) 2017–2023 (n=14,352), we trained six ML models with stratified 5-fold cross-validation and compared them with two established clinical risk scores. LightGBM achieved the highest area under the receiver operating characteristic curve (AUC=0.820, 95% CI: 0.806–0.835), outperforming the Finnish Diabetes Risk Score (0.745) and American Diabetes Association Risk Test (0.783). SHAP analysis identified age, race/ethnicity, and waist-to-height ratio as the most influential predictors. Subgroup analyses confirmed consistent performance across demographic strata (AUC: 0.735–0.832). These results demonstrate the feasibility of explainable, laboratory-free dysglycemia screening for deployment in community settings and self-tracking health applications.

2606.16050 2026-06-16 cs.LG cs.AI New submission

ALCL: An Adaptive Log-Correntropy Loss for Robust Learning under Non-Gaussian Noise

Mainak Kundu, Ria Kanjilal, Ismail Uysal

Affiliations: University of South Florida; California Polytechnic State University

AI summary: Proposes the Adaptive Log-Correntropy Loss (ALCL), which jointly learns shape and scale parameters through differentiable reparameterization so that the loss geometry adapts dynamically to residual statistics and suppresses extreme outliers, outperforming MSE and fixed-kernel correntropy losses under mixed heavy-tailed and impulsive noise.

Abstract

Robust deep learning under heavy-tailed and impulsive noise remains challenging because conventional losses such as mean squared error (MSE) exhibit unbounded sensitivity to outliers. Although correntropy-based objectives improve robustness, existing formulations rely on fixed kernel parameters that must be empirically tuned and remain static during training. To address these limitations, we propose an Adaptive Log-Correntropy Loss (ALCL), a heavy-tailed loss formulation that adaptively learns its robustness geometry during optimization. ALCL introduces a logarithmic residual model whose shape and scale parameters are learned jointly with network weights through differentiable reparameterization. This yields a principled maximum likelihood formulation whose influence function is formally bounded and redescending, allowing the loss geometry to adapt dynamically to evolving residual statistics while suppressing extreme outliers. Comparative experiments on four widely used benchmark datasets spanning grayscale and red-green-blue (RGB) image data under mixed heavy-tailed and impulsive noise demonstrate that ALCL consistently outperforms MSE and optimally tuned generalized correntropy losses in both reconstruction fidelity and downstream classification accuracy. While performance differences remain small under low-noise conditions, under high-noise regimes ALCL improves median accuracy by up to 4.75% on grayscale benchmarks and 4.51% on RGB datasets, with reduced variance across runs. These results demonstrate that adaptive robustness through joint learning of loss parameters provides a computationally efficient alternative to static correntropy-based losses for deep learning in non-Gaussian environments.
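
To make "bounded and redescending influence function" concrete, the Python sketch below compares MSE with a simple logarithmic loss, rho(r) = log(1 + (r/s)^2), whose influence function is psi(r) = 2r / (s^2 + r^2). This Cauchy-type loss with a fixed scale s is a stand-in chosen for illustration; it is not the ALCL formulation, whose exact form and learned shape and scale parameters are not given in the abstract.

```python
import math


def log_loss(r, scale):
    """Stand-in logarithmic loss: log(1 + (r / scale)^2)."""
    return math.log1p((r / scale) ** 2)


def log_influence(r, scale):
    """Derivative of log_loss with respect to r: 2r / (scale^2 + r^2)."""
    return 2.0 * r / (scale ** 2 + r ** 2)


def mse_influence(r):
    """Derivative of r^2: grows without bound as the residual grows."""
    return 2.0 * r


scale = 1.0
residuals = [0.5, 1.0, 5.0, 50.0]
log_psi = [log_influence(r, scale) for r in residuals]
mse_psi = [mse_influence(r) for r in residuals]
```

For the logarithmic loss the influence peaks at r = s with value 1/s and then decays toward zero, so an extreme outlier contributes almost no gradient, whereas under MSE its gradient grows linearly with the residual.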

2606.16048 2026-06-16 cs.CV New submission

Trusting Right Predictions for Wrong Reasons: A LIME Based Analysis of Deep Learning Interpretability in Lung Cancer Diagnosis

Samarpan Poudel, Vladislav D Veksler

Affiliations: Caldwell University School of Business and Computer Science

AI summary: Uses LIME to analyze the decision consistency of three deep learning models (CNN, ResNet50, ViT) for lung cancer CT classification, finding that predictions agree closely while explanation regions differ substantially, which indicates that prediction agreement cannot substitute for reasoning consistency.

Abstract

Lung cancer is the leading cause of cancer-related mortality, with approximately 2.5 million new cases and 1.8 million deaths annually, making reliable diagnosis a clinical priority. Although deep learning models have achieved strong performance in lung cancer classification, evaluation has largely focused on predictive accuracy, leaving their decision-making processes insufficiently examined. This study compares three architecturally distinct models: a Convolutional Neural Network (CNN), a pretrained ResNet50, and a Vision Transformer (ViT), trained on the IQ-OTH/NCCD lung cancer CT dataset. Local Interpretable Model-Agnostic Explanations (LIME) were applied to investigate model reasoning. In addition to standard performance metrics, a dual-correlation framework was introduced to measure both prediction agreement and explanation agreement across model pairs. All three models achieved strong classification performance, with ResNet50 attaining 98.61% accuracy, CNN 97.91%, and ViT 93.75%, while all achieved ROC-AUC scores of 0.99. Prediction correlations exceeded 0.99 across all model pairs, indicating highly consistent outputs. However, LIME explanation correlations remained below 0.26, revealing substantial differences in the image regions used to reach those predictions. Analysis of misclassified samples further identified a consistent spatial pattern: incorrect predictions were associated with attention outside the lung parenchyma, whereas correct predictions focused primarily within lung regions. These findings demonstrate that prediction agreement is a poor proxy for reasoning consistency, and that interpretability evaluation must be treated as an independent validation criterion alongside predictive performance in clinical AI systems.
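
The dual-correlation idea can be sketched as two correlations computed for the same model pair: one over predicted probabilities and one over explanation weights. The Python example below assumes Pearson correlation and uses made-up numbers; the abstract does not state which correlation measure the authors used, and the values here are not from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical per-image malignancy probabilities from two models.
probs_a = [0.95, 0.10, 0.88, 0.05, 0.97, 0.12]
probs_b = [0.93, 0.08, 0.90, 0.07, 0.96, 0.15]

# Hypothetical flattened LIME weights over the same superpixels of one image.
lime_a = [0.9, 0.1, 0.0, 0.2, 0.0, 0.1]
lime_b = [0.0, 0.2, 0.8, 0.1, 0.1, 0.0]

prediction_agreement = pearson(probs_a, probs_b)
explanation_agreement = pearson(lime_a, lime_b)
```

In this toy case the two models agree almost perfectly on what they predict while attending to different regions, which mirrors the pattern the study reports.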

2606.16034 2026-06-16 cs.LG New submission

Inference-Time Decision Calibration for Temporal Classification

Arthur Chagas, Arthur Buzelin, Yan Aquino, Pedro Bento, Gisele L. Pappa, Wagner Meira, Cristiano Arbex Valle

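Affiliations: Department of Computer Science (DCC), Universidade Federal de Minas Gerais (UFMG)

AI summary: Proposes decomposing temporal classification errors into representation errors and decision errors. By freezing the native classifier and adding a residual multi-scale branch and a post-hoc branch-aware calibrator, the method distinguishes missing temporal evidence from underused decision-level evidence without retraining the backbone.

Abstract (translated from Chinese)

Temporal classification errors are often treated as representation failures, but they can also arise from how available evidence is turned into decisions. This paper proposes a representation-calibration decomposition for temporal classification. We freeze a trained native classifier and separate two inference-time interventions: a conservative residual multi-scale branch that adds auxiliary logits to the native predictions, and a post-hoc branch-aware calibrator that recombines native and residual evidence at decision time. This design distinguishes missing temporal evidence from underused decision-level evidence without retraining the backbone. On FI-2010, PTB-XL, UCI-HAR, MHEALTH, and HARTH, we find that gains depend strongly on the setting. Residual multi-scale evidence is most useful in noisy or representation-limited settings, especially short-horizon FI-2010 and weaker recurrent backbones, whereas branch-aware calibration helps when native and auxiliary logits contain complementary evidence that the original decision rule underuses. In near-saturated settings, both interventions yield limited gains. These results suggest that temporal classification should be understood not only as representation learning but also as a problem of trusting, combining, and calibrating evidence from multiple views.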
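
A minimal Python sketch of the recombination step, under stated assumptions: the native logits stay frozen, the residual branch supplies auxiliary logits, and a calibrator chooses how much of the residual to trust. Here the calibrator is a single scalar weight picked by grid search on validation data. That is a simplification introduced for illustration; the branch-aware calibrator in the paper is not specified in the abstract, and all names and numbers below are hypothetical.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def combine(native, residual, alpha):
    """Add scaled residual logits to frozen native logits."""
    return [n + alpha * r for n, r in zip(native, residual)]


def fit_alpha(native_logits, residual_logits, labels, grid):
    """Pick the mixing weight with the best validation accuracy (ties keep the earlier grid value)."""
    def accuracy(alpha):
        hits = sum(
            argmax(combine(n, r, alpha)) == y
            for n, r, y in zip(native_logits, residual_logits, labels)
        )
        return hits / len(labels)
    return max(grid, key=accuracy)


# Hypothetical validation logits for a 2-class problem.
native = [[2.0, 0.0], [0.4, 0.6], [0.9, 1.0], [0.0, 3.0]]
residual = [[0.1, 0.0], [0.5, 0.0], [0.6, 0.0], [0.0, 0.2]]
labels = [0, 0, 0, 1]
alpha = fit_alpha(native, residual, labels, grid=[0.0, 0.5, 1.0])
```

Because alpha = 0 recovers the native classifier exactly, this kind of calibrator can only change decisions where the residual evidence helps on validation data, which matches the conservative design described in the abstract.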

IBAD: Interpretable Behavioral Anomaly Detection on Human Mobility Data

Bita Azarijoo, John Krumm, Cyrus Shahabi

Affiliations: University of Southern California

AI summary: Proposes the IBAD framework, which uses LDA to learn interpretable daily mobility templates and a hierarchical self-supervised model to detect individual behavioral anomalies, and validates the transferability and robustness of the templates on real-world and synthetic datasets.

Abstract

Human mobility appears highly diverse, yet much of a person's daily mobility can be explained by a small set of recurring behavioral templates, such as commuting, school-centered activities, caregiving, nightlife, or errand patterns. We present IBAD (Interpretable Behavioral Anomaly Detection), a framework that learns interpretable daily mobility templates and represents each individual as a distribution over mixtures of these templates. Rather than focusing on specific locations, IBAD characterizes activities that individuals perform across locations. This approach first discovers global behavioral templates using Latent Dirichlet Allocation (LDA), then employs a hierarchical self-supervised model to learn normal behavior of individuals from their soft behavioral templates. We also introduce a splicing benchmark that creates controlled behavioral mismatches between an individual's historical profile and injected mobility patterns. Experiments on real-world and synthetic datasets show that daily behavior can be effectively decomposed into a small number of interpretable templates. Crucially, we show that the learned behavioral archetypes transfer across distinct geographic and demographic contexts. Furthermore, IBAD maintains robust, competitive performance across all settings. For reproducibility purposes, the code is accessible at https://github.com/USC-InfoLab/IBAD.

RecourseBench: A Modular Framework for Reproducible Algorithmic Recourse Evaluation

Scaling Adaptive Depth with Norm-Agnostic Residual Networks

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

Auditing Machine Unlearning: A Systematic Research on Whether Models Truly Forget

SceneCraft: Interactive System for Image Editing via Scene Graph

Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas

Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

A Deployment Case Study in Robotic Apparel Automation: Digital Twin Integration, Interoperability, and Workforce Enablement

Phys-JEPA: Physics-Informed Latent World Models for Multivariate Time-Series Forecasting

AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

Stop the Sampler! Classifier-Based Adaptive Stopping for Sampling Kernels

Stepwise Token Selection for Efficient Multimodal Large Language Models

Auditing Reward Hackability in Code RL Training Environments

Mojo: A Promising Tool for Scalable Financial AI Efficiency

Beyond the Blood Draw: Explainable Machine Learning for Non-Invasive Dysglycemia Risk Screening

ALCL: An Adaptive Log-Correntropy Loss for Robust Learning under Non-Gaussian Noise

PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

Active Learning with Low-Rank Structure for Data Selection

Circuit Tracing in Autoregressive Protein Language Models

Leveraging Deep Learning for Object and Position Recognition of Load Carriers for Autonomous Logistics Vehicles

Trusting Right Predictions for Wrong Reasons: A LIME Based Analysis of Deep Learning Interpretability in Lung Cancer Diagnosis

Inference-Time Decision Calibration for Temporal Classification

The Third Challenge on Image Denoising at NTIRE 2026: Methods and Results

The Information-Theoretic Benefit of Shared Representations under Orthogonality Constraints

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

IBAD: Interpretable Behavioral Anomaly Detection on Human Mobility Data