arXivDaily: arXiv daily academic digest, updated Monday through Friday
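2606.16113 2026-06-16 cs.AI cs.LG New submission

RecourseBench: A Modular Framework for Reproducible Algorithmic Recourse Evaluation

Zahra Khotanlou, Hashir Ahmed, Chenghao Tan, Ahmed Abdelaal, Amir-Hossein Karimi

Affiliations: University of Waterloo

AI summary: Proposes RecourseBench, a framework built on three commitments (modularity, reproducibility, and interactivity) for unified evaluation of recourse methods. It integrates 28 methods and, according to the authors, is the first to enforce method-level reproducibility through automated quantitative tests.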

Abstract

Algorithmic recourse methods provide counterfactual explanations that inform individuals of the actions required to overturn an unfavorable model decision. Despite rapid methodological progress, principled comparison remains elusive; existing frameworks are often difficult to extend and lack both interoperability and systematic verification that integrated methods faithfully reproduce their originally reported results. We introduce RecourseBench, a unified evaluation framework built around three commitments: modularity, reproducibility, and interactivity. The framework decomposes the pipeline into five fully decoupled layers (Data, Preprocessing, Model, Recourse Method, and Evaluation) governed by abstract interfaces and a dynamic registry. To address the reproducibility gap in prior benchmarks, we introduce a four-tier classification system in which every integrated method is validated by an automated test suite against its originally reported results. We further provide an interactive web interface for flexible, configuration-driven comparison across methods, datasets, and model architectures. Our framework currently integrates 28 state-of-the-art recourse methods and, to our knowledge, constitutes the first recourse benchmark to explicitly enforce method-level reproducibility through automated, quantitative testing.
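
As a rough illustration of the "abstract interfaces and a dynamic registry" design mentioned in this abstract, the Python sketch below shows one way a recourse-method layer could be registered and then looked up by name from a configuration. Every name in it (`RecourseMethod`, `register`, `REGISTRY`, `NudgeToThreshold`, `toy_model`) is hypothetical and is not taken from RecourseBench itself.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class RecourseMethod(ABC):
    """Hypothetical abstract interface for the recourse-method layer."""

    @abstractmethod
    def generate(self, x: List[float], predict: Callable[[List[float]], float]) -> List[float]:
        """Return a counterfactual for input x under the given model."""


# Hypothetical dynamic registry: maps a string key to a method class.
REGISTRY: Dict[str, Type[RecourseMethod]] = {}


def register(name: str):
    """Class decorator that adds a method to the registry under `name`."""
    def wrap(cls: Type[RecourseMethod]) -> Type[RecourseMethod]:
        if name in REGISTRY:
            raise KeyError(f"duplicate method name: {name}")
        REGISTRY[name] = cls
        return cls
    return wrap


@register("nudge")
class NudgeToThreshold(RecourseMethod):
    """Toy method: increase the first feature until the score reaches 0.5."""

    def generate(self, x, predict, step=0.1, max_steps=1000):
        cf = list(x)
        for _ in range(max_steps):
            if predict(cf) >= 0.5:
                break
            cf[0] += step
        return cf


def toy_model(x: List[float]) -> float:
    """Toy scoring model used only for this illustration."""
    return min(1.0, max(0.0, 0.2 * x[0] + 0.1 * x[1]))


method = REGISTRY["nudge"]()  # configuration-driven lookup by name
counterfactual = method.generate([1.0, 1.0], toy_model)
```

Because the evaluation code only sees the abstract interface and a string key, adding a new method in this sketch means adding one decorated class, without touching the data, model, or evaluation layers.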

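2606.16112 2026-06-16 cs.LG cs.AI New submission

Scaling Adaptive Depth with Norm-Agnostic Residual Networks

Tomás Figliolia, Beren Millidge

Affiliations: Zyphra, San Francisco, CA

AI summary: Residual-stream norm growth with depth suppresses updates from deeper layers. The paper proposes NAG, a norm-agnostic residual architecture that separates magnitude from directional information to preserve each layer's contribution, and derives an interpretable adaptive depth-skipping mechanism that matches full-depth performance at equal compute.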

Abstract

Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.
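
The iso-FLOP argument in this abstract can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that per-token training cost scales linearly with the fraction of layers executed, ignoring routing overhead and any non-skippable components; that simplification is an assumption and is not stated in the abstract. The function name `token_multiplier` is hypothetical.

```python
def token_multiplier(skip_rate: float) -> float:
    """Extra-token factor at fixed compute, assuming cost per token ~ (1 - skip_rate)."""
    if not 0.0 <= skip_rate < 1.0:
        raise ValueError("skip_rate must be in [0, 1)")
    return 1.0 / (1.0 - skip_rate)


for rate in (0.20, 0.25):
    print(f"skip rate {rate:.0%}: about {token_multiplier(rate):.2f}x tokens at equal FLOPs")
```

Under this simplified model, the 20%-25% Mixture-of-Depths rates quoted in the abstract would free enough compute to train on roughly 1.25 to 1.33 times as many tokens.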

2606.16111 2026-06-16 cs.CL New submission

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

Junyi Li, Xiaowei Qian, Yingyi Zhang, Wenlin Zhang, Guojing Li, Sheng Zhang, Xiao Han, Yichao Wang, Xiangyu Zhao

Affiliations: University of Science and Technology of China

AI summary: Proposes ParetoPO, a framework that uses hypervolume-guided dynamic scalarization and Pareto-ranking-based advantage computation to optimize the accuracy-efficiency trade-off of tool-using language models under multiple objectives.

Comments: ICML 2026 Spotlight Paper

Abstract

Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematical reasoning and multi-hop QA tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.
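
To illustrate what "Pareto-ranking-based advantage computation" could look like, the sketch below ranks a group of rollouts by non-dominated sorting and turns the ranks into mean-centered advantages. The abstract does not give ParetoPO's actual formula, so the mapping from rank to advantage (`rank_advantages`) is an assumption made only for this example, and all function names are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and better on one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_ranks(points: List[Sequence[float]]) -> List[int]:
    """Non-dominated sorting: rank 0 is the Pareto front, rank 1 the next front, and so on."""
    ranks = [-1] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def rank_advantages(ranks: List[int]) -> List[float]:
    """Illustrative advantage: mean-centered negative rank (an assumption, not the paper's formula)."""
    scores = [-float(r) for r in ranks]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Each rollout: (accuracy reward, negative number of tool calls), both to be maximized.
rollouts = [(1.0, -2.0), (1.0, -5.0), (0.0, -1.0), (0.0, -4.0)]
ranks = pareto_ranks(rollouts)
advantages = rank_advantages(ranks)
```

In this toy group, the correct rollout with two tool calls and the incorrect rollout with one tool call are both nondominated, so they share rank 0 and receive the same positive advantage; a scalarized reward would instead have to pick a fixed exchange rate between accuracy and tool calls.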

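2606.16110 2026-06-16 cs.LG New submission

Auditing Machine Unlearning: A Systematic Research on Whether Models Truly Forget

Dayong Ye, Tianqing Zhu, Ruiding Huang, Xinbo Fu, Jiayang Li, Bo Liu, Huan Huo, Wanlei Zhou

Affiliations: University of Technology Sydney; Deakin University; Macquarie University

AI summary: Motivated by privacy regulation, proposes the first practical, general-purpose auditing framework, built on the concept of proof of ignorance, to verify whether existing unlearning algorithms truly erase the influence of designated data. Experiments show that retraining-based and fine-tuning-based methods succeed, while de-optimization-based and Fisher/Hessian-based methods fail.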

Abstract

Machine unlearning has been extensively studied in response to growing privacy concerns and regulatory requirements. However, auditing whether unlearning algorithms have truly erased the influence of specific data remains an open challenge. The lack of reliable and practical auditing mechanisms can lead to critical privacy risks, such as residual information leakage. This paper initiates a systematic investigation into whether existing unlearning algorithms can truly forget the designated data. We propose the first practical and general-purpose auditing framework for machine unlearning, inspired by the concept of proof of ignorance. Our framework addresses the key practicality limitations of existing methods by eliminating the need for retraining-from-scratch baselines, avoiding the training of large numbers of shadow models, and requiring no intrusive intervention in the original training process. To evaluate the effectiveness of our framework, we first conduct validation experiments to verify its soundness and completeness. We then perform comprehensive experiments across six datasets and ten representative unlearning methods. The results demonstrate that our framework reliably distinguishes between successful and failed unlearning. In particular, we observe that retraining-based and fine-tuning-based methods can achieve effective unlearning, even when the target data remain in the original dataset. In contrast, de-optimization-based methods fail to achieve true unlearning and instead degrade the model's performance. Fisher/Hessian-based methods also fail to unlearn the requested data, even when formal certification is provided. Moreover, we show that our framework is robust against fake unlearning attempts and generalizes well to large language models.

2606.16103 2026-06-16 cs.CV New submission

SceneCraft: Interactive System for Image Editing via Scene Graph

Duc-Manh Phan, Ngoc-Dai Tran, Duy-Khang Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Affiliations: University of Science, Ho Chi Minh, Vietnam; Vietnam National University, Ho Chi Minh, Vietnam; University of Dayton, Dayton, Ohio, USA

AI summary: Proposes SceneCraft, a framework that represents an image as a scene graph. Users manipulate the graph structure directly to perform complex edits, and precise prompts are generated automatically, reducing linguistic ambiguity and improving editing quality and user control.

Abstract

Recent advances in generative AI have enabled natural language-driven image editing, yet existing systems often fail in complex scenes with multiple interacting objects because they rely heavily on users crafting precise text prompts. To address the absence of structured control, we propose SceneCraft, a novel interactive framework that bridges user intent and model execution by representing images as editable scene graphs. Instead of guessing text prompts through trial and error, users interact directly with a visual graph to perform complex spatial and relational operations. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate as higher in quality and fidelity.

2606.16093 2026-06-16 cs.CL cs.AI New submission

Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

Kuzey Torlak, Hüseyin Arda Arslan, Anıl Dervişoğlu, Beyza Nur Deniz, Onur Boyar

Affiliations: Kadıköy Anadolu High School; Politecnico di Torino; Istanbul Technical University; Boğaziçi University; IBM Research - Tokyo

AI summary: Proposes the Parallel Hybrid Architecture (PHA), which fuses GSS, GQA, and FFN branches through a learnable mixing mechanism, achieving Transformer-level perplexity with higher efficiency in long-context modeling.

Comments: 16 pages, 9 tables, 4 figures

Abstract

Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while State Space Models (SSMs) scale linearly ($O(N)$) but suffer from a selective recall bottleneck, struggling to retrieve precise information from compressed states. This creates a fundamental tradeoff between efficiency and perplexity. To tackle these challenges, we propose the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Instead of forcing SSMs to approximate attention or serializing the two paradigms, PHA allows each branch to specialize: GSS captures global context, attention performs selective retrieval, and the FFN provides complementary processing. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming Hedgehog (16.70) and H3-125M (23.70). Scaling to 180M parameters yields 16.42 PPL, comparable to the pure attention baseline, while delivering 24% higher throughput and up to 40% lower memory usage at long contexts. On OpenWebText, our 125M model achieves 19.72 PPL, outperforming standard Transformers (20.60) and GSS hybrid baselines (19.80). These results demonstrate that separating sequence modeling paradigms into parallel specialists enables Transformer-level perplexity with substantially improved efficiency for long-context language modeling.
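
The idea of running branches in parallel and fusing them with learnable mixing can be sketched in a few lines. The abstract does not specify the form of the mixing mechanism, so the softmax-weighted sum below is an assumption, and the branch outputs are plain stand-in vectors, not real GSS, GQA, or FFN computations. The names `softmax`, `mix_branches`, and `mix_logits` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_branches(branch_outputs: List[List[float]], mix_logits: List[float]) -> List[float]:
    """Fuse parallel branch outputs with softmax-normalized mixing weights.

    The softmax-weighted sum is an assumed form of "learnable mixing";
    in a real model mix_logits would be trained parameters.
    """
    weights = softmax(mix_logits)
    dim = len(branch_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, branch_outputs)) for d in range(dim)]


# Stand-ins for the GSS, GQA, and FFN branch outputs on one token (hidden size 2).
gss_out, gqa_out, ffn_out = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
fused = mix_branches([gss_out, gqa_out, ffn_out], mix_logits=[0.0, 0.0, 0.0])
```

With all logits equal, each branch contributes one third of the fused vector; training would shift the weights toward whichever branch is most useful, which is one way the specialization described in the abstract could emerge.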

2606.16092 2026-06-16 cs.CV cs.AI New submission

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, Stanley Jungkyu Choi

Affiliations: LG AI Research

AI summary: Proposes the VinQA dataset and two encoding methods (Page Encoding and Modality Encoding) for generating long-form answers interleaved with cited visual elements. Using the M-GroSE evaluation framework and fine-tuned Qwen2.5-VL models, the work substantially narrows the performance gap to proprietary models.

Comments: Accepted to CVPR 2026. Main paper: 5 figures, 4 tables; includes supplementary material

Abstract

Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.
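
The abstract names Visual Source F1 as a measure of visual citation accuracy but does not define it. A plausible reading, sketched below, is a set-based F1 between the visual elements an answer cites and the reference set; this definition, the function name `visual_source_f1`, and the element IDs are assumptions for illustration only.

```python
from typing import Iterable


def visual_source_f1(cited: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between cited and reference visual-element IDs (assumed definition)."""
    cited_set, gold_set = set(cited), set(gold)
    if not cited_set and not gold_set:
        return 1.0  # nothing to cite and nothing cited
    true_pos = len(cited_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(cited_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


score = visual_source_f1(cited=["fig_2", "table_1", "chart_4"], gold=["fig_2", "table_1"])
```

Here the answer cites both reference elements plus one extra, so recall is 1, precision is 2/3, and the F1 is 0.8; the metric penalizes over-citation as well as missed citations.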

2606.16084 2026-06-16 cs.AI cs.CL New submission

Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas

Mudit Sinha, Sanika Chavan

Affiliations: Independent Researchers

AI summary: Using 1,483 sperm whale codas, applies computational-linguistic methods to test for duality of patterning. The lower tier is built from click rhythm and is rhythmic rather than segmental, while the upper tier shows sequential dependence.

Comments: 22 pages, 2 figures, 4 tables. Preprint

Abstract

Human language has often been described as combining structure at two levels: lower-level units combine into larger units, which then combine into larger sequences. We test for this design feature, duality of patterning, in sperm whale codas using 1,483 codas from the Dominica Sperm Whale Project. Because acoustic similarity can imitate symbolic structure, we treat the problem as computational-linguistic structure discovery from continuous audio rather than as a direct claim about language or meaning. We use a consensus of frozen audio encoders, held-out structural tests, per-statistic nulls, and acoustic-null recoverability gates. The evidence supports a narrow two-tier architecture. At the lower tier, clicks compose into codas not by a stable ordered rule, but by which clicks are present together with their inter-click rhythm. At the upper tier, coda tokens show bout-level sequential dependence, with an NSB second-order transfer-entropy lift of 0.132 bits (p = 0.002). Under tempo scaling, encoder-derived click identity is strongly rate-bound, while coda identity remains substantially more stable, yielding a measurable abstraction gradient across the click-to-coda step. Rhythm-only baselines recover substantial lower-tier structure but fail to reproduce the upper-tier sequential-dependence signal. We do not claim language, semantics, perception, or human-like phonemes. Instead, we report representation-level evidence for a duality-of-patterning-like architecture whose lower tier is rhythmic rather than segmental, and provide a portable null-controlled framework for testing combinatorial structure in induced acoustic token systems.

2606.16082 2026-06-16 cs.CV cs.AI New submission

Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Guanyi Qin, Junjie Zhang, Chunming He, Yibing Fu, Jie Liang, Tianhe Wu, Lei Zhang

Affiliations: National University of Singapore; OPPO Research Institute; Nanyang Technological University; Duke University; City University of Hong Kong; The Hong Kong Polytechnic University

AI summary: Proposes Tool-IQA, which equips vision-language models with simple tools such as a Magnifier and a Gamma Corrector, turning passive scoring into a tool-augmented workflow and significantly improving image quality assessment performance.

Abstract

Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may limit visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to improve visibility and reveal hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification that yields a calibrated quality score. Furthermore, to ensure efficient and purposeful tool calls, we introduce a batch-aware training strategy that rewards tool interactions that yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.
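
The two view tools named in this abstract correspond to standard image operations. The sketch below shows a minimal version of each on a grayscale image stored as nested lists: gamma correction as a per-pixel power law, and magnification as a crop followed by nearest-neighbor upscaling. The function names, signatures, and the pure-Python image representation are assumptions for illustration; the paper's actual tool interfaces are not described in the abstract.

```python
from typing import List

Image = List[List[float]]  # grayscale image with intensities in [0, 1]


def gamma_correct(image: Image, gamma: float) -> Image:
    """Apply out = in ** gamma per pixel; gamma < 1 brightens dark regions."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return [[pixel ** gamma for pixel in row] for row in image]


def magnify(image: Image, top: int, left: int, height: int, width: int, scale: int = 2) -> Image:
    """Crop a region and upscale it by nearest-neighbor pixel repetition."""
    crop = [row[left:left + width] for row in image[top:top + height]]
    out: Image = []
    for row in crop:
        wide = [pixel for pixel in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


dark = [[0.04, 0.09], [0.16, 0.25]]
brightened = gamma_correct(dark, gamma=0.5)   # square root of each pixel
zoomed = magnify(dark, top=0, left=0, height=1, width=2, scale=2)
```

With `gamma=0.5` the darkest pixel rises from 0.04 to about 0.2, which is the kind of change that can expose noise or banding hidden in shadows, while `magnify` lets a model look at a small region at a larger effective resolution.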

2606.16078 2026-06-16 cs.RO New submission

A Deployment Case Study in Robotic Apparel Automation: Digital Twin Integration, Interoperability, and Workforce Enablement

Gokul Narayanan, Abhiroop Ajith, Jonathan Zornow, Carlos Calle, Auralis Herrero Lugo, Jose Luis Susa Rincon, Chengtao Wen, Eugen Solowjow

Affiliations: Siemens Corporation; Sewbo; Levi's; Bluewater Defense

AI summary: Fabric deformability makes robotic manipulation difficult. Through a denim manufacturing case study, the paper presents a robotic sewing system that integrates a digital thread, a digital twin, an interoperability layer, and runtime monitoring, enabling rapid deployment and improved robustness.

Comments: 4 pages, 3 figures, IEEE ICRA 2026 Workshop Paper

Abstract

Despite steady advances in flexible automation in sectors such as electronics and automotive manufacturing, apparel automation remains challenging because fabrics are deformable and difficult to manipulate with robots. This paper presents a deployment-oriented case study of a robotic sewing system for denim manufacturing, emphasizing the system-level integration required for practical adoption. At the engineering level, a digital thread module parses DXF production drawings into process parameters and executable robot trajectories, reducing manual programming effort and enabling rapid re-targeting across sewing operations. In parallel, a digital twin of the workcell is used during pre-deployment to validate reach and clearance, refine layout and sequencing, evaluate operator access, and assess cycle-time compatibility with upstream and downstream tasks, thereby reducing commissioning risk. At deployment, the system integrates a collaborative robot with conventional sewing equipment, welding, suction fixtures, and machine-level controllers through an interoperability layer. Runtime monitoring and verification, including seam monitoring, collision checking, and trajectory-level validation, improve robustness under environmental variability, while operator-facing training and guidance tools support setup, troubleshooting, and technology adoption. Two staged factory deployments on denim shorts, covering 2D pocket operations and 3D garment-shaping seams, show that digital-twin-based validation, digital-thread-driven task generation, interoperability, runtime verification, and operator training are important for scaling robotic apparel automation.

2606.16076 2026-06-16 cs.LG cs.AI cs.GT New submission

Phys-JEPA: Physics-Informed Latent World Models for Multivariate Time-Series Forecasting

Weizhi Nie, Weichao Liu, Honglin Guo, Yuting Su

Affiliations: Tianjin University

AI summary: Proposes the Phys-JEPA architecture, which imposes physical-consistency constraints on latent states and state transitions and decomposes predictive states into physical and residual components, improving forecasting accuracy on climate, traffic, and electricity datasets.

Comments: Submitted to arXiv as a preliminary manuscript. 10 figures

Abstract

Multivariate forecasting in physical systems requires models that predict coupled temporal variables while preserving meaningful state evolution. Deep forecasters can fit temporal correlations, and physics-informed models can regularize predictions with scientific constraints, but these directions are often connected only at the decoded-output level. As a result, the hidden predictive state that generates future trajectories may remain statistically useful but physically unstructured. We introduce Phys-JEPA, a physics-informed joint-embedding predictive architecture for multivariate time-series forecasting. Phys-JEPA learns a latent world model in which predictive states are decomposed into physical and residual components, and physical consistency is imposed directly on latent states and latent transitions rather than only on decoded forecasts. This formulation uses known physical variables to organize the representation space while retaining residual capacity for unresolved dynamics. On Jena Climate 2009-2016, Phys-JEPA reduces aggregate MSE from 0.12482 to 0.12273 and temperature MSE from 0.01892 to 0.01831 at H=24. On Traffic, full Phys-JEPA improves aggregate MSE over the supervised baseline across all tested horizons, reducing H=192 MSE from 0.800784 to 0.773873. On Electricity, the best variant depends on horizon: static latent consistency is strongest at H=24 and H=48, while full Phys-JEPA gives the best aggregate and target-variable MSE at H=192. These initial results suggest that moving physics-informed learning from output space to latent predictive state space is a promising direction for interpretable temporal world models.
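
To put the reported MSE figures on a common scale, the short sketch below converts each before/after pair quoted in this abstract into a relative reduction. The input numbers are copied from the abstract; the helper name `relative_reduction` and the dictionary labels are introduced here only for illustration.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline


reported = {
    "Jena Climate aggregate, H=24": (0.12482, 0.12273),
    "Jena Climate temperature, H=24": (0.01892, 0.01831),
    "Traffic aggregate, H=192": (0.800784, 0.773873),
}
for name, (baseline, improved) in reported.items():
    print(f"{name}: {relative_reduction(baseline, improved):.2f}% lower MSE")
```

The three pairs work out to relative reductions of roughly 1.7%, 3.2%, and 3.4%, which is consistent with the abstract's own framing of these as initial results.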

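2606.16075 2026-06-16 cs.LG cs.CV New submission

AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

Yang Shi, Songwen Pei, Yang Gao, Bingxue Zhang

Affiliations: University of Shanghai for Science and Technology; Fudan University

AI summary: Addressing value allocation in multi-stage generative AI collaboration, proposes the AME framework, which integrates heterogeneous data contribution valuation, data rights mapping, and trustworthy execution to achieve low-cost value allocation consistent with human judgments.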

Abstract

Generative AI enables value creation through multi-stage collaboration among heterogeneous contributors, including training data, base models, fine-tuning behaviors, and prompts. However, how to allocate data value fairly remains largely unexplored. This paper formulates multi-stage generative AI value allocation as a new research problem and identifies three core challenges: heterogeneous data contribution valuation, data rights mapping, and trustworthy execution. We propose the AME (Attribution-Mapping-Execution) framework, a unified framework that integrates data contribution valuation, data rights mapping, and trustworthy execution into a single workflow. Experimental results demonstrate that the AME framework achieves data value allocation outcomes more consistent with human reference judgments while maintaining low-cost trustworthy execution. Our work provides an initial foundation for value assessment and revenue allocation in generative AI data markets.

2606.16074 2026-06-16 cs.CL cs.AI New submission

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Affiliations: Yale School of Medicine; Yale School of Public Health; Texas State University

AI summary: Proposes PVminerLLM2, which uses preference optimization together with a token-level gated stabilization term and confusion-aware preference pair construction to address fine-grained errors that supervised fine-tuning handles poorly, outperforming baseline models on structured patient voice extraction.

Abstract

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2
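
Inverse-frequency reweighting, mentioned in this abstract as a remedy for class skew, can be sketched generically as follows. This is a textbook form of the technique and not PVminerLLM2's implementation: the normalization (average per-example weight of 1), the function name `inverse_frequency_weights`, and the label names are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by 1 / count, rescaled so the average per-example weight is 1."""
    counts = Counter(labels)
    raw = {label: 1.0 / count for label, count in counts.items()}
    mean_weight = sum(raw[label] for label in labels) / len(labels)
    return {label: weight / mean_weight for label, weight in raw.items()}


# Hypothetical label distribution with a 6:3:1 skew.
labels = ["symptom"] * 6 + ["social_context"] * 3 + ["care_engagement"] * 1
weights = inverse_frequency_weights(labels)
```

With this normalization every class contributes the same total weight to the loss, so the single rare example counts as much as the six common ones combined, while the overall loss scale stays unchanged.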

2606.16073 2026-06-16 cs.LG stat.ML 新提交

Stop the Sampler! Classifier-Based Adaptive Stopping for Sampling Kernels

停止采样器!基于分类器的采样核自适应停止

Kirill Korolev, Nikita Morozov, Stepan Pavlenko, Esmeralda S. Whitammer, Sergey Samsonov

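Affiliations * Stanford University

AI summary Treats MCMC trajectory termination as a learnable component, using non-acyclic generative flow networks (GFlowNets) to train state-dependent classifiers that adaptively stop sampling under detailed balance conditions, substantially shortening trajectories while improving mode coverage and mixing.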

Comments ICML 2026 SPIGM Workshop

AI-generated Chinese abstract

从复杂、未归一化的概率密度中采样是贝叶斯推断和概率建模中的基本挑战。虽然马尔可夫链蒙特卡罗(MCMC)方法提供了渐近保证,但由于固定或手动调整的轨迹长度,它们常常遭受慢混合和高计算成本。在这项工作中,我们提出了一种新颖的框架,将轨迹终止视为采样动力学的可学习组件。通过将MCMC置于非循环生成流网络(GFlowNets)的理论中,我们训练状态依赖的神经分类器来决定轨迹何时到达高密度区域并应终止。我们通过详细平衡条件从理论上建立了最优分类器与目标密度之间的联系,并引入了一种多级训练方案以促进复杂几何中的探索。在各种基准密度上的实验结果表明,与标准MCMC基线相比,我们的方法显著减少了平均轨迹长度,同时改善了模式覆盖和混合。

English abstract

Sampling from complex, unnormalized probability densities is a fundamental challenge in Bayesian inference and probabilistic modeling. While Markov chain Monte Carlo (MCMC) methods provide asymptotic guarantees, they often suffer from slow mixing and high computational costs due to fixed or manually tuned trajectory lengths. In this work, we propose a novel framework that treats trajectory termination as a learnable component of the sampling dynamics. By framing MCMC within the theory of non-acyclic generative flow networks (GFlowNets), we train state-dependent neural classifiers to decide when a trajectory has reached a high-density region and should terminate. We theoretically establish the connection between optimal classifiers and the target density via detailed balance conditions and introduce a multilevel training scheme to facilitate exploration in complex geometries. Experimental results across various benchmark densities demonstrate that our approach significantly reduces average trajectory lengths while improving mode coverage and mixing compared to standard MCMC baselines.

2606.16067 2026-06-16 cs.CV New submission

Stepwise Token Selection for Efficient Multimodal Large Language Models

逐步令牌选择用于高效多模态大语言模型

Landi He, Shawn Young, Lijian Xu

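Affiliations * Shenzhen University of Advanced Technology

AI summary Proposes a pointer-based stepwise visual token selection method that is trained end to end through a differentiable relaxation and dynamically decides how many tokens to keep, preserving 94.6% of the original accuracy with a 1.88x prefill speed-up when 88.9% of visual tokens are removed.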

AI-generated Chinese abstract

在多模态大语言模型(MLLMs)中,推理成本主要由视觉令牌前缀而非语言骨干网络决定,因此令牌减少成为提高效率的关键因素。现有方法通常为视觉令牌分配独立的的重要性分数,并保留固定数量的排名靠前的令牌,这隐含地假设令牌独立且输入间压缩比均匀。在这项工作中,我们将视觉令牌剪枝重新表述为序列决策过程。具体来说,我们引入了一种指针式的选择机制,该机制迭代地选择信息丰富的令牌,每次决策都基于先前选择的令牌,并通过学习到的终止动作动态决定何时停止。这使得所选子集及其大小能够联合优化。为了实现标准语言建模目标下的端到端训练,我们设计了一种基于方差保持噪声插值方案的可微松弛,允许梯度通过离散选择过程传播。在LLaVA-v1.5-7B和Qwen2.5-VL-7B上的大量实验表明,我们的方法在不同压缩水平下始终优于固定比例基线。在去除88.9%视觉令牌的激进剪枝下,我们的方法保持了94.6%的原始准确率,同时实现了1.88倍的预填充延迟加速。

English abstract

In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.
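The sequential view of pruning described above can be sketched as a greedy loop in which each choice is conditioned on the tokens already selected and a stop action ends the selection. This is a toy stand-in: the paper uses a learned pointer mechanism with a differentiable relaxation, whereas the hand-written gain (relevance minus redundancy), the fixed `stop_score`, and all input values below are assumptions made for illustration.

```python
def stepwise_select(relevance, similarity, stop_score):
    """Pick token indices one at a time until no candidate beats stop_score.

    relevance[i]     : hypothetical standalone usefulness of token i
    similarity[i][j] : hypothetical redundancy between tokens i and j
    stop_score       : stands in for the score of a learned termination action
    """
    selected = []
    remaining = set(range(len(relevance)))
    while remaining:
        def gain(i):
            # Condition on earlier picks: penalize overlap with selected tokens.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return relevance[i] - redundancy

        best = max(sorted(remaining), key=gain)
        if gain(best) < stop_score:
            break  # the stop action wins, so the subset size is decided here
        selected.append(best)
        remaining.remove(best)
    return selected

# Tokens 0 and 1 are near-duplicates; token 2 is weaker but complementary.
relevance = [0.9, 0.85, 0.2]
similarity = [[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.1],
              [0.0, 0.1, 1.0]]
kept = stepwise_select(relevance, similarity, stop_score=0.15)
```

An independent top-2 ranking would keep the redundant pair 0 and 1. The sequential rule keeps 0 and 2 and then stops, so both the subset and its size come out of the same procedure.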

2606.16062 2026-06-16 cs.AI cs.LG New submission

Auditing Reward Hackability in Code RL Training Environments

审计代码强化学习训练环境中的奖励可破解性

Shreshth Rajan

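Affiliations * GitHub

AI summary Measures the rate at which code RL environments accept incorrect solutions, finds that 28.5% of tasks in a 49-task sample of SWE-bench Verified have weak test suites, and proposes hardening the broken tasks with an LLM judge and a Docker gold-sanity gate.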

AI-generated Chinese abstract

我们测量了代码强化学习环境将错误解决方案视为正确的比率。在SWE-bench Verified的49个任务样本中,28.5%的任务测试套件足够薄弱,以至于Docker验证的错误补丁能通过它们。在6个代码库的20个R2E-Gym任务上,相同的单次利用生成管道产生25.0%的成功率。对SWE-bench Verified上134个前沿模型提交的随机效应荟萃分析发现,在相同人工评定的难度层级内,模型Pass@1在标记为可破解的任务上比稳健任务高14.14个百分点(95%置信区间[+11.80, +16.48];单侧p < 10^-6;I^2 = 0%;134个模型中有123个为正)。然后我们描述了一个加固被破坏任务的流程。一个内联LLM判断器配合Docker金标准门控,在咨询判断器之前对每个生成的测试针对金标准解决方案运行。在审计中的11个被破坏任务上,门控标记出105个决定性的LLM生成测试中的65个在金标准补丁上失败,这是LLM判断器单独遗漏的61.9%的每次增强缺陷率。通过多样性偏置重试,该循环将11个任务中的9个收敛到门控升级。

English abstract

We measure the rate at which code RL environments accept incorrect solutions as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that a Docker-verified incorrect patch passes them. On 20 R2E-Gym tasks across 6 repositories, the same pipeline at single-shot exploit generation yields 25.0%. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified finds, within the same human-rated difficulty stratum, model Pass@1 is +14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models positive). We then describe a procedure for hardening the broken tasks. An inline LLM judge with a Docker gold-sanity gate runs each generated test against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, the gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate the LLM judge alone misses. With diversity-biased retry, the loop converges 9 of 11 tasks to a gated upgrade.
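The gold-sanity gate can be sketched as a filter that runs every generated test against the known-good solution and rejects any test that the gold solution fails. The sketch below runs in-process and omits Docker and the LLM judge entirely; the function names and the example tests are hypothetical, and only the 65-of-105 figure comes from the abstract.

```python
from typing import Callable, Dict, List

def gold_sanity_gate(tests: Dict[str, Callable[[Callable], bool]],
                     gold_solution: Callable) -> Dict[str, List[str]]:
    """Split generated tests into those the gold solution passes or fails."""
    kept, rejected = [], []
    for name, test in tests.items():
        try:
            ok = bool(test(gold_solution))
        except Exception:
            ok = False  # a test that crashes on the gold solution is defective
        (kept if ok else rejected).append(name)
    return {"kept": kept, "rejected": rejected}

def defect_rate(rejected: int, total: int) -> float:
    """Percentage of generated tests that fail on the gold solution."""
    return 100.0 * rejected / total

# Hypothetical task: the gold solution is absolute value.
generated_tests = {
    "negative_input": lambda f: f(-3) == 3,   # correct test
    "wrong_expectation": lambda f: f(-3) == -3,  # defective test
    "crashing_test": lambda f: f("x") == 1,   # raises TypeError on gold
}
result = gold_sanity_gate(generated_tests, abs)
reported_rate = defect_rate(65, 105)  # figures quoted in the abstract
```

Running the gate before consulting a judge means a defective test is discarded on objective grounds, rather than relying on the judge to notice that the test contradicts the reference patch.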

2606.16059 2026-06-16 cs.LG cs.AI New submission

Mojo: A Promising Tool for Scalable Financial AI Efficiency

Mojo:可扩展金融AI效率的有前景工具

Henry Han

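Affiliations * Data Science and Artificial Intelligence Innovation Laboratory, School of Engineering and Computer Science, Baylor University

AI summary Introduces the Mojo language, which uses MLIR compilation and deterministic kernel design to address the Python-to-C++ performance gap and numerical inconsistencies in quantitative finance, reporting 20x to 180x speedups over pure Python on directly measured financial AI kernels.

Comments 15 pages, 3 figures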

AI-generated Chinese abstract

三十年来,量化金融一直承受着高昂的双语言税:用Python研究的模型需重写为C++用于生产,常常引入数值差异。GPU加速深度学习加剧了这一问题,因为非确定性浮点归约可能在长回测中产生漂移,挑战监管可重复性和审计期望。本文调查了Mojo——Modular公司2026年推出的类Python系统语言,作为资本市场工程的结构性回应。在缩小Python到C++性能差距的同时,Mojo独特地结合了原生互操作性和构建位精确确定性内核所需的底层系统控制。其MLIR编译基础设施进一步允许单一代码库针对标量、SIMD、多核和GPU执行,减少了研究与生产之间的转换瓶颈。我们对四个核心金融AI工作负载进行了基准测试:蒙特卡洛期权定价、LLM情感推理、多资产回测和投资组合风险价值。在Apple Silicon上,Mojo在直接测量的内核上相比纯Python实现了20倍到180倍的加速;更大规模GPU工作负载的结果是根据已发表基准校准的预测。除了透明的性能数据,我们还介绍了mojo-deterministic,一个可重现归约内核的开源库,并对Mojo已解决和尚未解决的问题进行了坦诚评估。

English abstract

For thirty years, quantitative finance has paid a costly two-language tax: models researched in Python are rewritten in C++ for production, often introducing numerical discrepancies. GPU-accelerated deep learning exacerbates this problem, as nondeterministic floating-point reductions can produce drift in long backtests, challenging regulatory reproducibility and auditability expectations. This article surveys Mojo, Modular's 2026 Python-like systems language, as a structural response for capital markets engineering. While closing the Python-to-C++ performance gap, Mojo uniquely combines native interoperability with the low-level systems control required to construct bit-exact deterministic kernels. Its MLIR compilation infrastructure further allows a single codebase to target scalar, SIMD, multicore, and GPU execution, reducing the translation bottleneck between research and production. We benchmark four core financial AI workloads: Monte Carlo option pricing, LLM sentiment inference, multi-asset backtesting, and portfolio Value at Risk. On Apple Silicon, Mojo demonstrates 20x to 180x speedups over pure Python on directly measured kernels; larger-scale GPU workload results are projections calibrated from published benchmarks. Alongside transparent performance data, we introduce mojo-deterministic, an open-source library of reproducible reduction kernels, and provide a candid assessment of the problems Mojo does and does not yet solve.
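The concern about nondeterministic floating-point reductions comes from the fact that floating-point addition is not associative, so a parallel sum whose grouping varies between runs can give different bits. The sketch below shows the effect and one common remedy, a reduction with a fixed tree order. It is written in Python rather than Mojo, is not taken from the mojo-deterministic library, and `pairwise_sum` is a hypothetical name.

```python
def pairwise_sum(values):
    """Sum in a fixed binary-tree order that depends only on the input length."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return float(values[0])
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

# The same three numbers, grouped two different ways.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
grouping_matters = left != right

# A fixed reduction order gives the same result on every call.
returns = [0.1] * 10 + [1e16, -1e16]
first = pairwise_sum(returns)
second = pairwise_sum(list(returns))
```

Fixing the reduction tree does not make the result more accurate by itself. It makes the result repeatable, which is the property that audit and reproducibility requirements depend on.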

2606.16056 2026-06-16 cs.LG cs.HC New submission

Beyond the Blood Draw: Explainable Machine Learning for Non-Invasive Dysglycemia Risk Screening

超越抽血:用于非侵入性血糖异常风险筛查的可解释机器学习

Black Sun, Chenyi Zhang, Kaiyi Ji, Xi Lu

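Affiliations * Department of Computer Science, Aarhus University; University at Buffalo, SUNY

AI summary Trains six machine-learning models, including LightGBM, on NHANES data for laboratory-free dysglycemia risk screening; the best model reaches an AUC of 0.820, outperforming conventional risk scores, and age, race/ethnicity, and waist-to-height ratio emerge as the key predictors.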

AI-generated Chinese abstract

血糖异常,包括糖尿病前期和糖尿病,影响着全球大量成年人,但其中许多人仍未得到诊断。我们开发并验证了用于非侵入性血糖异常风险筛查的机器学习模型,这些模型无需实验室检测。汇集2017-2023年国家健康与营养调查(NHANES)数据(n=14,352),我们使用分层5折交叉验证训练了六种机器学习模型,并将其与两种既定的临床风险评分进行比较。LightGBM在受试者工作特征曲线下面积(AUC=0.820,95% CI:0.806-0.835)上表现最佳,优于芬兰糖尿病风险评分(0.745)和美国糖尿病协会风险测试(0.783)。SHAP分析确定年龄、种族/民族和腰高比是最有影响力的预测因素。亚组分析证实了在不同人口统计分层中的一致表现(AUC:0.735-0.832)。这些结果证明了在社区环境和自我跟踪健康应用中部署可解释、无需实验室的血糖异常筛查的可行性。

English abstract

Dysglycemia, encompassing both prediabetes and diabetes, affects huge numbers of adults worldwide, yet many of them remain undiagnosed. We developed and validated machine-learning (ML) models for non-invasive screening of dysglycemia risk that require no laboratory tests. Pooling data from the National Health and Nutrition Examination Survey (NHANES) 2017--2023 (n=14,352), we trained six ML models with stratified 5-fold cross-validation and compared them with two established clinical risk scores. LightGBM achieved the highest area under the receiver operating characteristic curve (AUC=0.820, 95% CI: 0.806--0.835), outperforming the Finnish Diabetes Risk Score (0.745) and American Diabetes Association Risk Test (0.783). SHAP analysis identified age, race/ethnicity, and waist-to-height ratio as the most influential predictors. Subgroup analyses confirmed consistent performance across demographic strata (AUC: 0.735--0.832). These results demonstrate the feasibility of explainable, laboratory-free dysglycemia screening for deployment in community settings and self-tracking health applications.
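The AUC values quoted above can be read as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. A minimal sketch of that rank-based definition follows; the labels and scores are made-up illustration data, not NHANES values, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical screening scores for two negatives and two positives.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. On this reading, an AUC of 0.820 means about 82% of such pairs are ranked correctly.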

2606.16050 2026-06-16 cs.LG cs.AI New submission

ALCL: An Adaptive Log-Correntropy Loss for Robust Learning under Non-Gaussian Noise

ALCL:一种用于非高斯噪声下鲁棒学习的自适应对数相关熵损失

Mainak Kundu, Ria Kanjilal, Ismail Uysal

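Affiliations * University of South Florida; California Polytechnic State University

AI summary Proposes the Adaptive Log-Correntropy Loss (ALCL), which jointly learns shape and scale parameters through differentiable reparameterization so that the loss geometry adapts to residual statistics and suppresses extreme outliers, outperforming MSE and fixed-kernel correntropy losses under mixed heavy-tailed and impulsive noise.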

AI-generated Chinese abstract

在重尾和脉冲噪声下的鲁棒深度学习仍然具有挑战性,因为均方误差(MSE)等传统损失对异常值表现出无界敏感性。尽管基于相关熵的目标函数提高了鲁棒性,但现有公式依赖于固定的核参数,这些参数必须凭经验调整且在训练期间保持不变。为了解决这些局限性,我们提出了一种自适应对数相关熵损失(ALCL),这是一种重尾损失公式,能够在优化过程中自适应地学习其鲁棒性几何结构。ALCL引入了一个对数残差模型,其形状和尺度参数通过可微重参数化与网络权重联合学习。这产生了一个原理性的最大似然公式,其影响函数形式上是有界且再下降的,使得损失几何能够动态适应不断变化的残差统计,同时抑制极端异常值。在四个广泛使用的基准数据集(涵盖灰度图像和红绿蓝(RGB)图像数据)上,在混合重尾和脉冲噪声下进行的比较实验表明,ALCL在重建保真度和下游分类准确性方面始终优于MSE和最优调整的广义相关熵损失。虽然在低噪声条件下性能差异仍然很小,但在高噪声条件下,ALCL在灰度基准上中位数准确率提高了高达4.75%,在RGB数据集上提高了4.51%,并且运行间方差减小。这些结果表明,通过联合学习损失参数实现的自适应鲁棒性为非高斯环境下深度学习中基于静态相关熵的损失提供了一种计算高效的替代方案。

English abstract

Robust deep learning under heavy-tailed and impulsive noise remains challenging because conventional losses such as mean squared error (MSE) exhibit unbounded sensitivity to outliers. Although correntropy-based objectives improve robustness, existing formulations rely on fixed kernel parameters that must be empirically tuned and remain static during training. To address these limitations, we propose an Adaptive Log-Correntropy Loss (ALCL), a heavy-tailed loss formulation that adaptively learns its robustness geometry during optimization. ALCL introduces a logarithmic residual model whose shape and scale parameters are learned jointly with network weights through differentiable reparameterization. This yields a principled maximum likelihood formulation whose influence function is formally bounded and redescending, allowing the loss geometry to adapt dynamically to evolving residual statistics while suppressing extreme outliers. Comparative experiments on four widely used benchmark datasets spanning grayscale and red-green-blue (RGB) image data under mixed heavy-tailed and impulsive noise demonstrate that ALCL consistently outperforms MSE and optimally tuned generalized correntropy losses in both reconstruction fidelity and downstream classification accuracy. While performance differences remain small under low-noise conditions, under high-noise regimes ALCL improves median accuracy by up to 4.75% on grayscale benchmarks and 4.51% on RGB datasets, with reduced variance across runs. These results demonstrate that adaptive robustness through joint learning of loss parameters provides a computationally efficient alternative to static correntropy-based losses for deep learning in non-Gaussian environments.

2606.16048 2026-06-16 cs.CV New submission

PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain

PointDiffusion: 点云领域的基于扩散的场景补全

Chidera Agbasiere, Mikhail Sannikov, Faith Ogunwoye, Erik Shaikhiev, Alex Kozinov, Ilya Mikhalchuk, Iana Zhura, Dzmitry Tsetserukou

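Affiliations * Intelligent Space Robotics Laboratory, Skolkovo Institute of Science and Technology

AI summary Proposes a multi-token Gaussian VAE and anchor-based ICP ground-truth refinement that enable single-step diffusion scene completion, reducing squared Chamfer distance by about 16x on SemanticKITTI and inference latency by 25-143x.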

AI-generated Chinese abstract

从稀疏LiDAR点云重建密集3D场景是自动驾驶中的基本挑战,其中潜在扩散模型提供了一种有前景的解决方案。然而,现有方法依赖于对象级自编码器,这些自编码器在室外尺度下会崩溃为不稳定的全局表示,并且受到由里程计漂移破坏的地面真值数据的影响,这系统地降低了监督质量。此外,多步扩散推理会带来难以承受的延迟,无法实时部署。我们提出了一种新颖的多令牌高斯VAE,具有交叉注意力池化,用于稳定的场景级LiDAR压缩,并结合基于锚点的ICP地面真值精化流水线,消除了训练监督中的漂移引入噪声。这些组件共同实现了一个无支架的单步扩散补全模型,在SemanticKITTI序列08上将平方倒角距离减少了约16倍(从0.396 m^2降至0.024 m^2),分别比LiDiff和ScoreLiDAR高出17-19%和10-11%,并且推理延迟降低了25-143倍。我们的结果表明,在此设置下,数据质量主导模型设计,多令牌潜在空间为基于潜在扩散的场景补全提供了稳定的第一阶段。

English abstract

Reconstructing dense 3D scenes from sparse LiDAR point clouds is a fundamental challenge in autonomous driving, where latent diffusion models offer a promising solution. However, existing approaches rely on object-level autoencoders that collapse into unstable global representations at outdoor scale and suffer from ground truth data corrupted by odometry drift that systematically degrades supervision quality. Furthermore, multi-step diffusion inference incurs prohibitive latency for real-time deployment. We propose a novel multi-token Gaussian VAE with cross-attention pooling for stable scene-scale LiDAR compression, combined with an anchor-based ICP ground truth refinement pipeline that eliminates drift-induced noise from training supervision. Together, these components enable a scaffold-free single-step diffusion completion model that achieves an approximately 16x reduction in squared Chamfer distance on SemanticKITTI seq. 08 (0.396 m^2 to 0.024 m^2), surpasses LiDiff and ScoreLiDAR by 17-19% and 10-11%, respectively, and operates at 25-143x lower inference latency. Our results demonstrate that data quality dominates model design in this regime and that multi-token latent spaces provide a stable first stage for latent diffusion-based scene completion.
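For readers unfamiliar with the metric, the squared Chamfer distance between two point clouds can be sketched as below. Conventions differ between papers (sum or mean over points, one or both directions); this sketch assumes the symmetric mean-of-nearest-squared-distances form, which may not match the paper's exact definition. The brute-force search and the tiny point sets are for illustration only.

```python
def squared_chamfer(cloud_a, cloud_b):
    """Symmetric squared Chamfer distance between two lists of 3D points."""
    def one_way(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        total = 0.0
        for s in src:
            total += min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
        return total / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)

# Two hypothetical clouds (units: metres) that differ in one point.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = squared_chamfer(predicted, reference)  # in m^2

# Ratio implied by the figures quoted in the abstract.
reduction = 0.396 / 0.024
```

The quoted figures give a ratio of 16.5, consistent with the "approximately 16x" reduction stated in the abstract.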

2606.16047 2026-06-16 cs.CL New submission

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

从论证组件到图:一种具有置信门控的多智能体辩论方法用于论证关系识别

Jakub Bąba, Jarosław A. Chudziak

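Affiliations * Faculty of Electronics and Information Technology, Warsaw University of Technology

AI summary Proposes a multi-agent debate framework with a confidence gating mechanism that debates only uncertain cases, achieving the highest Macro F1 among training-free methods on the UKP corpus and producing human-readable debate transcripts.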

Comments Accepted for publication in the proceedings of KES 2026

AI-generated Chinese abstract

大型语言模型(LLMs)凭借其强大的通用推理能力,在论证挖掘(AM)领域受到越来越多的评估和应用。然而,标准的无训练模型常常遗漏复杂细节,特别是在需要将文本的两个部分一起分析的上下文中。此外,自我纠正机制往往会强化推理中的初始幻觉。克服这些限制通常需要昂贵的、领域特定的监督微调。最近的研究表明,多智能体范式可以通过支持者-反对者-裁判架构的辩证改进来解决组件分类任务中的此类弱点,为该领域的无训练方法指明了有希望的方向。在本文中,我们将该框架扩展并评估于论证关系识别与分类(ARIC)任务,将其重新表述为组件对之间的辩论。此外,我们引入了一种置信门控机制,使得仅在不确定的情况下进行辩论,而在置信度高时接受初始预测。在UKP Argument Annotated Essays v2语料库上,我们证明了选择性辩论在所有无训练方法中取得了最高的Macro F1,而对所有样本进行辩论则使性能低于其中一个基线。所有生成方法在Macro F1上也优于微调的RoBERTa模型,这表明Attack类的代表性不足对监督微调的损害大于对仅推理模型的影响。此外,我们的框架生成人类可读的辩论记录,提供了单智能体和监督分类器所缺乏的可解释性。

English abstract

Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training-free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class was more damaging to supervised fine-tuning than to inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability absent from both single-agent and supervised classifiers.
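The confidence gating idea can be sketched as a thin wrapper: accept the initial prediction when its confidence is high, and fall back to the more expensive debate otherwise. The threshold value, the function names, and the stub classifier and debate below are hypothetical; the abstract does not specify how confidence is computed.

```python
def classify_with_gate(pair, initial_classifier, debate, threshold=0.8):
    """Return (label, route) for one pair of argument components.

    initial_classifier(pair) -> (label, confidence in [0, 1])
    debate(pair)             -> label from a Proponent-Opponent-Judge debate
    """
    label, confidence = initial_classifier(pair)
    if confidence >= threshold:
        return label, "initial"  # confident: skip the debate
    return debate(pair), "debate"  # uncertain: escalate

# Stubs standing in for LLM calls, for illustration only.
def stub_classifier(pair):
    return ("Support", 0.95) if "because" in pair[1] else ("None", 0.55)

def stub_debate(pair):
    return "Attack"

easy = classify_with_gate(("Claim A", "because of B"), stub_classifier, stub_debate)
hard = classify_with_gate(("Claim A", "however, C"), stub_classifier, stub_debate)
```

Routing only uncertain pairs to the debate limits cost and, according to the abstract, avoids the performance drop seen when every sample is debated.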

2606.16045 2026-06-16 cs.LG cs.DS New submission

Active Learning with Low-Rank Structure for Data Selection

基于低秩结构的数据选择主动学习

Vincent Cohen-Addad, Sasidhar Kunapuli, Vahab Mirrokni, Mahdi Nikdan, David P. Woodruff, Samson Zhou

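Affiliations * Google Research; University of California, Berkeley; Institute of Science and Technology Austria (ISTA); Carnegie Mellon University; Texas A&M University

AI summary Proposes a data selection framework based on low-rank approximation and residual-based sampling that, under mild regularity conditions, selects a weighted subset whose average loss approximates the full-dataset average loss within a $(1+\varepsilon)$ relative error plus an additive $\varepsilon \Phi_k$ term; experiments show gains over uniform sampling and clustering-based sensitivity sampling.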

Comments ICML 2026

AI-generated Chinese abstract

在数据选择问题中,目标是选择一个小型、有代表性的数据子集,用于高效训练机器学习模型。Sener 和 Savarese [ICLR 2018] 表明,给定数据的嵌入表示和合适的几何假设,基于 k-中心聚类的启发式方法可用于数据选择。Axiotis 等人 [ICML 2024] 进一步探索了这一视角,提出了基于 k-均值聚类和敏感性采样的数据选择方法。然而,这些方法依赖于数据集具有可通过聚类有效捕获的内在几何结构的假设,而许多现代数据集反而具有全局代数结构,通过低秩近似或主成分分析能更好地利用。在本文中,我们引入了一种基于低秩近似和残差采样的新数据选择框架,通过行子集选择和损失保持核心集构建的视角进行公式化。给定满足温和正则条件(可解释为 Lipschitz 连续性的代数或角度概念)的数据嵌入表示,我们证明可以选择一个加权子集,包含 $\tilde{O}\left(k + \frac{1}{\varepsilon^2}\right)$ 个数据点,其平均损失在全数据集平均损失的 $(1+\varepsilon)$ 相对误差内,附加一个加性项 $\varepsilon \Phi_k$,其中 $\Phi_k$ 表示嵌入矩阵的最优秩-$k$ 近似代价。我们通过实证评估补充了这些理论保证,表明在一系列真实世界数据集上,我们的数据选择方法比基于均匀采样或聚类敏感性采样的先前策略取得了更好的性能。

English abstract

In the data selection problem, the objective is to choose a small, representative subset of data that can be used to efficiently train a machine learning model. Sener and Savarese [ICLR 2018] showed that, given an embedding representation of the data and suitable geometric assumptions, heuristics based on $k$-center clustering can be used to perform data selection. This perspective was further explored by Axiotis et al. [ICML 2024], who proposed a data selection approach based on $k$-means clustering and sensitivity sampling. However, these methods rely on the assumption that the dataset exhibits intrinsic geometric structure that can be effectively captured by clustering, whereas many modern datasets instead possess global algebraic structure that is better exploited by low-rank approximation or principal component analysis. In this paper, we introduce a new data selection framework based on low-rank approximation and residual-based sampling, formulated through the lens of row subset selection and loss-preserving coreset construction. Given an embedding representation of the data satisfying mild regularity conditions, which can be interpreted as algebraic or angular notions of Lipschitz continuity, we show that it is possible to select a weighted subset of $\tilde{O}\left(k + \frac{1}{\varepsilon^2}\right)$ data points whose average loss approximates the average loss over the full dataset within a $(1+\varepsilon)$ relative error, up to an additive $\varepsilon \Phi_k$ term, where $\Phi_k$ denotes the optimal rank-$k$ approximation cost of the embedding matrix. We complement these theoretical guarantees with empirical evaluations, demonstrating that on a range of real-world datasets, our data selection approach achieves improved performance over prior strategies based on uniform sampling or clustering-based sensitivity sampling.

2606.16044 2026-06-16 cs.LG q-bio.QM New submission

Circuit Tracing in Autoregressive Protein Language Models

自回归蛋白质语言模型中的电路追踪

Darin Tsui, William Deinzer, Daniel Saeedi, Amirali Aghazadeh

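Affiliations * Stanford University

AI summary Proposes the ProGenMech framework, which uses cross-layer transcoders to faithfully recover the generative computation of ProGen3 and performs zero-shot discovery of sparse circuits tied to protein generation and fitness prediction, revealing biologically meaningful motifs.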

Comments Accepted into the Mechanistic Interpretability Workshop at ICML 2026. 24 pages, 14 figures

AI-generated Chinese abstract

蛋白质语言模型(pLMs)可以生成具有超越自然界观察到的特性的新型蛋白质序列,然而蛋白质生成背后的机制仍然知之甚少。现有的基于稀疏自编码器和跨层编码器的机械可解释性方法主要关注蛋白质表示学习模型,并未捕捉自回归生成所需的计算。在这里,我们引入了ProGenMech,一个用于生成式蛋白质语言模型的机械可解释性框架,它将跨层编码器(CLTs)扩展到ProGen3,一个为因果生成和跨度填充训练的稀疏专家混合模型。与逐层方法不同,CLTs使用来自所有前层的稀疏潜变量重建每一层,从而能够忠实地恢复层间生成计算。我们进一步开发了一个零样本电路发现框架,以识别负责蛋白质生成和适应性预测的稀疏潜电路。在因果生成和零样本适应性估计任务中,ProGenMech在恢复ProGen3的概率分布和功能评分行为方面优于局部跨层编码器基线,同时在跨度填充任务中匹配原始模型的生成分布。此外,恢复的电路揭示了与保守序列模式和蛋白质适应性景观相关的生物学上有意义的基序和功能区域,为可解释和可引导的蛋白质生成奠定了基础。

English abstract

Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3's probability distribution and functional scoring behavior, while matching the original model's generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.

2606.16042 2026-06-16 cs.RO cs.AI New submission

Leveraging Deep Learning for Object and Position Recognition of Load Carriers for Autonomous Logistics Vehicles

利用深度学习实现自主物流车辆对载具的物体与位置识别

Christoph Legat, Tobias Miller, Marco Riess

Affiliations * Research Group on Cognitive Autonomy & Predictive Intelligence, Technical University of Applied Sciences, Augsburg, Germany; Grenzebach Maschinenbau GmbH, Asbach-Bäumenheim, Germany

AI summary Proposes a deep learning framework in which a convolutional neural network identifies predefined landmarks on a load carrier from RGBD data and computes its pose, enabling autonomous logistics vehicles to detect and localize load carriers; experiments confirm reliability in industrial environments.

Comments 6 pages, 6 figures, IFAC World Congress 2026, © 2026 the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND

AI-generated Chinese abstract

本工作探索了在移动机器人中利用人工智能实现载具的自主检测和位姿估计,以便自动拾取。设计了一个深度神经网络,从RGBD数据中识别载具上的预定义地标;然后利用这些地标计算载具的位姿。该网络直接处理RGBD图像以估计地标位置,这些位置构成了确定载具位置的基础。该方法在大量实验中得到了验证,并包含软件和硬件实现。提出了一个基于深度学习的框架,用于检测载具并估计其位姿,以应用于自主物流车辆。我们的方法使用卷积神经网络从RGBD输入中识别载具上的特征参考点,并通过将这些推断出的地标与先验几何知识相结合来计算其位姿。实验表明,所得精度足以在工业环境中可靠地检测载具,证实了该方法适用于自主内部物流应用。

English abstract

This work explores the use of artificial intelligence in mobile robotics to achieve autonomous detection and pose estimation of load carriers for automated pickup. A deep neural network is designed to recognize predefined landmarks on the carrier from RGBD data; these landmarks are then used to compute the carrier's pose. The network operates directly on RGBD images to estimate landmark positions, which form the basis for determining the carrier's location. The approach is validated in extensive experiments and comprises both software and hardware implementations. A deep learning-based framework is presented to detect load carriers and estimate their pose for use with autonomous logistics vehicles. Our method uses a convolutional neural network to identify characteristic reference points on the carrier from RGBD input and computes its pose by combining these inferred landmarks with prior geometric knowledge. Experiments show that the resulting accuracy is sufficient for reliable load carrier detection in industrial environments, confirming the suitability of the method for autonomous intralogistics applications.

2606.16036 2026-06-16 cs.CV New submission

Trusting Right Predictions for Wrong Reasons: A LIME Based Analysis of Deep Learning Interpretability in Lung Cancer Diagnosis

信任错误理由的正确预测:基于LIME的肺癌诊断深度学习可解释性分析

Samarpan Poudel, Vladislav D Veksler

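Affiliations * Caldwell University School of Business and Computer Science

AI summary Uses LIME to analyze the decision consistency of three deep learning models (CNN, ResNet50, ViT) on lung cancer CT classification, finding that predictions agree closely while explanation regions differ markedly, so prediction agreement cannot substitute for reasoning consistency.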

AI-generated Chinese abstract

肺癌是癌症相关死亡的主要原因,每年约有250万新发病例和180万死亡病例,使得可靠诊断成为临床优先事项。尽管深度学习模型在肺癌分类中取得了强劲性能,但评估主要集中于预测准确性,其决策过程尚未得到充分检验。本研究比较了三种架构不同的模型:卷积神经网络(CNN)、预训练ResNet50和视觉Transformer(ViT),均在IQ-OTH/NCCD肺癌CT数据集上训练。应用局部可解释模型无关解释(LIME)来研究模型推理。除了标准性能指标外,还引入了一个双相关框架来测量模型对之间的预测一致性和解释一致性。所有三个模型均取得了强劲的分类性能,ResNet50达到98.61%的准确率,CNN为97.91%,ViT为93.75%,同时所有模型的ROC-AUC得分均为0.99。所有模型对的预测相关性超过0.99,表明输出高度一致。然而,LIME解释相关性仍低于0.26,揭示了用于得出这些预测的图像区域存在实质性差异。对误分类样本的分析进一步识别出一致的空间模式:错误预测与肺实质外的注意力相关,而正确预测主要集中于肺区域内部。这些发现表明,预测一致性是推理一致性的一个糟糕代理,并且可解释性评估必须被视为临床AI系统中与预测性能并列的独立验证标准。

English abstract

Lung cancer is the leading cause of cancer-related mortality, with approximately 2.5 million new cases and 1.8 million deaths annually, making reliable diagnosis a clinical priority. Although deep learning models have achieved strong performance in lung cancer classification, evaluation has largely focused on predictive accuracy, leaving their decision-making processes insufficiently examined. This study compares three architecturally distinct models: a Convolutional Neural Network (CNN), a pretrained ResNet50, and a Vision Transformer (ViT), trained on the IQ-OTH/NCCD lung cancer CT dataset. Local Interpretable Model-Agnostic Explanations (LIME) were applied to investigate model reasoning. In addition to standard performance metrics, a dual-correlation framework was introduced to measure both prediction agreement and explanation agreement across model pairs. All three models achieved strong classification performance, with ResNet50 attaining 98.61% accuracy, CNN 97.91%, and ViT 93.75%, while all achieved ROC-AUC scores of 0.99. Prediction correlations exceeded 0.99 across all model pairs, indicating highly consistent outputs. However, LIME explanation correlations remained below 0.26, revealing substantial differences in the image regions used to reach those predictions. Analysis of misclassified samples further identified a consistent spatial pattern: incorrect predictions were associated with attention outside the lung parenchyma, whereas correct predictions focused primarily within lung regions. These findings demonstrate that prediction agreement is a poor proxy for reasoning consistency, and that interpretability evaluation must be treated as an independent validation criterion alongside predictive performance in clinical AI systems.
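The dual-correlation idea can be illustrated by computing one correlation over model outputs and another over explanation weights. The sketch assumes Pearson correlation, which the abstract does not confirm, and every number below is invented for illustration rather than taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical malignancy probabilities from two models on four scans.
probs_a = [0.95, 0.10, 0.80, 0.05]
probs_b = [0.90, 0.15, 0.85, 0.10]

# Hypothetical LIME weights over the same five superpixels of one scan.
lime_a = [0.6, 0.1, 0.0, 0.2, 0.1]
lime_b = [0.0, 0.3, 0.5, 0.1, 0.1]

prediction_agreement = pearson(probs_a, probs_b)
explanation_agreement = pearson(lime_a, lime_b)
```

In this toy case the two models almost agree on what they predict while weighting different image regions, which is the pattern the abstract reports at much larger scale.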

2606.16034 2026-06-16 cs.LG New submission

Inference-Time Decision Calibration for Temporal Classification

时序分类的推理时决策校准

Arthur Chagas, Arthur Buzelin, Yan Aquino, Pedro Bento, Gisele L. Pappa, Wagner Meira, Cristiano Arbex Valle

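Affiliations * Department of Computer Science (DCC), Universidade Federal de Minas Gerais (UFMG)

AI summary Decomposes temporal classification errors into representation errors and decision errors; by freezing the native classifier and adding a residual multi-scale branch and a post-hoc branch-aware calibrator, it separates missing temporal evidence from underused decision-level evidence without retraining the backbone.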

AI-generated Chinese abstract

时序分类错误常被视为表征失败,但也可能源于可用证据转化为决策的方式。本文提出时序分类的表征-校准分解。我们冻结训练好的原生分类器,并分离两种推理时干预:一个保守的残差多尺度分支,向原生预测添加辅助logits;以及一个事后分支感知校准器,在决策时重新组合原生和残差证据。这种设计在不重训练骨干网络的情况下,区分缺失的时序证据与未充分利用的决策级证据。在FI-2010、PTB-XL、UCI-HAR、MHEALTH和HARTH上,我们发现增益强烈依赖于场景。残差多尺度证据在噪声或表征受限的设置中最有用,尤其是短时域FI-2010和较弱的循环骨干网络,而分支感知校准在原生和辅助logits包含未被原始决策规则充分利用的互补证据时有所帮助。接近饱和的场景中,两种干预的增益有限。这些结果表明,时序分类不仅应理解为表征学习,还应理解为信任、组合和校准来自多个视角的证据的问题。

English abstract

Temporal classification errors are often treated as representation failures, but they can also arise from how available evidence is converted into decisions. This paper proposes a representation--calibration decomposition for temporal classification. We keep a trained native classifier frozen and separate two inference-time interventions: a conservative residual multi-scale branch that adds auxiliary logits to the native prediction, and a post-hoc branch-aware calibrator that recombines native and residual evidence at decision time. This design distinguishes missing temporal evidence from underused decision-level evidence without retraining the backbone. Across FI-2010, PTB-XL, UCI-HAR, MHEALTH, and HARTH, we find that gains are strongly regime-dependent. Residual multi-scale evidence is most useful in noisy or representation-limited settings, especially short-horizon FI-2010 and weaker recurrent backbones, while branch-aware calibration helps when native and auxiliary logits contain complementary evidence not fully exploited by the raw decision rule. Near-saturated settings show limited gains from either intervention. These results suggest that temporal classification should be understood not only as representation learning, but also as the problem of trusting, combining, and calibrating evidence from multiple views.

2606.16031 2026-06-16 cs.CV New submission

The Third Challenge on Image Denoising at NTIRE 2026: Methods and Results

NTIRE 2026图像去噪挑战赛第三轮:方法与结果

Lei Sun, Hang Guo, Bin Ren, Shaolin Su, Xian Wang, Danda Pani Paudel, Luc Van Gool, Radu Timofte, Yawei Li

Affiliations * ETH Zurich; University of Würzburg; Beijing University of Posts and Telecommunications; Tianjin University; Nanjing University of Science and Technology; University of Beira Interior; Siddaganga Institute of Technology; National Institute of Technology Karnataka; Sardar Vallabhbhai National Institute of Technology; University of Luxembourg; University of Twente; University of Kragujevac; Prince Sultan University; University of Tunis El Manar; University of Electronic Science and Technology of China; Wuhan University; Institute of Automation, Chinese Academy of Sciences; Peng Cheng Laboratory

AI summary Reports on the NTIRE 2026 high-noise image denoising challenge, in which participating teams used advanced neural network architectures to reach state-of-the-art PSNR with no constraints on model size or compute.

Comments Accepted at CVPRW 2026

AI-generated Chinese abstract

本文报告了NTIRE 2026图像去噪挑战赛,特别关注高噪声场景(σ=50)。该竞赛研究了旨在从加性高斯白噪声(AWGN)污染的图像中恢复高保真细节的先进神经架构。与受约束的基准不同,本赛道强调峰值定量性能,以峰值信噪比(PSNR)衡量,且不限制参数数量或计算开销。通过综合116名注册者中20个入围团队的贡献,本报告对最新的技术创新进行了基准测试,并提供了无约束图像恢复领域当前最先进技术的全面快照。

English abstract

This paper reports on the NTIRE 2026 Challenge on Image Denoising, specifically focusing on the high-noise regime ($\sigma = 50$). The competition investigates advanced neural architectures designed to restore high-fidelity details from images corrupted by additive white Gaussian noise (AWGN). Unlike constrained benchmarks, this track emphasizes peak quantitative performance, measured by Peak Signal-to-Noise Ratio (PSNR), without limitations on parameter count or computational overhead. By synthesizing contributions from 20 finalist teams out of 116 registrants, this report benchmarks the latest technical innovations and provides a comprehensive snapshot of the current state of the art in unconstrained image restoration.
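To give a sense of the task's difficulty, the sketch below adds AWGN with a standard deviation of 50 to a flat 8-bit signal and computes PSNR. It assumes a peak value of 255 and leaves the noisy values unclipped and unquantized, which a real benchmark pipeline may handle differently; the function names and the flat test signal are illustrative.

```python
import math
import random

def add_awgn(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def psnr(reference, estimate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)

clean = [128.0] * 20000
noisy = add_awgn(clean, sigma=50.0)
noisy_psnr = psnr(clean, noisy)

# Expected PSNR of the unclipped noisy input: 20 * log10(255 / 50).
theoretical_psnr = 20.0 * math.log10(255.0 / 50.0)
```

Under these assumptions the noisy input sits near 14.15 dB, so any reported denoising result is a gain measured from roughly that starting point.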

2606.16028 2026-06-16 cs.LG cs.IT math.FA math.IT New submission

The Information-Theoretic Benefit of Shared Representations under Orthogonality Constraints

Thomas Dittrich, Oliver Potocki, Philipp Grohs

Affiliations: Johann Radon Institute for Computational and Applied Mathematics, Austrian Academy of Sciences; Faculty of Mathematics, University of Vienna

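AI summary: Using an information-theoretic framework, this paper proves that under orthogonality constraints joint approximation requires a shorter description length than separate approximation, revealing the efficiency advantage of shared representations in compositional architectures.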

Abstract

Modern deep learning architectures are increasingly multi-task and multi-modal, using a pretrained foundation model combined with task-specific, fine-tuned models. Empirically, exploiting similarity across different problems, instead of solving them individually, can significantly improve overall performance. While the generalization and sample complexity properties of multitask learning have been widely studied, the parametric complexity of joint approximation in comparison to separate approximation remains less well understood. The question is particularly relevant in modern deep learning, where models are increasingly required to satisfy structural constraints such as equivariance, conservation laws, or orthogonality. We prove lower and upper bounds on the description length for separate and joint approximation classes, respectively, in the uniform norm. We build a class of orthogonal functions by composing a shared hard feature, realized by a Rademacher-Haar wavelet series, with Sawtooth-Walsh readouts to enforce orthogonality of output coordinates. The dyadic tree structure of the Rademacher-Haar wavelet concentrates the approximation hardness in the common feature component, while the readouts act as task-specific heads. Using an information-theoretic framework, we obtain a sharp gap between the optimal approximation rates achievable by joint and separate coding. Finally, we realize this separation in a neural network model using Heaviside activations via reduction to triangle-wave approximation. Our results show that, even under an orthogonality constraint, joint approximation requires strictly fewer bits in compositional architectures, provided the tasks share a latent hard feature. This provides theoretical insight into the description-length efficiency of compositional multi-output architectures and clarifies how neural networks can retain expressivity under geometric constraints.

2606.16026 2026-06-16 cs.CL New submission

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

Isaac Hands, Bin Huang, Adam Spannaus, John Gounley, Heidi Hanson, Eric Durbin, Sally R. Ellingson

Affiliations: University of Kentucky; UK Markey Cancer Center; Kentucky Cancer Registry; Division of Cancer Biostatistics, University of Kentucky; Oak Ridge National Laboratory; Division of Biomedical Informatics, University of Kentucky

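AI summary: Proposes an in-domain supervised pipeline to address the performance drop of pathology report classifiers across cancer registries. With standardized data curation, a production-matched holdout set, and operating-point selection for a low false-negative rate, it lowers FNR to 0.003 and raises F1 to 0.922 on a 418k-report holdout set.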

Abstract

We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. It describes how to build the in-domain training set and a production-matched holdout, and to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation with facility-stratified sampling and separate handling of reports linked to registry cases, and includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.
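
The operating-point idea in this abstract (keep FNR very low while limiting how many reports reviewers must read) can be sketched in a few lines of Python. This is a hedged illustration, not the authors' pipeline: the helper names `confusion`, `rates`, and `pick_threshold`, the FNR budget, and the toy scores and labels are all hypothetical.

```python
def confusion(scores, labels, threshold):
    """Count (tp, fp, tn, fn) when a report is flagged if score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label:
            tp += 1
        elif flagged:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (FNR, FPR, F1); empty denominators yield 0.0."""
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return fnr, fpr, f1


def pick_threshold(scores, labels, max_fnr):
    """Highest observed score usable as a threshold with FNR <= max_fnr."""
    if not scores:
        raise ValueError("need at least one scored report")
    if max_fnr < 0:
        raise ValueError("max_fnr must be non-negative")
    # FNR can only fall as the threshold drops, so the first hit is the highest.
    for threshold in sorted(set(scores), reverse=True):
        fnr, _, _ = rates(*confusion(scores, labels, threshold))
        if fnr <= max_fnr:
            return threshold
    raise AssertionError("unreachable: the lowest score flags every report")


# Toy classifier scores and ground-truth labels (hypothetical data).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
threshold = pick_threshold(scores, labels, max_fnr=0.2)
fnr, fpr, f1 = rates(*confusion(scores, labels, threshold))
print(threshold, fnr, fpr, f1)
```

Choosing the highest threshold that still meets the FNR budget flags the fewest reports for review, which is one simple way to trade missed cases against reviewer workload.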

2606.16023 2026-06-16 cs.LG New submission

IBAD: Interpretable Behavioral Anomaly Detection on Human Mobility Data

Bita Azarijoo, John Krumm, Cyrus Shahabi

Affiliations: University of Southern California

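AI summary: Proposes the IBAD framework, which uses LDA to learn interpretable daily mobility templates and a hierarchical self-supervised model to detect individual behavioral anomalies. Experiments on real-world and synthetic datasets validate the transferability and robustness of the templates.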

Abstract

Human mobility appears highly diverse, yet much of a person's daily mobility can be explained by a small set of recurring behavioral templates, such as commuting, school-centered activities, caregiving, nightlife, or errand patterns. We present IBAD (Interpretable Behavioral Anomaly Detection), a framework that learns interpretable daily mobility templates and represents each individual as a distribution over mixtures of these templates. Rather than focusing on specific locations, IBAD characterizes activities that individuals perform across locations. This approach first discovers global behavioral templates using Latent Dirichlet Allocation (LDA), then employs a hierarchical self-supervised model to learn normal behavior of individuals from their soft behavioral templates. We also introduce a splicing benchmark that creates controlled behavioral mismatches between an individual's historical profile and injected mobility patterns. Experiments on real-world and synthetic datasets show that daily behavior can be effectively decomposed into a small number of interpretable templates. Crucially, we show that the learned behavioral archetypes transfer across distinct geographic and demographic contexts. Furthermore, IBAD maintains robust, competitive performance across all settings. For reproducibility, the code is available at https://github.com/USC-InfoLab/IBAD.