arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

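AI Large Models

Large Language Models / LLM

Large language models, pretraining, instruction tuning, post-training, and language model applications.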

Indexed today (current date): 93. Signal sources: cs.CL, cs.AI, cs.LG
2606.18867 2026-06-18 cs.LG cs.CY stat.ML New submission 55%

Strategic Feature Selection


Jivat Neet Kaur, Pratik Patil, Divya Shanmugam, Emma Pierson, Michael I. Jordan, Nika Haghtalab, Meena Jagadeesan, Ahmed Alaa, Serena Wang

Affiliations * University of California, Berkeley; University of Texas, Austin; Cornell Tech; Stanford University; University of Pennsylvania; Harvard University; Inria, Paris

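Topic match: Other LLM. Strategic feature selection and classification; not core LLM work, but algorithm-related.

AI summary: The paper studies classification under strategic manipulation through feature selection and ridge regularization. It finds that excluding features based on manipulability alone is generally suboptimal, proposes an algorithm that jointly optimizes the feature set and the regularization level, and validates it on a healthcare payments benchmark.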

Abstract

When algorithmic predictors inform resource allocation in high-stakes domains such as healthcare, these predictors must account for strategic manipulation of input features. The typical solution is to redesign the predictor itself to explicitly account for strategic interactions. In practice, however, decision makers are often constrained to adjusting coarser levers within existing prediction pipelines. For example, healthcare organizations often select which features to exclude based on perceived manipulability, while using standard regularization procedures to shrink the coefficients of retained features. In this work, we initiate a formal study of strategic classification through feature selection and its interaction with ridge regularization. Our main finding is that excluding individual features based on their manipulability alone is generally suboptimal. We provide a fine-grained characterization of the performance of a feature subset under optimal regularization, yielding new insights for policy design. Motivated by this characterization, we develop a practical algorithm for jointly choosing the feature set and the level of ridge regularization. Through a real-world case study on a healthcare payments benchmark, we illustrate how our algorithm can guide the design of coarse policy levers in practice. Our results provide a principled, practical framework for mitigating the effects of strategic behavior in algorithmic decision-making systems.
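To make "jointly choosing the feature set and the level of ridge regularization" concrete, here is a minimal sketch. It is not the paper's algorithm, which the abstract does not spell out. It assumes a toy manipulation model in which each agent shifts feature j by `gain[j] * w_j` (the best response to a linear score under a quadratic cost), and it simply tries every feature subset against a small grid of positive ridge levels. The names `solve`, `ridge`, `strategic_mse`, `select`, and `gain` are all hypothetical.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * pivot for a, pivot in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge(X, y, lam):
    """Ridge coefficients without intercept: (X^T X + lam I)^{-1} X^T y."""
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve(A, b)

def strategic_mse(X, y, w, cols, gain):
    """Mean squared error after agents game the retained features.

    Assumption (toy model): feature j moves by gain[j] * w_j, so features
    with larger manipulability gain[j] are distorted more.
    """
    err = 0.0
    for row, t in zip(X, y):
        pred = sum(wj * (row[j] + gain[j] * wj) for wj, j in zip(w, cols))
        err += (pred - t) ** 2
    return err / len(y)

def select(X, y, gain, lambdas):
    """Exhaustively pick (score, feature subset, ridge level) with lowest strategic MSE."""
    best = None
    d = len(X[0])
    for k in range(1, d + 1):
        for cols in combinations(range(d), k):
            Xs = [[row[j] for j in cols] for row in X]
            for lam in lambdas:
                w = ridge(Xs, y, lam)
                score = strategic_mse(X, y, w, cols, gain)
                if best is None or score < best[0]:
                    best = (score, cols, lam)
    return best
```

Calling `select(X, y, gain, [0.1, 1.0, 10.0])` returns the best subset and ridge level under this toy model. The exhaustive loop is exponential in the number of features, so it only illustrates the joint search and is practical for a handful of features at most.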

2606.18833 2026-06-18 cs.LG New submission 55%

Seed-Guided Semi-Supervised Clustering by A-Contrario Anomaly Detection


Nassir Mohammad

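Affiliations * Cyber Innovation Lab, Airbus, Newport, UK

Topic match: Other LLM. Semi-supervised clustering framework; not core LLM work.

AI summary: Proposes a semi-supervised clustering framework based on a statistical duality between grouping and anomaly detection. Using a-contrario reasoning and the Perception algorithm, it initializes clusters from seed labels and iteratively excludes anomalous points, yielding robust clustering with strong performance from only a few seeds.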

Abstract

This paper introduces a semi-supervised clustering framework grounded in the statistical duality between grouping principles and anomaly detection. We address the challenge of robust cluster definition in noisy environments, a task where partitioning algorithms often over-assign outliers and density-based methods remain sensitive to heuristic global parameters. Drawing on a-contrario statistical reasoning and Gestalt proximity principles, we define a cluster as a maximal subset of data points containing no anomalies relative to a null hypothesis of uniform randomness. Central to this approach is the Perception algorithm, which utilises a principled expectation-based threshold ($\mathbb{E} < 1$) to identify outliers without manual parameter tuning. By treating clustering as the dual of anomaly detection, we employ an iterative "clustering-by-exclusion" mechanism. The algorithm is seed-guided, leveraging minimal user-provided labels to initialise robust cluster medians and form initial groups, which are subsequently expanded by admitting non-anomalous points. This approach naturally isolates fringe points, isolated noise, and emerging unknown clusters. We evaluate the method on synthetic and real-world benchmarks, including image and text datasets represented through raw, linear-reduced, and neighbourhood-preserving embeddings. Results demonstrate that with as few as 10-30 seeds per cluster, the proposed method achieves competitive and often very strong performance under a practical low-tuning benchmarking protocol, while maintaining linear scalability with respect to both observations and dimensionality for a fixed number of seeded clusters and iterations.
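The expectation-based threshold and the seed-guided, clustering-by-exclusion loop can be illustrated with a small one-dimensional sketch. This is not the paper's Perception algorithm. As an assumption made purely for illustration, it swaps in a Gaussian tail as the null model (the paper uses a null hypothesis of uniform randomness) and a MAD-based scale; `expected_count` and `grow_cluster` are hypothetical names. A point counts as an anomaly when the expected number of equally extreme points among all n observations falls below 1.

```python
import math
from statistics import median

def expected_count(x, centre, scale, n):
    """Expected number of points, out of n, at least as far from centre as x.

    Assumption: a Gaussian null model stands in for the paper's null.
    """
    z = abs(x - centre) / scale
    return n * math.erfc(z / math.sqrt(2))

def grow_cluster(points, seeds, max_iter=10):
    """Start from seed members, then repeatedly admit all non-anomalous points."""
    members = list(seeds)
    n = len(points)
    for _ in range(max_iter):
        centre = median(members)                       # robust cluster centre
        mad = median([abs(v - centre) for v in members])
        scale = max(1.4826 * mad, 1e-9)                # MAD scaled to a std estimate
        admitted = [p for p in points
                    if expected_count(p, centre, scale, n) >= 1.0]
        if sorted(admitted) == sorted(members):        # membership is stable
            break
        members = admitted
    return members
```

For example, `grow_cluster([9.0, 9.5, 10.0, 10.5, 11.0, 10.2, 9.8, 50.0], [9.5, 10.0, 10.5])` grows the seed group to the points near 10 and leaves 50.0 outside the cluster, since its expected count under the null is far below 1.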

2606.18807 2026-06-18 cs.DS cs.LG New submission 55%

Learning Augmented Exact Exponential Algorithms


Tatiana Belova, Yuriy Dementiev, Danil Sagunov

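Affiliations * ITMO University

Topic match: Other LLM. Learning-augmented exponential-time algorithms; weakly related to LLMs.

AI summary: Proposes a general approach in which a noisy predictor only marginally better than random guessing provably reduces the search space for NP-hard subset selection problems. The runtime speedup scales smoothly with prediction quality, and the algorithms need only pairwise independence of predictions or, alternatively, no knowledge of the predictor's accuracy.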

Abstract

The field of learning-augmented algorithms has demonstrated that machine-learned predictions can bypass worst-case lower bounds across a wide range of problems. So far, however, the focus has been almost exclusively on polynomial-time algorithms, where predictions improve competitive ratios, approximation guarantees, or running times. In this paper, we raise the question of whether predictions can push the frontier of exact exponential-time algorithms for NP-hard problems. We answer this question affirmatively by proposing a general approach that augments an entire family of state-of-the-art exact algorithms for a variety of subset selection problems. We show that a noisy predictor that is only marginally better than random guessing suffices to provably reduce the search space, and that the resulting runtime speedup scales smoothly with the prediction quality. Importantly, our algorithms require only pairwise independence of predictions or, alternatively, do not require the knowledge of the predictor's accuracy - both strictly weaker and more realistic settings than typically assumed.
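One simple way to see how a prediction can shrink the search space of an exact subset-selection algorithm is to enumerate candidate subsets in order of Hamming distance from the predicted solution. The sketch below does this for subset sum. It is a textbook-style illustration, not the paper's approach or its guarantees, and `guided_subset_sum` and its parameters are hypothetical names. The search remains exact, because every subset is eventually visited, but a prediction that is wrong on only r of the n elements is found after checking at most the subsets within distance r.

```python
from itertools import combinations

def guided_subset_sum(values, target, predicted):
    """Exact subset-sum search ordered by Hamming distance from a prediction.

    predicted: indices that a (possibly noisy) predictor believes are in the
    solution. Returns (sorted solution indices or None, subsets checked).
    """
    n = len(values)
    base = set(predicted)
    checked = 0
    for r in range(n + 1):                          # r = number of corrected predictions
        for flips in combinations(range(n), r):
            candidate = base.symmetric_difference(flips)
            checked += 1
            if sum(values[i] for i in candidate) == target:
                return sorted(candidate), checked
    return None, checked                            # all 2^n subsets ruled out
```

With `values = [1, 2, 4, 8, 16, 32]` and `target = 41`, a prediction of `{0, 3, 4}` is two flips away from the solution, so the search stops well before exhausting all 64 subsets; a perfect prediction stops after a single check.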