arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.27404 2026-05-28 cs.CY cs.AI

Smaller, Younger, and More Impactful: How AI-Assisted Writing Transforms Research Teams

Haoyang Wang, Mingze Zhang, Yi Bu, Star Xing Zhao, Meijun Liu

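AI summary Using full texts from PLoS and Nature portfolio journals published since 2020, this study applies several regression methods and finds that research teams using AI-assisted writing tend to be younger and smaller, with no loss of scientific impact and possibly a gain.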

Abstract

The era of Big Science has long been defined by increasingly large and specialized research teams pushing the frontiers of knowledge. However, recent advances in artificial intelligence (AI), particularly large language models (LLMs), are beginning to reshape academic writing and scientific research, potentially disrupting the longstanding trend toward ever-larger teams and transforming other dimensions of research team structure. Drawing on 147,074 full-text publications from the PLoS family and the Nature portfolio since 2020, we examined whether and how AI-assisted writing influences team structure and team outcomes in science. Using multiple methods, including ordinary least squares, quantile regression, Poisson regression, logistic regression, and propensity score matching, we found that research teams using AI-assisted writing tend to be younger and smaller. Importantly, this shift toward more compact, junior-leaning teams does not come at the expense of scientific impact. On the contrary, we observed that research teams employing AI-assisted writing had a higher probability of producing highly impactful publications. These results highlight the significant role of AI-assisted writing in reshaping not only how research is produced, but also how research teams are formed and assembled. Our findings call for policy improvements in research evaluation, funding, and training to address this emerging trend.

2605.27403 2026-05-28 cs.CY cs.AI

LLM-assisted sentiment analysis for integrated computational and qualitative mixed methods education research: A case study of students' written reflection assignments

Xiomara Gonzalez, Gabriella Coloyan Fleming, Andrew Katz, Maya Denton, Jessica Deters

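AI summary Through a longitudinal case study, this work combines LLM-assisted sentiment analysis with statistical testing and thematic analysis to examine how student identity variables affect sentiment about language and communication during study abroad, and finds that prior experience living abroad is the only significant variable.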

Abstract

Written reflection assignments give students valuable opportunities for critical self-assessment, meaning making, and learning processing. Additionally, such reflections provide rich data for qualitative education research. However, qualitative data can be time-consuming to analyze. It is even more time-intensive to qualitatively compare findings between different groups of participants, usually limiting comparison to, at most, one variable (e.g., binary gender). Large language models (LLMs) have recently begun to be critically evaluated for use as qualitative research assistants. Using a longitudinal case of written student reflections (n=151) from a study abroad program, we investigate how LLM-assisted sentiment analysis can enable longitudinal mixed-methods research combining computational and thematic analyses. First, statistical testing is used to quantitatively compare sentiment differences according to seven different student identity/lived experience variables. Then, these results inform qualitative data analysis to investigate the reasons underlying these differences. For the case of undergraduate students studying abroad, we found that prior experience living abroad was the only personal variable impacting students' sentiments of their verbal language and communication behaviors. This workflow has implications for how qualitative researchers can more easily probe multiple variables when comparing participants from different demographic groups.

2605.27402 2026-05-28 cs.CY cs.AI cs.CL

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu

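AI summary Proposes REC-CBM, which uses a rubric-aware concept encoder, an ordinal pairwise calibration objective, and a latent concept error-correction module to address three shortcomings of standard concept bottleneck models in open-ended grading: they do not model fine-grained rubric dimensions, they ignore the ordinal semantics of scores, and they rely on unreliable concept annotations. The model improves grading performance while preserving interpretability.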

Abstract

Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM) based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than both state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate the applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.

2605.27401 2026-05-28 cs.CY cs.AI

Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

Taylor Anderson, Sara Von Hoene, Orhan Yagizer Cinar, Emma Von Hoene, Amira Roess, Andrew Crooks, Hamdi Kavak

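AI summary This paper evaluates whether health survey data generated by zero-shot large language models can serve as input to a conventional iterative proportional fitting workflow for geographically explicit population synthesis, and finds that it can serve as a supplementary input but cannot yet replace real survey data.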

Comments 15 pages, 5 figures, 3 tables

Abstract

There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a tremendous growth in artificial intelligence in all walks of life. This paper evaluates whether zero-shot large language model (LLM)-generated health survey data can serve as inputs to a conventional iterative proportional fitting (IPF) workflow for geographically explicit population synthesis. Using the 2023 Behavioral Risk Factor Surveillance System (BRFSS), we generate synthetic survey records for the U.S. states of Colorado and Mississippi with GPT-4.1 and Gemini-2.5-Pro. We use the generated data in an IPF-based synthesis pipeline and evaluate the resulting census tract-level synthetic populations against external benchmarks. Results show both LLMs capture several major state-level contrasts, indicating zero-shot generation produces geographically differentiated survey data. However, performance is strongly variable-dependent. Downstream effects in population synthesis are mixed, as IPF sometimes amplifies or reduces errors in the generated data. Spatial validation shows that LLM-based populations reproduce census tract-level patterns reasonably well, especially for variables that were more aligned with the ground truth data. Overall, the LLM-generated survey data shows promise as supplementary input, but not yet as a replacement for real survey data.
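
For readers unfamiliar with iterative proportional fitting, the following is a minimal sketch of the generic two-dimensional algorithm, not the authors' pipeline. It assumes a seed table of joint counts (standing in for survey-derived cross-tabulations) and target row and column totals (standing in for census tract margins). The function name `ipf` and all numbers are illustrative assumptions.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2D seed table until its margins match the target margins."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: scale each column so that it sums to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows match as well.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


# Hypothetical seed (e.g. age group x health status) and hypothetical tract margins.
seed = [[10.0, 20.0], [30.0, 40.0]]
row_targets = [40.0, 60.0]
col_targets = [35.0, 65.0]
fitted = ipf(seed, row_targets, col_targets)
for row in fitted:
    print([round(v, 3) for v in row])
```

IPF only rescales rows and columns, so the association structure of the seed (its cross-product ratios) carries through to the fitted table. This is one way errors in generated seed data could propagate into a synthetic population.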

2605.27400 2026-05-28 cs.CY cs.AI cs.CC cs.ET cs.GT cs.MA

Mathematical Modelling of Ethical AI Use in Higher Education: A Coordination Game Framework for Future-Facing Learning

Ndidi Bianca Ogbo, Zhao Song, Shatha Ghareeb, The Anh Han

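AI summary Using a coordination game-theoretic framework, this paper studies how norms of responsible or opportunistic AI use form within student cohorts and shows how assessment incentives can trigger behavioural shifts.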

Abstract

The rapid uptake of generative artificial intelligence (AI) in higher education is reshaping assessment practices and intensifying concerns around academic integrity, fairness, and learning quality. While institutional responses increasingly emphasise policy guidance and ethical principles, there remains limited formal understanding of how collective norms of responsible or opportunistic AI use emerge and stabilise within student cohorts. This paper reframes student AI use in assessment as a coordination problem shaped by peer expectations and assessment design rather than individual compliance alone. We develop a coordination-based evolutionary game-theoretic framework that captures learning value, effort, perceived fairness, and transparency, with institutional AI governance modelled implicitly through reflective assessment incentives. We use analytical results and finite-population simulations to reveal threshold-driven behavioural transitions in student AI use: small, well-calibrated changes in reflective assessment incentives can trigger rapid shifts towards responsible, learning-oriented AI-use norms, whereas weak or misaligned incentives allow opportunistic practices to persist. These non-linear dynamics explain why policy statements alone often fail to change behaviour, while modest assessment redesigns can have disproportionate effects. By providing a mechanism-level account of how assessment structures shape collective AI-use practices, this work offers higher education institutions an analytically grounded tool for Future Facing Learning, supporting proportionate, pedagogy-led AI governance without reliance on surveillance or punitive enforcement.
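
The threshold behaviour described in the abstract can be illustrated with a textbook two-strategy coordination game. The sketch below is not the paper's model: the payoff values, the `incentive` parameter, and the function name `tipping_point` are all assumptions chosen for illustration. It computes the share of responsible users above which responsible use earns the higher expected payoff.

```python
def tipping_point(a, b, c, d, incentive=0.0):
    """Share of responsible (R) users above which R pays better than O.

    Hypothetical payoffs: a = R meets R, b = R meets O,
    c = O meets R, d = O meets O (O = opportunistic use).
    `incentive` is a bonus added to both payoffs of the R strategy.
    """
    a, b = a + incentive, b + incentive
    gain_r = a - c  # advantage of R when everyone else plays R
    gain_o = d - b  # advantage of O when everyone else plays O
    if gain_r > 0 and gain_o > 0:
        # Coordination game: R is the better reply once the share of R
        # exceeds this unstable interior equilibrium.
        return gain_o / (gain_r + gain_o)
    if gain_r >= 0 and gain_o <= 0:
        return 0.0  # R is never worse, so it can spread from any starting share
    if gain_r <= 0 and gain_o >= 0:
        return 1.0  # O is never worse, so opportunistic use persists
    raise ValueError("payoffs do not describe a coordination game")


# A modest incentive moves the threshold; a larger one removes it entirely.
for bonus in (0.0, 0.5, 2.0):
    print(bonus, tipping_point(3, 0, 2, 2, incentive=bonus))
```

With these assumed payoffs, the share of responsible peers a student needs to see before responsible use pays off falls as the incentive rises, and disappears once the incentive offsets the opportunistic advantage.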

2605.27399 2026-05-28 cs.CY cs.AI

Short-Term Gain, Long-Term Fragility: AI Labor Substitution and the Erosion of Sustainable Capability

Wolfgang Rohde

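AI summary This paper proposes mechanisms of capability masking and capability erosion, arguing that AI labor substitution raises short-term efficiency while increasing long-term system fragility by drawing down human capabilities that are hard to rebuild.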

Comments 19 pages, 7 figures, Also available on SSRN: https://doi.org/10.2139/ssrn.6577818

Abstract

What looks like acceleration can be a quiet transfer of burden from the present to the future. Attempts to replace human labor with AI systems are often presented as rational responses to technological progress, but that view is often structurally short-sighted. Across software development and adjacent knowledge industries, AI is increasingly attractive because it appears to reduce labor costs, speed output, and improve short-term metrics. Yet those gains may be achieved by drawing down human capabilities that are slow to build and difficult to restore. This paper develops a mechanism of capability masking and capability erosion under AI labor substitution. AI-generated output can create the appearance that organizational capability has been replaced, even when dependence on skilled human labor remains. That appearance can support hiring restraint while slower costs accumulate in the background. Evidence from AI-assisted coding shows that generated output still requires substantial human verification and remains uneven in correctness, maintainability, and security. Repository-level studies also suggest limits in handling broader codebase context. More broadly, labor-market, political-economy, and industrial-strategy evidence suggests that substitution pressures are being driven by managerial cost incentives and national competition while increasing risks of concentration and platform control. The result is a system that may look more efficient in the short term while becoming more fragile over time.

2605.27396 2026-05-28 cs.CY cs.AI

Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named

Rohith Nama

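AI summary This paper introduces the concept of "agentic literacy debt", arguing that when autonomous AI agents are deployed at scale, users who lack the capacity to oversee them face an accumulating societal deficit, and uses cases from healthcare, finance, and other domains to argue that the problem is structural.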

Journal ref AI & Ethics, 2026

Abstract

Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often without step-by-step human approval. Existing AI literacy frameworks were built for a world in which humans evaluate AI outputs and decide whether to act; they have no vocabulary for the user who has delegated decision-making authority to an agent whose actions may not be observable, reversible, or controllable. This paper names the resulting problem agentic literacy debt: the accumulating societal deficit that grows when agentic AI systems are deployed at scale without corresponding literacy infrastructure. The debt compounds through three reinforcing channels (normalization of opaque delegation, multi-agent ecosystem complexity, and institutional path dependence), and it is incurred by the organizations that deploy agents but paid by the users, patients, and citizens on whose behalf the agents act. Evidence from healthcare, financial fraud, and global equity contexts suggests the gap is already consequential. The problem is structural, not a temporary lag that curriculum reform will close. It demands a reframing of AI literacy as a governance capability, not an evaluative one.

2605.27395 2026-05-28 cs.CY cs.AI

Informing AI Policy Assessment using Large-Scale Simulation of Interventions

Julia Barnett, Kimon Kieslich, Natali Helberger, Nicholas Diakopoulos

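AI summary Proposes a method that combines participatory evaluation, expert cost assessment, and LLM-based harm-mitigation assessment, and uses a genetic algorithm simulation to explore the space of policy combinations and identify viable policy options for mitigating specific AI harms.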

Comments This work will be published in the proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2026. 15 pages plus end matter and appendix

Abstract

As the rapid proliferation of AI systems and harms spurs efforts in AI governance around the world, prioritizing among competing policy options has become increasingly challenging for policymakers and researchers. We introduce a methodology for identifying viable policy options to mitigate specified AI harms, helping policymakers and researchers target areas that warrant greater time and resource investment. This method combines participatory evaluation of policies, expert assessment of implementation costs, and an LLM-based assessment of perceived harm mitigation under each policy option. We leverage a genetic algorithm-based simulation study to explore a vast solution space of potential policy combinations, and examine how outcomes change under different weightings of cost, participatory input, and harm mitigation. We find that this method enables exploration of different balances between participatory and expert components, allowing policymakers and researchers to assess how much weight to assign to each. We argue that the diversity of viable policy combinations found by the genetic algorithm could be a useful starting point for deliberation. This method operationalizes existing work on participatory AI by integrating it directly into practical policy development pipelines.
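
To make the search procedure concrete, here is a minimal sketch of a genetic algorithm over policy combinations under different weightings. It is not the authors' implementation. The policy scores in `POLICIES`, the additive `fitness` function, and every parameter value are hypothetical; the abstract does not specify the actual objective.

```python
import random

# Hypothetical policy options, each scored in [0, 1] as
# (implementation cost, participatory support, harm mitigation).
POLICIES = [
    (0.8, 0.6, 0.9),
    (0.3, 0.7, 0.4),
    (0.5, 0.2, 0.6),
    (0.2, 0.9, 0.3),
    (0.9, 0.4, 0.8),
    (0.4, 0.5, 0.5),
]


def fitness(bits, w_cost, w_part, w_harm):
    """Weighted score of a policy combination; cost counts against it."""
    score = 0.0
    for chosen, (cost, part, harm) in zip(bits, POLICIES):
        if chosen:
            score += w_part * part + w_harm * harm - w_cost * cost
    return score


def evolve(weights, pop_size=20, generations=40, mutation_rate=0.1, seed=0):
    """Return the best combination found and the best score per generation."""
    rng = random.Random(seed)
    n = len(POLICIES)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda b: fitness(b, *weights), reverse=True)
        history.append(fitness(pop[0], *weights))
        next_pop = [pop[0][:]]            # elitism: always keep the current best
        parents = pop[: pop_size // 2]    # truncation selection
        while len(next_pop) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=lambda b: fitness(b, *weights))
    return best, history


# Compare a cost-sensitive weighting with a balanced one.
for weights in [(3.0, 1.0, 1.0), (1.0, 1.0, 1.0)]:
    best, _ = evolve(weights)
    print(weights, best, round(fitness(best, *weights), 2))
```

Rerunning `evolve` with different weight tuples mirrors the abstract's idea of examining how the preferred policy combinations change as cost, participatory input, and harm mitigation are weighted differently.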

2605.27394 2026-05-28 cs.CY cs.AI cs.HC cs.MA

Human-AI Collaboration for Estimating Scientific Replicability

Tatiana Chakravorti, Robert Fraleigh, Timothy Fritton, Christopher Griffin, Vaibhav Singh, Sai Koneru, C. Lee Giles, David Pennock, Anthony Kwasnica, Sarah Rajtmajer

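AI summary Proposes a hybrid prediction market in which algorithmic agents and human traders jointly estimate the replicability of scientific findings through real-time trading; experiments show that hybrid markets match or outperform artificial-only prediction markets in all but a few cases.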

Abstract

Determining whether published scientific findings can successfully be replicated is a long-standing challenge in the empirical sciences. Existing approaches for replicability assessment typically rely either on human judgment, i.e., creative assembly of human experts, or on machine learning models trained on paper content metadata. While both approaches have demonstrated value, each also has important limitations. Human forecasts can be influenced by cognitive biases and narrow exposure to the research literature, while automated assessments often struggle to capture contextual cues and subtle signals of credibility. In this paper, we examine a hybrid approach. Specifically, we introduce a hybrid prediction market in which algorithmic agents trade alongside human participants to jointly estimate the likelihood that a published scientific finding will be corroborated via the outcome of a controlled replication study. Agents are trained on outcomes from hundreds of prior replication studies while human participants contribute domain knowledge through real-time trading. We evaluate this hybrid approach through multiple live experiments involving participants from different academic disciplines and compare its performance to artificial-only and human-only baselines. Our results show that, except for a few cases, hybrid markets match or outperform artificial prediction markets, producing more accurate and reliable replication forecasts.

2605.27391 2026-05-28 cs.CY cs.AI

Learning after COVID-19 and the ICT career aspirations: Are students entering the AI era with weaker skills?

Diana Maria Popa, Simona-Vasilica Oprea, Adela Bâra

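AI summary Using PISA 2018 and 2022 data, a mixed-method analysis of the relationship between learning environments and ICT career aspirations finds that digital skills are the strongest predictor, teacher support plays a complementary role, and autonomy has weaker, context-dependent effects.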

Abstract

This paper examines whether students are entering the generative AI era with sufficiently strong educational foundations, focusing on the relationship between learning environments and changes in ICT-related career aspirations across countries. The analysis uses country-level data from PISA 2018 and 2022, combining indicators of student autonomy, digital skills, and teacher support. A mixed-method approach is applied, including descriptive statistics, regression analysis, clustering, latent representation learning (using a variational autoencoder, VAE), discriminant analysis, and probabilistic modeling, to capture both observable and latent dimensions of educational readiness. Unlike prior research that treats learning loss, digital skills, and career expectations separately, our analysis integrates them within a comparative longitudinal framework. It shifts the focus from short-term post-pandemic effects to the structural capacity of education systems to prepare students for digital and AI-driven labor markets. Results show a global but uneven increase in ICT career aspirations. Digital skills emerge as the strongest and most consistent predictor, while teacher support plays a complementary role. Autonomy shows weaker, context-dependent effects. Educational readiness is multidimensional, and ICT aspirations evolve relatively independently from other career domains.

2605.27389 2026-05-28 cs.IR cs.AI cs.CL

Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

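AI summary Comparing contextual and memory-based conditioning in a teacher-facing educational recommender system, the study finds that contextual recommendations respond more strongly to the current question, while memory-based recommendations show history-dependent behavior, including learner-specific differentiation under identical input.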

Comments Accepted to ITS 2026

Abstract

We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual conditioning based on the current student question with memory-based conditioning using persistent learner information. Using deviation correlation and paired statistical tests, we find that contextual recommendations exhibit stronger question-level responsiveness, while memory-based recommendations exhibit history-dependent behaviors, including learner-specific differentiation under identical input. Teacher-facing evaluation signals suggest these recommendations are interpretable and actionable. These results indicate that embedding-based similarity metrics capture responsiveness to the current question but do not characterize personalization grounded in learner history, motivating behavior-level diagnostics for studying conditioning effects.

2605.27384 2026-05-28 cs.HC cs.AI cs.CL

From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

Danai Korre

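AI summary A comparative study with 90 participants examines user preference between a highly human-like spoken embodied agent and a low human-like text-based agent in a mobile serious game, finds a significant preference for the highly human-like agent, and discusses how roles, mixed-initiative dialogue, and breakdowns and repairs affect human-agent collaboration in goal-oriented tasks.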

Comments 4 pages, 5 figures, ACM CHI 2026 workshop paper

Abstract

This position paper reflects on empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a highly human-like, spoken embodied conversational agent (ECA) against a low human-like text-based agent (no embodiment, text bubble only) within a mobile, Unity-developed game about pre-decimal UK currency. The game included two agents with different roles: an Instructor (Alex) and a Shopkeeper/Collaborator. Users interacted using voice and mouse input. The quantitative data I collected included a usability questionnaire (CCIR MINERVA) and the Agent Persona Instrument. Data were analyzed using paired t-tests, repeated-measures ANOVA, and multiple linear regression to identify correlations between the persona and usability. The results showed a statistically significant preference for the version with highly human-like agents, with a large effect size. This is further discussed alongside qualitative findings from observations and exit interviews. The results are framed for Human-Agent collaboration, especially for how roles, mixed-initiative dialogue, and breakdowns/repairs become apparent in goal-oriented tasks. I conclude with questions on timing, user expectations, and role-specific interactions. This submission does not propose new frameworks; it reports empirical findings and questions I hope to workshop with the community.

2605.27381 2026-05-28 cs.CC cs.AI

The Computational Boundary of Inference: Capability Internalization, Training, and the Turing Jump

Chien-Ping Lu

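AI summary Using classical computability theory, this paper proves that finite internal self-modification cannot leave the current computational layer, whereas stabilized revision reaches a stronger layer via the Turing jump, which draws a computational boundary for recursive self-improvement narratives.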

Comments 11 pages, 1 figure, v2

Abstract

Claims about recursive self-improvement in AI often slide from repeated internal revision to the possibility of qualitatively stronger capability without clearly distinguishing the underlying computational regimes. This paper gives a formal separation result in classical computability theory that blocks that move under a precise modeling assumption. For an oracle $A$, let $\mathcal{C}(A)=\{B : B \leq_T A\}$ be the corresponding computational layer. We prove that finite internal self-modification remains inside $\mathcal{C}(A)$, while stabilized revision is governed instead by the jump $A'$ via the relativized limit lemma. Together with a local closure versus escape theorem, this yields a clean formal separation between within-layer iteration and ascent to a stronger relative level. The point is not that stronger layers never arise, but that they are not explained by finite repetition inside one already settled layer. The resulting separation gives a computability-theoretic limit on a broad class of recursive-improvement narratives in which repeated internal updating is treated as sufficient for qualitative capability ascent.
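
For reference, the relativized limit lemma that the abstract invokes is standardly stated as below. This is the textbook form, not a quotation from the paper, and the names $g$, $x$, and $s$ are our notation.

```latex
% Relativized limit lemma (standard statement; the notation g, x, s is ours).
% For sets A, B \subseteq \mathbb{N}:
\[
  B \leq_T A'
  \iff
  \exists\, g \leq_T A,\;
  g : \mathbb{N} \times \mathbb{N} \to \{0,1\},\;
  \forall x\;\; \lim_{s \to \infty} g(x, s) \text{ exists and equals } B(x).
\]
```

Read this way, a set that is only reached as the stabilized limit of $A$-computable guesses lies in $\mathcal{C}(A')$, and need not lie in $\mathcal{C}(A)$.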

2605.26959 2026-05-28 cs.LO cs.CL

MerLean-Prover: A Recursive Looping Harness for Lean 4 Theorem Proving

Jinzheng Li, Zeru Zhu, Yuanjie Ren

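AI summary Proposes MerLean-Prover, an end-to-end Lean 4 theorem prover built on a recursive looping harness in which three agent types (Planning, Check, and Lean) cooperate; without fine-tuning or a custom reinforcement learning objective, it surpasses the strongest published open-source baseline on FormalQualBench and closes all 12 Putnam2025 problems.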

Abstract

MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself, and uses no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, surpassing the strongest published open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness closes 12/12 with substantially lower total wall-clock than the next-best system that closes the full set. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems, and Haiku closes the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.
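
As a small illustration of what replacing a `sorry` declaration with a kernel-checkable proof means, consider the toy Lean 4 example below. It is not taken from the paper or its benchmarks, and the theorem names are hypothetical.

```lean
-- Before: the proof is left open. Lean accepts `sorry` with a warning,
-- but the statement has not actually been proved.
theorem add_comm_todo (a b : Nat) : a + b = b + a := by
  sorry

-- After: the same statement closed by a proof that the kernel checks.
theorem add_comm_done (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A prover of the kind described would presumably be given files in the first form and be expected to return them in the second, with every `sorry` eliminated.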

2605.26902 2026-05-28 cs.IR cs.AI

ICICLE: Expanding Retrieval with In-Context Documents

Yu-Chen Den, Yung-Yu Shih, Zhi Rui Tam, Kuan-Yu Chen, Pu-Jen Cheng, Yun-Nung Chen, Eugene Yang

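AI summary Proposes ICICLE, a framework that performs incremental generative retrieval by generating docids over in-context documents, avoiding retraining and catastrophic forgetting.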

Abstract

Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge. However, this design makes corpus expansion costly: adding new documents requires updating model parameters to encode new document-docid associations, which incurs repeated training and catastrophic forgetting of previously indexed documents. In this work, we revisit incremental GR as an in-context retrieval problem, where newly added documents are supplied as inference-time document-docid evidence. We propose ICICLE, an in-context indexing framework that performs source-aware docid generation over both parametric memory and context-provided document-docid pairs. ICICLE combines a `[COPY]`-based routing mechanism, preference-based calibration, and large context adaptation to distinguish context-grounded retrieval from parametric retrieval. Experiments on MS MARCO and NQ320K show that ICICLE improves retrieval of newly introduced documents while preserving seen-document retention without corpus-specific retraining. Our analysis further shows that high-shot degradation is mainly caused by routing failure, highlighting source-selection calibration as a key bottleneck for scaling in-context generative retrieval.

2605.26391 2026-05-28 cs.GR cs.CV

Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing

Kiyohiro Nakayama, I-Chao Shen, Ruofan Liu, Yiming Wang, Gordon Wetzstein, Takeo Igarashi

AI总结 提出Garment Particles,一种5D点云表示,联合编码2D裁剪图和3D几何,通过Garment Particles Flow框架支持从高级输入生成和多种编辑操作,实现最先进的服装生成效果。

详情
AI中文摘要

实际服装设计跨越两种模式:从高级意图(如参考图像或文本描述)进行直观创建,以及在2D裁剪图和3D悬垂几何之间进行复杂的低级编辑,这需要专业培训才能驾驭其复杂的相互依赖性。然而,现有框架仅解决了这一挑战的一部分,提供了从随意输入生成服装或直接在裁剪图上编辑的功能。为了支持这两种需求,我们提出了Garment Particles,一种5D点云表示,联合编码2D裁剪图和3D几何。这种表示使得Garment Particles Flow(GPF)成为可能,这是一个整流流框架,支持从高级输入(文本、图像、草图)进行直观生成,并通过扩散后验采样对2D裁剪图和3D几何进行各种编辑操作。最后,我们引入了Particles-to-Pattern Flow,将生成的服装粒子转换为基于曲线的裁剪图以进行模拟。我们在多个数据集上验证了模型的生成能力,与竞争基线相比实现了最先进的服装生成结果。我们的模型还支持许多服装编辑场景,包括服装插值、裁剪图编辑、点云和轮廓条件服装生成。我们的项目网站位于 https://garment-particles.github.io。

英文摘要

Practical garment design spans two modes: intuitive creation from high-level intent, such as a reference image or text description, and complex low-level editing across 2D sewing patterns and 3D draped geometry, which requires professional training to navigate their complex interdependencies. Yet existing frameworks address only part of this challenge, offering either garment generation from casual inputs or direct editing on sewing patterns. To support both ends of the spectrum, we propose Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry. This representation enables Garment Particles Flow (GPF), a rectified flow framework that supports intuitive generation from high-level inputs (text, images, sketches) and various editing operations on 2D sewing patterns and 3D geometries via diffusion posterior sampling. Finally, we introduce Particles-to-Pattern Flow, which converts generated garment particles into curve-based patterns for simulation. We validate our model's generation ability on multiple datasets, achieving state-of-the-art garment generation results against competitive baselines. Our model also enables many garment editing scenarios, including garment interpolation, sewing pattern editing, and point-cloud- and silhouette-conditioned garment generation. Our project website is at https://garment-particles.github.io.

2605.26186 2026-05-28 cs.SE cs.AI cs.CL cs.LG

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

SetupX:LLM代理能否从过去的功能正确代码仓库设置失败中学习?

Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu

AI总结 提出SetupX框架,通过经验学习、推测执行和验证协议解决代码仓库设置中的跨仓库经验迁移、多步试错和鲁棒验证问题,在基准测试中达到92%通过率。

Comments 21 pages, 6 figures

详情
AI中文摘要

功能正确的仓库设置旨在配置执行环境(例如,依赖项、构建脚本)以成功执行仓库的文档化功能。由于多样化的、特定于仓库的失败(包括依赖项不兼容、缺少工具链、不完整的安装和验证策略不匹配),这带来了重大挑战。现有的LLM代理难以稳健地解决这些问题,具体来说,它们无法支持(1)跨仓库经验迁移,(2)在不可逆状态变化下的多步试错修复,以及(3)对设置结果的鲁棒验证,以区分设置引起的失败和仓库错误。为了解决这些问题,我们引入了SetupX,一个基于经验学习的设置框架。首先,我们构建了自进化经验表示(XPU),一种双模态知识单元,编码设置信号、文本指导和可执行动作,以动态地将已验证的环境修复迁移到未见过的仓库。其次,我们采用了由LIFO Docker快照栈支持的经验增强推测执行,使代理能够主动尝试修复并安全回滚到已知的良好状态。第三,我们引入了检察官-法官验证协议,将证据收集与最终判断分离,从而实现超越表面构建时度量的更可靠的设置验证。在精心设计的基准测试上的评估结果表明,SetupX达到了最高性能(例如,92%的通过率),并且比最强基线高出19%以上。关键的是,SetupX在需要跨不同容器协调多个互连服务的复杂多仓库设置中表现出色。代码仓库可在https://github.com/OpenDataBox/SetupX获取。

英文摘要

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, and executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully crafted benchmarks show that SetupX achieves the highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setups that require coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
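The trial-and-rollback idea behind a LIFO snapshot stack can be illustrated with a minimal sketch. This is not SetupX's implementation: plain dictionaries stand in for Docker snapshots, and the names `SnapshotStack`, `try_fix`, `rollback`, `add_numpy`, `break_python`, and `is_healthy` are hypothetical, chosen only for illustration.

```python
from copy import deepcopy


class SnapshotStack:
    """LIFO stack of environment states; dicts stand in for Docker snapshots (assumption)."""

    def __init__(self, initial_state):
        self._stack = [deepcopy(initial_state)]

    @property
    def current(self):
        # Return a copy so callers cannot mutate the stored snapshot.
        return deepcopy(self._stack[-1])

    def try_fix(self, fix, verify):
        """Apply `fix` to a copy of the current state; keep it only if `verify` accepts it."""
        candidate = fix(self.current)
        if verify(candidate):
            self._stack.append(deepcopy(candidate))
            return True
        return False  # candidate is discarded; the top of the stack is unchanged

    def rollback(self):
        """Pop the most recent snapshot, never removing the initial one."""
        if len(self._stack) > 1:
            self._stack.pop()
        return self.current


def add_numpy(state):
    state["packages"].append("numpy")
    return state


def break_python(state):
    state["python"] = None
    return state


def is_healthy(state):
    return state["python"] is not None


env = SnapshotStack({"python": "3.8", "packages": []})
accepted = env.try_fix(add_numpy, is_healthy)      # fix is kept
rejected = env.try_fix(break_python, is_healthy)   # fix is discarded
after_trials = env.current
after_rollback = env.rollback()
```

Because every fix is tried on a copy, a failed attempt never touches a known-good state, which is the property the snapshot stack is meant to provide.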

2605.25567 2026-05-28 stat.ML cs.LG

Rao-Blackwellized Score Matching on Manifolds

流形上的 Rao-Blackwellized 得分匹配

Divit Rawal

AI总结 针对潜分布支撑在光滑嵌入流形上的去噪得分匹配,提出通过最近点投影条件期望消除奇异性的 Rao-Blackwellized 方法,并推导出小噪声展开下与内在黎曼得分相差显式 σ² 修正的规范目标。

Comments 22 pages, 3 figures; SPIGM @ ICML 2026

详情
AI中文摘要

我们研究了当潜分布支撑在光滑嵌入流形 $M \subset \mathbb{R}^D$ 上时的去噪得分匹配(DSM)。在环境高斯噪声下,切向去噪目标包含一个奇异的法向纤维噪声通道,其方差在 $\sigma \to 0^+$ 时发散为 $d/\sigma^2$。我们证明,对最近点投影 $\pi(X)$ 取条件可以规范地消除这一奇异性:所得的条件期望是所有仅依赖于投影观测 $\pi(X)$ 的估计量中切向 DSM 目标的唯一 $L^2$ 最优 Rao-Blackwellized 预测器。然后我们计算了这一规范目标的小噪声展开,并证明它等于内在黎曼得分,相差一个显式的 $\sigma^2$ 阶修正,该修正分解为一个内在的 Tweedie 项和一个涉及 Weingarten 和 Ricci 算子的外在曲率项。在平坦情形下,该构造精确退化为普通的低维高斯 DSM,而在 $S^d$ 上,外在修正简化为标量因子 $(1-d/2)\nabla_M \log q$;在 $S^2$ 上,这一外在 $\sigma^2$ 修正恒为零,但内在的 Tweedie 项仍然存在。

英文摘要

We study denoising score matching (DSM) when the latent distribution is supported on a smooth embedded manifold $M \subset \mathbb{R}^D$. Under ambient Gaussian corruption, the tangent denoising target contains a singular normal-fiber noise channel whose variance diverges as $d/\sigma^2$ as $\sigma \to 0^+$. We show that conditioning on the nearest-point projection $\pi(X)$ canonically removes this singularity: the resulting conditional expectation is the unique $L^2$-optimal Rao-Blackwellized predictor of the tangent DSM target among all estimators depending only on the projected observation $\pi(X)$. We then compute the small-noise expansion of this canonical target and show that it equals the intrinsic Riemannian score up to an explicit order-$\sigma^2$ correction that decomposes into an intrinsic Tweedie term and an extrinsic curvature term involving the Weingarten and Ricci operators. In the flat case, the construction reduces exactly to ordinary lower-dimensional Gaussian DSM, while on $S^d$ the extrinsic correction simplifies to the scalar factor $(1-d/2)\nabla_M \log q$; this extrinsic $\sigma^2$ correction cancels identically on $S^2$, though the intrinsic Tweedie term remains.
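To make the geometric ingredients concrete, here is a small sketch for the unit sphere $S^2 \subset \mathbb{R}^3$, where the nearest-point projection of a nonzero point is simply its normalization. It only illustrates the projection and the tangent/normal split of a displacement; it does not implement the paper's Rao-Blackwellized target. The function names are hypothetical.

```python
import math
import random


def nearest_point_on_sphere(x):
    """Nearest-point projection onto the unit sphere: x / ||x|| (defined for x != 0)."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm == 0.0:
        raise ValueError("projection is undefined at the origin")
    return [c / norm for c in x]


def split_tangent_normal(p, v):
    """Split vector v into components tangent and normal to the unit sphere at p."""
    dot = sum(pi * vi for pi, vi in zip(p, v))
    normal = [dot * pi for pi in p]
    tangent = [vi - ni for vi, ni in zip(v, normal)]
    return tangent, normal


def noisy_observation(p, sigma, rng):
    """Ambient Gaussian corruption: add isotropic noise of scale sigma in R^3."""
    return [pi + sigma * rng.gauss(0.0, 1.0) for pi in p]


rng = random.Random(0)
clean = [0.0, 0.0, 1.0]
observed = noisy_observation(clean, 0.1, rng)
projected = nearest_point_on_sphere(observed)
residual = [o - q for o, q in zip(observed, projected)]
tangent_part, normal_part = split_tangent_normal(projected, residual)
```

For the sphere, the residual between an observation and its projection lies entirely along the normal direction, so `tangent_part` is zero up to floating-point error.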

2605.23137 2026-05-28 eess.IV cs.CV

STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding

STAMBRIDGE:用于脑电视觉解码的谱时幅度感知中间特征桥

Jiahe Meng, Weiming Zeng, Yueyang Li, Bo Chai, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang

AI总结 提出STAMBRIDGE两阶段框架,通过谱时幅度感知调制(STAM)提取稳健脑电特征,并利用中间特征语义桥(MFSB)实现稳定的跨模态对齐,在THINGS-EEG基准上取得34.50% Top-1和65.95% Top-5的200路零样本检索准确率。

详情
AI中文摘要

脑电图(EEG)视觉解码由于低信噪比神经信号与高度结构化的视觉-语言空间之间的模态差距而仍然具有挑战性,使得直接的跨模态对齐不稳定。为了解决这个问题,我们提出了STAMBRIDGE,一个通用的两阶段框架,依次处理特征条件和跨模态对齐。首先,我们引入谱时幅度感知调制(STAM)来提取良好条件的EEG表示。通过用幅度导出的软通道权重和多尺度时间卷积替代硬频率掩蔽,STAM明确保留了频率感知的瞬态,同时降低了时域振铃伪影的风险。在这些稳健的神经特征基础上,我们进一步引入了一个模型无关的中间特征语义桥(MFSB),通过定向的跨模态交互构建一个正则化的中间空间,实现分阶段蒸馏和更稳定的语义对齐。在THINGS-EEG基准上的实验显示了具有竞争力的200路零样本检索性能,Top-1准确率为34.50%,Top-5准确率为65.95%。此外,STAMBRIDGE学习的嵌入使用扩散模型产生了语义连贯的图像重建,展示了稳健的EEG到视觉语义对齐。代码可在https://github.com/thabeatmjh/STAMBRIDGE获取。

英文摘要

Electroencephalography (EEG) visual decoding remains challenging due to the modality gap between low-SNR neural signals and highly structured vision-language spaces, making direct cross-modal alignment unstable. To address this, we propose STAMBRIDGE, a versatile two-stage framework that sequentially tackles feature conditioning and cross-modal alignment. First, we introduce Spectral-Temporal Amplitude-aware Modulation (STAM) to extract well-conditioned EEG representations. By replacing hard frequency masking with amplitude-derived soft channel weighting and multi-scale temporal convolutions, STAM explicitly preserves frequency-aware transients while reducing the risk of time-domain ringing artifacts. Building upon these robust neural features, we further introduce a model-agnostic Mid-Feature Semantic Bridge (MFSB) that constructs a regularized intermediate space through directed cross-modal interactions, enabling staged distillation and more stable semantic alignment. Experiments on the THINGS-EEG benchmark show competitive 200-way zero-shot retrieval performance, with 34.50% Top-1 and 65.95% Top-5 accuracy. In addition, embeddings learned by STAMBRIDGE produce semantically coherent image reconstructions with a diffusion model, demonstrating robust EEG-to-vision semantic alignment. The code is available at https://github.com/thabeatmjh/STAMBRIDGE.
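The Top-1 and Top-5 figures above are retrieval accuracies over a candidate set. A minimal sketch of how such a Top-k metric is typically computed is shown below; the convention that the correct candidate for query `i` has index `i`, and the function name `top_k_accuracy`, are assumptions made for illustration rather than details taken from the paper.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose correct candidate is among the k highest-scoring ones.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3-way example (a 200-way evaluation would use a 200-column matrix).
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own candidate first
    [0.2, 0.1, 0.7],  # query 1 ranks its own candidate last
    [0.3, 0.2, 0.5],  # query 2 ranks its own candidate first
]
top1 = top_k_accuracy(toy_similarity, 1)
top3 = top_k_accuracy(toy_similarity, 3)
```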

2605.23066 2026-05-28 cs.DC cs.LG

Orbax: Distributed Checkpointing with JAX

Orbax: 使用JAX进行分布式检查点

Colin Gaffney, Shutong Li, Daniel Ng, Anastasia Petrushkina, Niket Kumar, Adam Cogdell, Mridul Sahu, Yaning Liang, Nikhil Bansal, Justin Pan, Angel Mau, Abhishek Agrawal, Marco Berlot, Ruoxin Sang, Kiranbir Sodhia, Rakesh Iyer

AI总结 本文提出Orbax,一个模块化的JAX原生检查点库,通过抽象分布式加速器系统的复杂性并提供灵活的用户友好检查点操作,在保存和加载性能上分别比PyTorch竞品快3.5倍和2倍。

Comments 18 pages, 5 tables, 6 figures

详情
AI中文摘要

在高性能分布式ML系统的背景下,JAX已成为一个受欢迎的框架。然而,JAX的模块化设计理念使其缺乏标准化的检查点解决方案。在本文中,我们介绍Orbax,一个模块化的、JAX原生的检查点库,它抽象了分布式加速器系统的复杂性,同时在整个ML模型生命周期中为用户友好的检查点操作提供灵活性。我们展示了在保存和加载性能上分别比PyTorch竞品快3.5倍和2倍。该库可在https://github.com/google/orbax获取。

英文摘要

In a landscape of high-performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX's modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a modular, JAX-native checkpointing library that abstracts the complexities of distributed accelerator systems while also providing flexibility for user-friendly checkpoint manipulations throughout the ML model lifecycle. We demonstrate performance exceeding comparable PyTorch competitors by up to 3.5$\times$ for saving and 2$\times$ for loading. The library is available at https://github.com/google/orbax.

2605.17448 2026-05-28 cs.GR cs.CL

Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback

基于有限元分析反馈的自改进CAD生成智能体

Guijin Son, Jehyun Park, Seyeon Park, Sunghee Ahn, Youngjae Yu

AI总结 提出一种以有限元分析为反馈的CAD生成框架,通过蓝图和渲染图监督信号提升多部件装配质量,使生成结果满足工程需求。

Comments Work in progress

详情
AI中文摘要

计算机辅助设计(CAD)是现代工业设计的基石,然而现有的CAD生成器仍无法满足实际工程流程:它们既不像工程师那样迭代,也不评估工程所需。先前的工作将CAD生成视为两个独立的步骤——零件合成和装配,前者通过接近参考标准来评分,而后者(如果处理的话)被简化为一个单独的约束求解步骤。在这项工作中,我们引入了一种更贴近工业的任务形式,要求模型根据自由形式的工程简报生成完全装配的多部件STEP文件,然后通过有限元分析(FEA)进行验证。FEA验证显示,Codex (GPT-5.5) 和 Claude Code (Opus-4.7) 智能体在主要的首次尝试扫描中没有产生任何严格通过的工件,最佳配置平均仅满足约20%的类型化要求。此外,我们引入了两种额外的监督信号:一种新颖的纯文本蓝图模式和一种21视角图像渲染器,以辅助智能体的视觉检查,使生成循环更符合工程师实际迭代的方式。在S2O和Fusion360上,相同的反馈工具改善了几何重建,GPT-5.5/xhigh在S2O上的Box-IoU从0.444提升到0.592,在Fusion360上从0.397提升到0.505。这些信号共同将CAD程序推向不仅视觉上合理,而且经过物理和结构要求检查的工件。

英文摘要

Computer-aided design (CAD) is the backbone of modern industrial design, yet learned CAD generators still fall short of real engineering pipelines: they neither iterate like engineers nor evaluate what engineering requires. Prior work has treated CAD generation as two disjoint steps, part synthesis and assembly, where the former is graded by proximity to a gold reference and the latter, when handled at all, is reduced to a separate constraint solving step. In this work, we introduce a more industry-native task formulation that requires a model to produce a fully assembled multi-part STEP file from a free-form engineering brief, which is then validated via finite element analysis (FEA). FEA validation reveals that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents do not produce a single strict-passing artifact in the main first-attempt sweep, with the best configuration meeting only about 20% of typed requirements on average. Moreover, we introduce two additional supervision signals, a novel text-only blueprint schema and a 21-view image renderer that aids the agent's visual inspection, that better align the generation loop with how engineers iterate in practice. On S2O and Fusion360, the same feedback tools improve geometric reconstruction, with GPT-5.5/xhigh rising from 0.444 to 0.592 Box-IoU on S2O and from 0.397 to 0.505 on Fusion360. Together these signals move CAD programs toward artifacts that are not only visually plausible but also checked against physical and structural requirements.
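The abstract reports Box-IoU without defining it. Assuming it denotes the intersection-over-union of axis-aligned 3D bounding boxes (an assumption, not something the abstract confirms), the metric could be sketched as follows; `box_iou_3d` and the box encoding are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
    volume = 1.0
    for lo, hi in zip(box[0], box[1]):
        volume *= max(hi - lo, 0.0)
    return volume


def box_iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # the boxes do not overlap along this axis
        intersection *= overlap
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


reference = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
shifted = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
iou_shifted = box_iou_3d(reference, shifted)  # intersection 4, union 12
```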

2605.13931 2026-05-28 eess.AS cs.SD

FSD50K-Solo: Automated Curation of Single-Source Sound Events

FSD50K-Solo:单源声音事件的自动策展

Ningyuan Yang, Sile Yin, Li-Chia Yang, Bryce Irvin, Xiao Quan, Marko Stamenovic, Shuo Zhang

AI总结 提出一种基于生成扩散模型和预训练音频编码器的数据策展框架,自动识别并过滤多源样本,从FSD50K中构建单源声音事件数据集FSD50K-Solo。

Comments Accepted to EUSIPCO 2026. 5 pages, 3 figures

详情
AI中文摘要

高质量的训练数据集对于神经网络的性能至关重要。然而,音频领域仍然缺乏大规模、强标注的单源声音事件数据集。FSD50K数据集虽然相对较大且开放,但包含相当比例的多源样本,其中背景干扰或重叠事件可能限制数据的实用性。为了解决这一挑战,我们引入了一个为大规模开放音频语料库设计的数据策展框架。我们的方法利用生成扩散模型合成干净的单一类别事件,以构建受控的噪声混合用于监督。随后,我们采用预训练音频编码器结合判别分类器自动识别并过滤多源样本。实验表明,我们的框架在由人类专家策展的测试集上取得了强劲的性能。最后,我们发布了FSD50K-Solo,这是FSD50K的一个由模型策展的子集,包含由我们的方法识别的单源音频样本。除了FSD50K,我们的方法为策展开源音频语料库建立了一个可扩展的范式。

英文摘要

High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly labeled, single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable fraction of multi-source samples where background interference or overlapping events could limit the usefulness of the data. To address this challenge, we introduce a data curation framework designed for large-scale open audio corpora. Our approach leverages a generative diffusion model to synthesize clean single-class events to construct controlled noisy mixtures for supervision. We subsequently employ a pre-trained audio encoder coupled with a discriminative classifier to automatically identify and filter out multi-source samples. Experiments show that our framework achieves strong performance on a human expert-curated test set. Finally, we release FSD50K-Solo, a model-curated subset of FSD50K containing single-source audio samples identified by our method. Beyond FSD50K, our method establishes a scalable paradigm for curating open-source audio corpora.

2605.16293 2026-05-28 cs.CY cs.AI

From Prediction to Intervention: The Evolution of AI in Biomedicine

从预测到干预:人工智能在生物医学中的演进

Andrew Feinberg, Aleksandr Sarachakov, Viktor Svekolkin, Alexander Bagaev, Ferran Prat, Michael Feinberg

AI总结 本文提出生物医学AI正从基于历史数据的预测范式转向能够模拟干预效果的干预智能,通过定义疾病级模型实现从推理到仿真的转变,以支持新型疗法和未观测干预下的决策。

Comments 10 pages, 3 figures, 1 table. Figures were replaced with better versions

详情
AI中文摘要

人工智能通过大规模多模态数据集成在生物医学领域取得了快速进展,实现了对临床结果和患者分层的日益精确的预测。然而,这些系统本质上仍然是观察性的:它们从历史数据中学习统计关联,并在先前观察到的生物学和临床状态内运行,限制了它们推广到新型疗法或未观测干预的能力。我们认为,生物医学AI正在经历结构性转变。随着生物医学决策越来越依赖于对干预的推理而非对过去观察的外推,预测架构在结构上变得不足。从历史数据中学习的系统,在构造上无法表示生物系统在扰动下如何演化,因此无法可靠地支持存在新型干预时的决策。我们引入了一个概念框架,区分观察性智能和干预性智能,并将疾病级模型定义为明确表示生物过程的状态、动态和干预响应的系统。这些模型实现了从推理到仿真的转变——推理在干预下会发生什么,而非基于过去可能发生什么。这一转变也意味着价值创造点的转移:从数据处理和预测转向支持和定义干预下决策的系统。这直接源于生物医学决策的结构,并定义了AI在医学中的下一阶段。无法建模干预的系统将在结构上被排除在决策之外。

英文摘要

Artificial intelligence has advanced rapidly in biomedicine through large-scale multimodal data integration, enabling increasingly accurate prediction of clinical outcomes and patient stratification. These systems, however, remain fundamentally observational: they learn statistical associations from historical data and operate within previously observed biological and clinical states, limiting their ability to generalize to novel therapies or unobserved interventions. We argue that AI in biomedicine is undergoing a structural transition. As biomedical decision-making increasingly depends on reasoning about intervention rather than extrapolation from past observations, predictive architectures become structurally insufficient. Systems that learn from historical data cannot, by construction, represent how biological systems evolve under perturbation, and therefore cannot reliably support decision-making in the presence of novel interventions. We introduce a conceptual framework distinguishing observational and interventional intelligence and define disease-level models as systems that explicitly represent the state, dynamics, and intervention response of biological processes. These models enable a shift from inference to simulation -- reasoning about what will happen under intervention rather than what is likely based on the past. This transition also implies a shift in where value is created: from data processing and prediction toward systems that support and define decision-making under intervention. It follows directly from the structure of biomedical decision-making and defines the next stage of AI in medicine. Systems that cannot model intervention will be structurally excluded from decision-making.

2605.13278 2026-05-28 math.OC cs.LG

Proximal-Based Generative Modeling for Bayesian Inverse Problems

基于近端算子的贝叶斯逆问题生成建模

Boyang Zhang, Zhiguo Wang, Ya-Feng Liu

AI总结 针对扩散模型在逆问题中似然分数难以计算的问题,提出基于近端算子的生成建模框架,利用Moreau-Yosida正则化与高斯卷积的理论等价性,通过Moreau分数匹配学习近端算子,实现无需显式似然评估的采样,理论上去除早期停止偏差并达到非渐近收敛,实验在重建质量和采样时间上超越现有方法。

详情
AI中文摘要

基于分数的扩散模型在生成任务中表现出卓越性能,但由于时间相关似然分数的解析难解性,在逆问题中遇到根本性瓶颈。为弥补这一差距,我们提出一种新颖的基于近端算子的生成建模(PGM)框架,严格规避了显式似然评估。我们的框架建立在扩散过程中的高斯卷积与非光滑优化中的Moreau-Yosida正则化之间的理论等价性之上。这使得一种由所提出的Moreau分数驱动的新采样机制成为可能,该分数通过近端算子具有闭式表达式。此外,我们引入Moreau分数匹配来学习仅依赖于从先验分布中抽取的样本的近端算子。理论上,PGM消除了基于分数的扩散模型固有的早期停止偏差,并实现了非渐近收敛。实验表明,PGM在重建质量和采样时间上显著超越了最先进的方法。

英文摘要

Score-based diffusion models demonstrate superior performance in generative tasks but encounter fundamental bottlenecks in inverse problems due to the analytical intractability of the time-dependent likelihood score. To bridge this gap, we propose a novel proximal-based generative modeling (PGM) framework that rigorously circumvents explicit likelihood evaluation. Our framework is built upon a theoretical equivalence between Gaussian convolution in diffusion processes and Moreau-Yosida regularization in nonsmooth optimization. This enables a new sampling mechanism driven by the proposed Moreau score, which admits a closed-form expression via proximal operators. Moreover, we introduce Moreau score matching to learn the proximal operators that rely solely on samples drawn from the prior distribution. Theoretically, PGM eliminates the early-stopping bias inherent in the score-based diffusion model and achieves non-asymptotic convergence. Experiments demonstrate that PGM significantly surpasses state-of-the-art methods in reconstruction quality and sampling time.
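The two standard objects the framework builds on, the proximal operator and the Moreau-Yosida envelope, can be illustrated on the scalar function $f(u) = |u|$, where both have well-known closed forms. This sketch shows only these textbook definitions and the identity $\nabla f_\lambda(x) = (x - \mathrm{prox}_{\lambda f}(x))/\lambda$; it does not reproduce the paper's Moreau score or its training objective. Function names are hypothetical.

```python
def prox_abs(x, lam):
    """Proximal operator of f(u) = |u| with parameter lam > 0 (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_envelope_abs(x, lam):
    """Moreau-Yosida envelope: min_u |u| + (x - u)^2 / (2 * lam), evaluated at the prox."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def moreau_gradient_abs(x, lam):
    """Gradient of the envelope via the identity (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


def numeric_gradient(x, lam, h=1e-6):
    """Central finite difference of the envelope, used to check the identity."""
    return (moreau_envelope_abs(x + h, lam) - moreau_envelope_abs(x - h, lam)) / (2.0 * h)
```

The envelope of $|u|$ is the Huber function: quadratic near zero and linear far from it, which is why its gradient is well defined everywhere even though $|u|$ is not differentiable at the origin.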

2605.12015 2026-05-28 cs.CR cs.AI cs.CL cs.LG cs.MA

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

SkillSafetyBench:在技能面攻击表面下评估智能体安全性

Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu

AI总结 提出SkillSafetyBench基准,通过155个对抗案例评估大语言模型智能体在技能、本地工件和执行环境文件等非用户攻击下的安全失败模式。

详情
AI中文摘要

可复用技能正成为扩展大语言模型智能体的常见接口,它将程序性指导与对文件、工具、内存和执行环境的访问打包在一起。然而,这种模块化引入了现有安全评估大多忽略的攻击面:即使用户请求是良性的,不安全的影响可能存在于技能指导、本地工件或执行环境文件中,这些会引导智能体采取不安全行为。我们提出了SkillSafetyBench,一个可运行的基准,用于评估此类技能中介的安全失败。SkillSafetyBench包含跨47个任务、6个风险领域和30个安全类别的155个对抗案例,每个案例都使用特定于案例的基于规则的验证器进行评估。使用多个CLI智能体和模型后端的实验表明,非用户攻击可以一致地诱导不安全行为,在不同领域、攻击方法和脚手架-模型配对中表现出不同的失败模式。我们的发现表明,智能体安全性不仅取决于模型级别的对齐,还取决于智能体如何解释技能、信任工作流上下文以及通过可执行环境采取行动。

英文摘要

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, unsafe influence may reside in skill guidance, local artifacts, or execution-environment files that steer the agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.

2605.11325 2026-05-28 cs.IR cs.AI

Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval

结构化信念状态与首个面向LLM记忆检索的精度感知基准

Jeffrey Flynt

AI总结 针对现有LLM记忆系统基准仅评估答案质量而忽略检索精度的问题,提出独立于生成模型的检索精度基准PrecisionMemBench和结构化信念存储系统Tenure,实现89/89测试案例通过且平均精度1.0。

Comments v2 evaluates three production memory systems, adding evidence to make the claim falsifiable and the benchmark reusable

详情
AI中文摘要

每个主要的LLM记忆系统基准(尤其是LoCoMo)都只衡量模型是否回答正确,而非记忆系统是否检索正确。一个返回其整个信念存储的系统能达到1.0的召回率并通过答案质量评估。这是单元测试与集成测试的区别:检索质量必须独立于其馈入的生成模型进行测量,而现有基准均未做到这一点。 我们证明,即使实体提取完全忠实,这种失败仍然存在。记忆基线在引用自身提取的案例上平均检索精度仅为0.05到0.08。这种失败是结构性的:在领域特定语料库上,余弦相似度无法区分相关信念与语义相近的信念,这一不变性在20倍范围的嵌入模型规模上得到确认。多轮评估揭示了累积性失败;在话题漂移后,对比系统允许语义质量在轮次间渗漏,导致重入时的高漂移分数。单轮指标掩盖了这一代价:Hindsight报告单轮延迟低于700ms,但每会话轮次平均延迟超过2700ms,p95超过6000ms。在LLM-as-a-Judge评估下,这些失败仍然不可见。 我们提出两项贡献:PrecisionMemBench,一个包含89个案例的基准,独立于生成模型测量检索精度,涵盖多样化的范围、变异和隔离断言;以及Tenure,一个本地优先的结构化信念存储,使用多路径BM25、分析器不对称、差分提升和硬范围隔离。Tenure通过了89/89案例,平均精度1.0,检索延迟低于15ms。对比提供商的表现比它们所基于的原始向量基线更差,零主动检索通过,摄取成本为98到897秒,这些失败是答案质量基准无法检测的。

英文摘要

Every major benchmark for LLM memory systems, LoCoMo foremost, measures whether a model answered correctly, not whether the memory system retrieved correctly. A system returning its entire belief store achieves recall of 1.0 and passes answer-quality evaluation. This is the difference between a unit test and an integration test: retrieval quality must be measured in isolation from the generative model it feeds into, and no existing benchmark does this. We demonstrate that this failure persists even when entity extraction is entirely faithful. Memory baselines achieve mean retrieval precision of just 0.05 to 0.08 on cases referencing their own extractions. The failure is structural: cosine similarity over a domain-specific corpus cannot discriminate relevant beliefs from semantically proximate ones, an invariance confirmed across a 20x range in embedding model scale. Multi-turn evaluation surfaces a compounding failure; after topic drift, comparison systems allow semantic mass to bleed across turns, yielding high drift scores on re-entry. Single-turn metrics conceal this cost: Hindsight reports sub-700ms single-turn latency but exceeds 2,700ms mean per session turn, with p95 above 6,000ms. Under LLM-as-a-Judge evaluation, these failures remain invisible. We present two contributions: PrecisionMemBench, an 89-case benchmark measuring retrieval precision independently of generative models across diverse scope, mutation, and isolation assertions; and Tenure, a local-first structured belief store using multi-path BM25 with analyzer asymmetry, differential boosting, and hard scope isolation. Tenure passes 89/89 cases with mean precision 1.0 and sub-15ms retrieval latency. Comparison providers perform worse than the raw vector baseline they are built on, with zero active retrieval passes and ingestion costs of 98 to 897 seconds, failures that answer-quality benchmarks cannot detect.
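The point that returning the entire belief store yields perfect recall but poor precision follows directly from the definitions of the two metrics. A minimal sketch, using a made-up store of 40 beliefs with a single relevant one (the store size and identifiers are illustrative, not benchmark data):

```python
def precision_recall(retrieved, relevant):
    """Set-based retrieval precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


belief_store = [f"belief-{i}" for i in range(40)]  # hypothetical store
relevant = ["belief-7"]                            # one belief answers the query

return_everything = precision_recall(belief_store, relevant)
return_exact = precision_recall(["belief-7"], relevant)
```

Here `return_everything` has recall 1.0 but precision 1/40, while `return_exact` scores 1.0 on both, which is the gap an answer-quality benchmark alone would not expose.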

2605.11154 2026-05-28 astro-ph.IM cs.AI cs.LG

Quantifying the Reconstructability of Astrophysical Methods with Large Language Models and Information Theory: A Case Study in Spectral Reconstruction

利用大语言模型和信息论量化天体物理方法的可重建性:光谱重建的案例研究

Hsing Wen Lin, Zong-Fu Sie

AI总结 提出信息论框架,通过大语言模型生成的概率分布和香农熵、JS散度,量化文本描述对算法重建的约束力,以海王星外天体光谱重建为例,发现文本虽能明确算法结构但无法消除实现级方差,存在“熵下限”,且LLM无法推断隐性专家知识。

Comments 26 pages, 6 figures, Accepted for publication in PASP

详情
AI中文摘要

现代天体物理研究严重依赖复杂的数据分析流程;然而,已发表的描述往往缺乏计算可重复性所需的细节。在这项工作中,我们提出了一个信息论框架,用于量化方法从其书面描述中重建的有效性。通过将算法重建视为大语言模型(LLM)生成的概率分布,我们利用香农熵和詹森-香农散度来衡量文本对有效实现假设空间的约束程度。我们通过对稀疏测光数据中的海王星外天体(TNO)光谱重建的案例研究来展示这种方法。通过向前沿LLM提供不同级别的稿件文本(标题、摘要和方法),我们发现虽然增加文本成功澄清了整体算法结构,但未能消除实现层面的方差。这种持续存在的方差建立了一个“熵下限”,表明多个不同的实现与明确指令保持一致。为了评估实际可重复性,我们将这些重建的算法转换为可执行的流程。我们的结果表明,虽然LLM容易恢复核心功能方法,但它们系统性地无法推断严格科学校准所需的隐性专家知识。这项初步研究表明,LLM可以作为一种零样本诊断工具来审计方法透明度,帮助作者识别缺失的结构约束,并在自动化研究时代维护科学完整性。

英文摘要

Modern astrophysical studies rely heavily on complex data analysis pipelines; however, published descriptions often lack the detail required for computational reproducibility. In this work, we present an information-theoretic framework to quantify how effectively a method can be reconstructed from its written description. By treating algorithmic reconstruction as a probability distribution generated by Large Language Models (LLMs), we utilize Shannon entropy and Jensen-Shannon divergence to measure how strongly text constrains the hypothesis space of valid implementations. We demonstrate this approach through a case study of Trans-Neptunian Object (TNO) spectral reconstruction from sparse photometry. By prompting frontier LLMs with varying levels of manuscript text (Title, Abstract, and Methods), we find that while increasing text successfully clarifies the overall algorithmic structure, it fails to eliminate variance at the implementation level. This persistent variance establishes an "entropy floor," demonstrating that multiple divergent implementations remain consistent with explicit instructions. To evaluate practical reproducibility, we convert these reconstructed algorithms into executable pipelines. Our results reveal that, while LLMs easily recover core functional methodologies, they systematically fail to infer the tacit expert knowledge required for strict scientific calibration. This pilot study demonstrates that LLMs can be repurposed as a zero-shot diagnostic tool to audit methodological transparency, helping authors identify missing structural constraints and preserve scientific integrity in an era of automated research.
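Shannon entropy and Jensen-Shannon divergence are the two quantities used here to measure how tightly text constrains the space of implementations. The sketch below computes both for discrete distributions; treating LLM outputs as categorical distributions over implementation choices is how the abstract frames the problem, but the example distributions are invented for illustration.

```python
import math


def shannon_entropy(p):
    """Entropy in bits of a discrete distribution p (zero-probability terms contribute 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical distributions over three candidate implementation choices.
title_only = [1 / 3, 1 / 3, 1 / 3]    # text constrains nothing
with_methods = [0.8, 0.1, 0.1]        # text favours one choice but leaves residual spread

entropy_drop = shannon_entropy(title_only) - shannon_entropy(with_methods)
shift = js_divergence(title_only, with_methods)
```

A positive residual entropy for `with_methods` is the kind of quantity the abstract calls an "entropy floor": added text narrows the distribution without collapsing it to a single implementation.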

2602.07999 2026-05-28 cs.IT cs.LG math.IT

Tighter Information-Theoretic Generalization Bounds via a Novel Class of Change of Measure Inequalities

通过一类新的测度变换不等式实现更紧的信息论泛化界

Yanxiao Liu, Yijun Fan, Deniz Gündüz

AI总结 本文提出基于数据处理不等式的一类新测度变换不等式,涵盖f-散度、Rényi散度和α-互信息,并应用于泛化误差、PAC-Bayes、差分隐私和数据记忆化,得到更紧的保证。

Comments Fixed a mistake in the proof of PAC-Bayesian bound

详情
AI中文摘要

测度变换不等式将概率测度之间的散度转化为事件概率的显式界,在机器学习、信息论和统计学中推导概率保证时发挥重要作用。我们通过基于数据处理不等式的统一框架提出新颖的测度变换不等式,该框架出乎意料地基础却足够强大,能够产生新颖、更紧的不等式。我们给出了以广泛信息测度族表示的测度变换不等式,包括f-散度(以Kullback-Leibler散度和χ²-散度为特例)、Rényi散度和α-互信息(以最大泄漏为特例)。我们将这些结果应用于泛化误差分析、PAC-Bayesian理论、差分隐私和数据记忆化,通过简化分析获得了更强的保证,同时恢复了已知的最佳结果。

英文摘要

Change of measure inequalities translate divergences between probability measures into explicit bounds on event probabilities, and play an important role in deriving probabilistic guarantees in learning theory, information theory, and statistics. We propose novel change of measure inequalities via a unified framework based on the data processing inequality, which is surprisingly elementary yet powerful enough to yield novel, tighter inequalities. We provide change of measure inequalities in terms of a broad family of information measures, including $f$-divergences (with Kullback-Leibler divergence and $\chi^2$-divergence as special cases), Rényi divergence, and $\alpha$-mutual information (with maximal leakage as a special case). We apply these results to generalization error analysis, PAC-Bayesian theory, differential privacy, and data memorization, obtaining stronger guarantees while recovering best-known results through simplified analyses.
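As a worked illustration of how the data processing inequality turns a divergence into a bound on an event probability, consider the classical KL case. This is a textbook derivation included for orientation, not one of the paper's new inequalities. Applying the data processing inequality to the indicator map $x \mapsto \mathbf{1}\{x \in E\}$, and then using that the binary entropy is at most $\log 2$, gives:

```latex
% Classical illustration (assumption: 0 < Q(E) < 1); d(p || q) is the binary KL divergence.
\begin{align*}
D_{\mathrm{KL}}(P \,\|\, Q)
  &\ge d\bigl(P(E) \,\|\, Q(E)\bigr) \\
  &= P(E)\log\frac{P(E)}{Q(E)} + \bigl(1-P(E)\bigr)\log\frac{1-P(E)}{1-Q(E)} \\
  &\ge P(E)\log\frac{1}{Q(E)} - \log 2 ,
\end{align*}
% Rearranging yields an explicit bound on the event probability under P:
\[
P(E) \;\le\; \frac{D_{\mathrm{KL}}(P \,\|\, Q) + \log 2}{\log\bigl(1/Q(E)\bigr)} .
\]
```

An event that is rare under $Q$ therefore stays rare under $P$ whenever the divergence between the two measures is small.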

2602.02561 2026-05-28 cs.LO cs.AI cs.LG

MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics

MathlibLemma: 形式化数学中的民间引理生成与基准测试

Xinyu Liu, Zixuan Xie, Amir Moeini, Claire Chen, Shuze Daniel Liu, Yu Meng, Aidong Zhang, Shangtong Zhang

AI总结 提出基于LLM的模块化流水线MathlibLemma,自动挖掘、形式化并证明数学中缺失的民间引理,生成包含4028个类型检查的Lean语句的基准测试集。

详情
AI中文摘要

尽管Lean和Mathlib生态系统在大语言模型(LLM)的帮助下在形式化数学推理方面取得了显著成功,但Mathlib中缺乏许多民间引理仍然是一个持续存在的障碍,限制了Lean作为像LaTeX或Maple那样的日常工具对数学家的可用性。为了解决这个问题,我们引入了MathlibLemma,一个基于LLM的模块化流水线,用于自动进行民间引理挖掘:发现、形式化并证明数学家通常认为理所当然但形式化库中并不总是存在的可重用中间事实。其核心是,MathlibLemma主动挖掘数学中缺失的连接组织。该流水线生成一个经过验证的民间风格引理库,包括1506个通过证明绕过筛选的Lean检查证明;一个精心策划的小型试点子集也已合并到Mathlib中,提供了外部证据表明选定的输出可以满足专家库标准。利用这一流水线,我们进一步构建了MathlibLemma基准测试集,包含4028个跨越广泛数学领域的非平凡类型检查的Lean语句。通过将LLM的角色从被动消费者转变为主动贡献者,这项工作朝着AI辅助扩展形式化数学库迈出了一步。

英文摘要

While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean's usability as an everyday tool for mathematicians, in the way that LaTeX or Maple are. To address this, we introduce MathlibLemma, a modular LLM-based pipeline for automated folklore-lemma mining: the discovery, formalization, and proving of reusable intermediate facts that mathematicians often take for granted but that are not always present in formal libraries. At its core, MathlibLemma proactively mines the missing connective tissue of mathematics. The pipeline produces a verified library of folklore-style lemmas, including 1,506 Lean-checked proofs that pass a proof-bypass screen; a small curated pilot subset has also been merged into Mathlib, providing external evidence that selected outputs can meet expert library standards. Leveraging this pipeline, we further construct the MathlibLemma benchmark, a suite of 4,028 non-trivial type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work takes a step toward AI-assisted expansion of formal mathematical libraries.

2605.09986 2026-05-28 stat.ML cs.CL cs.LG

Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage

带宽预算下的联邦语言模型:蒸馏率与共形覆盖

Prasanjit Dubey, Xiaoming Huo

AI总结 本文研究带宽受限节点间分布式语言模型的统计保证,提出联邦探针-对数蒸馏(FPLD)和联邦共形RAG(FC-RAG)两种协议,分别给出训练时的KL一致性率和推理时的无分布边际覆盖界,首次将带宽作为一阶统计参数。

详情
AI中文摘要

在临床网络、企业知识库和科学联盟中,经常出现数据分散在带宽受限节点上且无法集中的场景,需要训练语言模型。我们研究数据必须保持分布式在节点上的情况,并询问在明确带宽预算下原则上可以实现哪些统计保证;我们的目标是描述可证明的可能性,而不是展示一个可部署的系统。现有理论要么单独处理训练时的一致性,要么单独处理推理时的校准,且没有先前的工作将带宽作为一阶统计参数。我们分析了两种协议:用于训练的联邦探针-对数蒸馏(FPLD)和用于推理的联邦共形RAG(FC-RAG),作为我们结果的分析载体。我们的第一个主要结果是FPLD的显式高概率KL一致性率,同时依赖于节点数$K$、每节点样本量$n$、量化预算$B$、探针集大小$m$和词汇量$V$;带宽仅通过指数衰减的量化项进入。我们的第二个主要结果是FC-RAG的无分布边际覆盖界,其新颖的检索带宽松弛量$\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$使每节点检索带宽成为一阶统计参数,在每节点均匀情况下,通过$K$个节点的算术聚合使松弛量以$K^{-1/2}$的速度缩小。一个Pinsker型推论将两个界组合成端到端的覆盖保证。合成实验验证了沿界参数的预测缩放;在GPT-2测试平台上的小规模实验表明,定性带宽-准确率权衡在真实语言模型上仍然存在。部署规模的实证评估超出范围。

英文摘要

Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and no prior work makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
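The $K^{-1/2}$ scaling of the retrieval-bandwidth slack can be checked by direct arithmetic on the stated formula. In the sketch below the per-node terms $v(B_i)$ are passed in as plain numbers, because the abstract does not define the function $v$; the values used are placeholders, and `retrieval_slack` is a hypothetical name.

```python
import math


def retrieval_slack(f_max, v_values):
    """Evaluate f_max * sqrt(K^{-2} * sum_i v(B_i)) for K = len(v_values).

    v_values holds the per-node terms v(B_i); their functional form is not
    given in the abstract, so they are supplied directly (assumption).
    """
    k = len(v_values)
    return f_max * math.sqrt(sum(v_values) / k**2)


# Per-node-uniform regime: every node contributes the same v, so the slack
# equals f_max * sqrt(v / K) and halves when K is quadrupled.
slack_4_nodes = retrieval_slack(1.0, [4.0] * 4)     # sqrt(16 / 16)
slack_16_nodes = retrieval_slack(1.0, [4.0] * 16)   # sqrt(64 / 256)
```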