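arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.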

2605.23087 2026-05-25 cs.LG

The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis

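AI summary: This study examines how the implicit bias of gradient descent in deep neural networks affects neural collapse (NC). Analyzing the unregularized deep unconstrained feature model (UFM), it finds that depth alone induces an implicit low-rank bias, steering the network toward low-rank feature representations that correspond to softmax-code solutions. It also shows how depth affects the training dynamics and NC's basin of attraction, and notes that increasing network width can push training toward higher-rank solutions, offering a new theoretical perspective on implicit bias in deep models.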

Comments 46 pages, 11 figures, accepted at the International Conference on Machine Learning 2026

Abstract

Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit low-rank bias from L2 regularization. We study the deep unconstrained feature model (UFM), equivalent to a deep linear network with orthogonal inputs, trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC's basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.
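Illustration (not from the paper): the claim that low-rank matrices propagate norm more efficiently can be checked on a toy case. The sketch below assumes a diagonal layer matrix W with unit Frobenius norm whose `rank` nonzero singular values are all equal, applied `depth` times. The function name and the equal-singular-value setup are assumptions made for illustration only.

```python
import math

def product_frobenius_norm(rank: int, depth: int) -> float:
    """Frobenius norm of W**depth for a diagonal W with `rank` equal
    nonzero singular values and unit Frobenius norm (toy assumption)."""
    sigma = 1.0 / math.sqrt(rank)  # each nonzero singular value of W
    # W**depth is diagonal with `rank` entries equal to sigma**depth.
    return math.sqrt(rank * sigma ** (2 * depth))

# Same per-layer norm budget, different ranks: compare what survives the product.
for rank in (1, 2, 4):
    norms = [round(product_frobenius_norm(rank, d), 4) for d in (1, 2, 3)]
    print(f"rank={rank}: norms at depth 1, 2, 3 = {norms}")
```

In this toy setting the product norm equals rank^((1 - depth) / 2), so a rank-1 matrix keeps its full norm at any depth while higher ranks decay geometrically with depth.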

2605.23081 2026-05-25 cs.LG

The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

Binqi Shen, Lier Jin, Hanyu Cai, Lan Hu, Yuting Xin

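AI summary: As large language models rely more heavily on long-context processing, expanding the context window brings substantial computational and financial costs. This paper proposes The Efficiency Frontier, a unified framework for cost-performance optimization in context management. By jointly accounting for task performance, token cost, and preprocessing reuse, it models context strategy selection as a deployment-aware optimization problem. The framework shows where retrieval-based and preprocessing-based strategies each apply under different operating conditions, and the experiments show marked reductions in token usage and cost.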

Abstract

Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory compression methods, are typically evaluated using performance and efficiency metrics independently, limiting systematic comparison and deployment-aware decision-making. This paper introduces The Efficiency Frontier, a unified framework for cost-performance optimization in LLM context management. The framework models context strategy selection as a deployment-aware optimization problem that jointly accounts for task performance, token cost, and preprocessing reuse through amortized cost modeling. Unlike existing evaluations that compare methods in isolation, the proposed framework enables decision-oriented analysis of when different context management strategies become preferable under varying operational conditions. Evaluated on 5,000 HotpotQA instances, the framework reveals distinct operational regimes and transition boundaries between retrieval-based and preprocessing-based strategies. Results show that deployment-aware optimization reduces effective token usage by approximately 25% at comparable performance ($F1 \approx 0.78$), while amortized memory compression achieves over 50% lower token cost relative to full-context prompting in higher-performance settings. Overall, the proposed framework provides a principled and practical foundation for evaluating and deploying scalable, efficient, and sustainable LLM systems.
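Illustration (not from the paper): the abstract does not give its cost model, so the sketch below assumes the simplest form of amortization, where a one-off preprocessing cost is shared evenly across the queries that reuse it. All function names and token counts are hypothetical.

```python
import math

def amortized_cost(preprocess_tokens: float, per_query_tokens: float, reuses: int) -> float:
    """Effective tokens per query when a one-off preprocessing cost
    is shared by `reuses` queries (assumed cost model)."""
    return preprocess_tokens / reuses + per_query_tokens

def break_even_reuses(preprocess_tokens: float, per_query_plain: float, per_query_preprocessed: float):
    """Smallest reuse count at which the preprocessed strategy costs no more
    per query than the plain one; None if it never pays off."""
    saving = per_query_plain - per_query_preprocessed
    if saving <= 0:
        return None
    return math.ceil(preprocess_tokens / saving)

# Hypothetical numbers: full-context prompting uses 4000 tokens per query;
# memory compression costs 20000 tokens once, then 1500 tokens per query.
n_star = break_even_reuses(20000, 4000, 1500)
print("break-even reuses:", n_star)
print("amortized cost at 20 reuses:", amortized_cost(20000, 1500, 20))
```

Under this assumed model the preferred strategy flips once the context is reused often enough, which is the kind of transition boundary between operating regimes the abstract refers to.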

2605.23070 2026-05-25 cs.CV

Flow Mismatching: Unsupervised Anomaly Detection via Velocity Discrepancies in Flow Matching Models

Shengzhe Chen, Mehrdad Moradi, Kamran Paynabar, Hao Yan

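AI summary: This paper proposes Flow Mismatching, an unsupervised anomaly detection method that avoids the reconstruction-based paradigm and instead detects anomalies from velocity discrepancies in flow matching models. Along affine paths from Gaussian noise to the target image, it measures the disagreement between the model-predicted velocity and the geometric path velocity to localize anomalous regions. Experiments show that it outperforms state-of-the-art reconstruction-based and flow matching-based methods on multiple benchmark datasets.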

Abstract

We propose Flow Mismatching, an unsupervised anomaly detection method that deliberately avoids reconstruction-based paradigms. Instead, we treat flow matching as geometric dynamics and leverage a key insight: anomalies occur at places where the learned normal flow disagrees with the geometric path toward a test image. Given a flow matching model trained only on normal images, we probe its learned velocity field along affine paths from Gaussian noise to a target image. Along each path, we compare the model-predicted velocity, which follows normal generative dynamics, with the geometric velocity toward the target, which includes any anomalous content. Anomalies induce strong local disagreement between these velocities. Aggregating the mismatch over different time steps and multiple paths yields pixel-wise heatmaps and image-level scores without test-time optimization, feature memories, or additional calibration. Our analysis shows that the population mismatch decomposes into an irreducible denoising term and a Fisher-divergence term between the test-path and normal-path score functions, which identifies the score-gap component that drives anomaly separation and explains the effectiveness of robust path aggregation. Extensive experiments on MVTec-AD and VisA demonstrate superior performance compared with SOTA reconstruction-based and recent flow matching-based approaches.
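Illustration (not the authors' implementation): a toy version of the velocity comparison, under stated assumptions. The path is x_t = (1 - t) * z + t * x with t = 0 at the noise z and t = 1 at the test image x, so the geometric velocity is x - z. The "normal" data is assumed to be a single template `normal_mean`, for which the ideal flow-matching velocity has the closed form (mu - x_t) / (1 - t). That closed form stands in for a trained network. Function and parameter names are hypothetical.

```python
import random

def mismatch_heatmap(image, normal_mean, n_paths=8, times=(0.2, 0.5, 0.8), seed=0):
    """Mean squared velocity mismatch per pixel, aggregated over time steps
    and noise paths. `image` and `normal_mean` are flat lists of pixel values."""
    rng = random.Random(seed)
    heat = [0.0] * len(image)
    for _ in range(n_paths):
        noise = [rng.gauss(0.0, 1.0) for _ in image]  # one Gaussian start point
        for t in times:
            for i, (x, z, mu) in enumerate(zip(image, noise, normal_mean)):
                x_t = (1.0 - t) * z + t * x        # point on the affine path
                geometric_v = x - z                # velocity toward the test image
                model_v = (mu - x_t) / (1.0 - t)   # toy stand-in for the learned field
                heat[i] += (model_v - geometric_v) ** 2
    count = n_paths * len(times)
    return [h / count for h in heat]

# Hypothetical 4-pixel "image": pixel 2 deviates from the normal template.
template = [0.5, 0.5, 0.5, 0.5]
test_image = [0.5, 0.5, 1.5, 0.5]
print(mismatch_heatmap(test_image, template))
```

In this toy case the mismatch reduces to (mu - x) / (1 - t) per pixel, so it vanishes on normal pixels and is independent of the noise draw. A real model would add the irreducible denoising term that the abstract mentions.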

2605.23069 2026-05-25 cs.CL

DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge

Yusser Al Ghussin, Daniil Gurgurov, Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet, Simon Ostermann

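AI summary: To address gaps in the cultural knowledge of multilingual large language models, this work proposes an activation-steering approach that extracts language vectors from the parallel FLORES corpus and adapts multilingual models at inference time. The team entered both the multiple-choice and short-answer tracks of SemEval-2026 Task 7; the multiple-choice submission reached 86.96% accuracy and ranked 7th. The analysis shows that the effect of activation steering varies across languages and layers, suggesting that prompt design and activation steering should be optimized jointly for culturally aware tasks.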

Comments Accepted to The 20th International Workshop on Semantic Evaluation at ACL 2026

Abstract

Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply activation steering to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields modest and heterogeneous improvements on cultural reasoning: gains are strongly layer-sensitive, vary substantially across language-region pairs, with some configurations even degrading performance, and interact with prompt formulation, comparing generic and culturally conditioned prompts. Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference.
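Illustration (not the DFKI-MLT system): a minimal sketch of adding a steering vector to a residual-stream state. The abstract does not say how the language vectors are extracted from FLORES; the difference-of-means recipe below is a common choice and is an assumption here, as are the function names and the scaling factor `alpha`.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def language_vector(target_acts, source_acts):
    """Assumed recipe: mean activation on target-language sentences
    minus mean activation on their parallel source-language sentences."""
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to one residual-stream state.
    No model parameters are changed."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Hypothetical 2-dimensional activations for two parallel sentence pairs.
vec = language_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
print(steer([0.0, 0.0], vec, alpha=0.5))
```

In a real model the addition would be applied at the selected transformer layer during the forward pass; the layer index and `alpha` are the knobs that the post-hoc analysis describes as strongly layer-sensitive.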

2605.23068 2026-05-25 cs.CV

Millimeter-Wave Imaging for Anthropometry

Miriam Senne, Benjamin D. Killeen, Christoph Baur, Nassir Navab, Azade Farshad

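AI summary: This study proposes a contactless body measurement method based on millimeter-wave radar, addressing the shortcomings of conventional measurement tools in privacy, efficiency, and applicability. An optimization framework recovers 3D human shape from mmWave point clouds and extracts a comprehensive set of anthropometric measurements. Its core contribution is a vertex-weighting strategy that, combined with a parametric body model (SMPL), gives robust surface alignment and noise suppression, enabling a fast, privacy-preserving measurement workflow that needs no undressing and no cameras and suits clinical risk assessment across patient populations.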

Abstract

Body shape and circumferences are clinically informative biomarkers for risk stratification, including measures such as waist to hip ratio, limb and trunk girths, yet conventional tools such as manual tape measures and optical scanners often require undressing and sustained poses. These demands slow workflows, compromise dignity, and exclude many older adults and people with limited mobility. To make measurement fast and contactless, we leverage millimeter-wave (mmWave) radar, which preserves privacy and operates through typical clothing, enabling quick full-body acquisition. In this work, we present a new optimization-based framework to recover 3D human shape and extract a comprehensive set of anthropometric measurements from volumetric mmWave data. Our method introduces a weighted registration pipeline that fits a parametric body model (SMPL) directly to the noisy mmWave point cloud. The core of our contribution is a vertex-weighting strategy that modulates a Chamfer energy function for reliable surface alignment and noise elimination. We further stabilize the fit by incorporating a foot-ground plane constraint and pose priors, optimizing directly for the SMPL parameters. Together, these components enable a fast, privacy preserving workflow that delivers high fidelity body shape and measurements through clothing without cameras or disrobing and with minimal cooperation, supporting frequent risk oriented assessments in clinics and care facilities for patients of all ages and mobility levels.
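Illustration (not the authors' energy function): a minimal sketch of a vertex-weighted Chamfer distance between body-model vertices and a point cloud. The abstract does not specify how the vertex weights are chosen or how they enter the energy; the form below, where each vertex's weight scales every residual it takes part in, is an assumption, as are all names.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_chamfer(vertices, weights, points):
    """Two-sided Chamfer distance with per-vertex weights (assumed form).
    Requires at least one positive weight."""
    # Model -> cloud: each vertex to its nearest point, scaled by its weight.
    v2p = sum(w * min(sq_dist(v, p) for p in points)
              for v, w in zip(vertices, weights))
    # Cloud -> model: each point to its nearest vertex, scaled by that vertex's weight.
    p2v = 0.0
    for p in points:
        j = min(range(len(vertices)), key=lambda k: sq_dist(vertices[k], p))
        p2v += weights[j] * sq_dist(vertices[j], p)
    return v2p / sum(weights) + p2v / len(points)

# Hypothetical data: the point near vertex 0 is displaced by noise.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
print(weighted_chamfer(verts, [1.0, 1.0], cloud))  # uniform weights
print(weighted_chamfer(verts, [0.0, 1.0], cloud))  # vertex 0 down-weighted
```

Down-weighting a vertex removes its noisy neighborhood from the energy, which is one way a weighting scheme can combine surface alignment with noise rejection. A real fit would minimize this energy over SMPL pose and shape parameters rather than evaluate it on fixed vertices.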

2605.23061 2026-05-25 cs.LG cs.AI math.OC stat.ML

Time Machine: On the Power of Motion in Efficient Perception

Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara

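AI summary: This paper proposes a video representation learning method that treats motion as the central modality, aiming to overcome the limits of existing video models in temporal understanding and training cost. By representing motion in video as point tracks and training a masked autoencoder in a self-supervised manner, the model learns more efficient, fine-grained video representations. The method needs no language annotations, greatly reduces training data requirements, and performs on par with current state-of-the-art models across multiple tasks, pointing toward more efficient and more temporally aware video models.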

Abstract

Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware as well as more scalable.

2605.23043 2026-05-25 cs.CL stat.ML

HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation

Zewei Deng, Tinghan Ye, Liyan Xie

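AI summary: This paper proposes the HawkesLLM framework to address the accumulation of semantic uncertainty over time in agentic text-simulation systems. The method separates temporal influence modeling from text generation: a multivariate Hawkes process models the activation relationships between nodes, and a language model generates new content from the compact memory selected by the temporal model. Experiments on a GDELT news-cascade case study show that HawkesLLM improves late-stage semantic alignment under a limited prompt-memory budget.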

Comments 10 pages, 4 figures, Accepted at the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Abstract

Agentic text-simulation systems write in sequence, with each item becoming possible context for later steps. That makes uncertainty path-dependent: an early ambiguity can affect later outputs. This paper studies this problem with HawkesLLM, a framework that separates temporal influence modeling from text generation. We represent the cascade as a network whose nodes are text-generating agents. A multivariate Hawkes process models how these nodes activate over time and which earlier node outputs should influence later prompts. A language model then writes each new event from the compact memory selected by this temporal model. We evaluate the framework on a held-out Global Database of Events, Language, and Tone (GDELT) news-cascade case study. The diagnostics track semantic alignment with local held-out references and separate local drift from global drift. In this setting, HawkesLLM improves late-stage semantic alignment under a compact prompt-memory budget.
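Illustration (not the HawkesLLM code): a minimal sketch of a multivariate Hawkes intensity and of one possible way to pick prompt memory from it. The exponential kernel, the top-k selection rule, and all names and parameter values are assumptions; the abstract states neither the kernel nor the selection criterion.

```python
import math

def hawkes_intensity(i, t, events, mu, alpha, beta):
    """Intensity of node i at time t under an exponential kernel:
    mu[i] + sum over past events (s, j) of alpha[i][j] * beta * exp(-beta * (t - s)).
    `events` is a list of (time, node) pairs."""
    rate = mu[i]
    for s, j in events:
        if s < t:
            rate += alpha[i][j] * beta * math.exp(-beta * (t - s))
    return rate

def select_memory(i, t, events, alpha, beta, k=2):
    """Hypothetical memory rule: keep the k past events that contribute
    most to node i's intensity at time t."""
    scored = [(alpha[i][j] * beta * math.exp(-beta * (t - s)), s, j)
              for s, j in events if s < t]
    scored.sort(reverse=True)
    return [(s, j) for _, s, j in scored[:k]]

# Hypothetical two-node cascade.
mu = [0.1, 0.2]
alpha = [[0.0, 0.5], [0.3, 0.0]]  # alpha[i][j]: influence of node j on node i
events = [(0.0, 1), (0.5, 1), (0.9, 0)]
print(hawkes_intensity(0, 1.0, events, mu, alpha, beta=1.0))
print(select_memory(0, 1.0, events, alpha, beta=1.0, k=2))
```

The selected events would then supply the compact context for the language model's next prompt, which keeps the temporal model and the text generator separate as the abstract describes.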

2605.23040 2026-05-25 cs.LG

World Machine: Generative World Modeling for Time Series

Elton Cardoso do Nascimento, Alexandre da Silva Simões, Esther Luna Colombini, Ricardo Ribeiro Gudwin, Paula Dornhofer Paro Costa

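AI summary: This paper proposes World Machine, a generative world-modeling architecture for time-series data that aims at predictive understanding and controllable simulation of an environment. Built on the Transformer, the architecture introduces a latent-state mechanism that adapts to observations and contexts of varying length, and it improves on a conventional Transformer in compute and memory efficiency. Experiments on the synthetic Toy1D dataset validate the approach's feasibility and show its distinctive advantages over a conventional Transformer as well as the contribution of each training component.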

2605.23024 2026-05-25 cs.AI cs.CC cs.CL cs.LG

The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

Dongxin Guo

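AI summary: This thesis examines the boundaries that fundamental limits of computation impose on the design of trustworthy AI systems, and proposes turning impossibility theorems into system design rules. Its central result proves that the reasoning depth of large language models has an architecture-determined limit, the Deterministic Horizon, which is unaffected by training data size, adapter rank, or loss function and can be computed in advance from the model's layer count and embedding width. The thesis also applies the same theory across several AI subfields, yielding a catalogue of sixteen design specifications that provides theoretical grounding and design guidance for building more reliable AI systems.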

Comments PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

Abstract

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

2605.23019 2026-05-25 cs.LG

PACE: Two-Timescale Self-Evolution for Small Language Model Agents

Chen Ling, Pei Chen, Albert Guan, Jiaming Qu, Shayan Ali Akbar, Madhu Gopinathan, Erwin Cornejo

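AI summary: This paper studies whether frozen small language models (SLMs) can serve as effective self-evolving agents under resource constraints. The authors propose PACE, a two-timescale framework that coordinates low-risk prompt refinement with higher-risk control-logic updates, achieving reliable self-evolution without updating model weights or relying on frontier models. Experiments show that PACE outperforms the baseline methods across multiple benchmarks and markedly improves performance on complex tasks such as multi-turn tool use.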

Abstract

Deploying language-model agents in production often requires substantial compute and human effort to tune prompts, parsers, validators, and other components of the agent pipeline. Self-evolution offers a promising alternative, but most existing frameworks assume access to frontier models that can reliably diagnose failures, propose revisions, and judge their own updates. We study whether frozen small language models (SLMs) can serve as effective self-evolving agents under resource constraints. We propose PACE (Prompt And Control Logic Evolution), a two-timescale framework that coordinates low-risk prompt refinement with higher-risk control-logic updates. PACE evolves prompts under fixed control logic until prompt-level gains saturate, then considers constrained control-logic updates that are accepted through held-out validation. Across three frozen SLM backbones ranging from 4B to 14B parameters and four controlled benchmarks, PACE achieves the best performance on all 12 backbone-benchmark combinations, improving over vanilla SLM agents by up to +9.2% relative improvement and over the stronger single-mode evolution baseline by up to +5.4% relative improvement. A tau-bench case study further shows that PACE improves multi-turn tool-use success over vanilla and prompt-only evolution. These results suggest that reliable SLM agent self-evolution is possible without updating model weights or relying on frontier-model teachers, and that the key benefit is not any single final solver pattern but autonomous, validated discovery of task-appropriate inference strategies.

2605.23017 2026-05-25 cs.LG cs.GT

Smoothed Elicitation Complexity for Approximate $\Gamma$-calibration of Discrete Classification Tasks

Jessica Finocchiaro, Victor Ganson, Drona Khurana

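AI summary: This paper studies approximate $\Gamma$-calibration for discrete classification tasks. To address the prohibitive calibration complexity of multiclass classifiers, it uses Lipschitz continuous properties as an intermediate representation, which lowers that complexity. By constructing Lipschitz properties for strongly orderable discrete properties, the authors give the first theoretical results on approximate calibration of discrete properties, together with algorithms for designing these Lipschitz properties.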

Comments Working paper

Abstract

One prominent method of evaluating machine learning model trustworthiness is the notion of calibration. In the binary outcome setting, a probabilistic predictor is calibrated if outcomes are realized according to a model's distributional prediction, conditioned on this prediction. Straightforward extensions of binary calibration definitions to probabilistic multiclass classifiers suffer from an exponential complexity blowup as the space of predictions grows exponentially in the number of classes $n$. As a remedy, Noarov and Roth (2023) propose multiclass calibration with predictions that are properties of the outcome distribution, reducing complexity from growing in the number of classes $n$ to the dimension $d$ of the property, called its elicitation complexity. Previous work on approximate property calibration is generally limited to continuous scalar properties, despite many relevant properties of interest being discrete, like the mode or rankings. We characterize the approximate property calibration of discrete properties which are strongly orderable by using Lipschitz continuous properties as an intermediary. This work is the first to our knowledge to provide approximate calibration results for discrete properties. Along the way, we characterize the Lipschitz elicitation complexity of strongly orderable discrete properties by constructing algorithms for designing these Lipschitz properties, which we prove can be post-processed to obtain the original discrete property.
