arXivDaily: arXiv daily digest, updated Monday to Friday
2605.21338 2026-05-21 cs.CL

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

Yuefeng Shi, Nedjma Ousidhoum, Jose Camacho-Collados

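AI summary: This paper proposes a question-based evaluation framework of 470 manually curated questions that assesses LLMs' semantic understanding and reasoning over aggregated text data, revealing performance bottlenecks when LLMs process large-scale text collections.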

Abstract

LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly open-weight models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.

2605.21333 2026-05-21 cs.CL cs.AI

SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

Ting Liu

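AI summary: This paper presents SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. An exponential-decay aggregation path handles long-range memory and a spike-gated local attention path handles short-range precision; the model reaches high activation sparsity and is supported by sub-billion-scale pre-training evidence.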

Comments
35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript
Abstract

Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at >89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.
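
The binary Leaky Integrate-and-Fire dynamics and activation sparsity mentioned above can be illustrated with a generic discrete-time LIF neuron. This is only a textbook-style sketch, not the SymbolicLight V1 implementation: the decay factor `beta`, the `threshold`, the hard reset to zero, and both function names are assumptions chosen for illustration.

```python
def lif_spikes(inputs, beta=0.9, threshold=1.0):
    """Discrete-time LIF neuron: leak, integrate, fire, then hard reset."""
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane = beta * membrane + current  # leaky integration of the input
        if membrane >= threshold:
            spikes.append(1)                  # emit a binary spike
            membrane = 0.0                    # hard reset (an assumption)
        else:
            spikes.append(0)
    return spikes


def activation_sparsity(spikes):
    """Fraction of time steps on which the neuron stays silent."""
    return 1.0 - sum(spikes) / len(spikes)


spikes = lif_spikes([0.3] * 10)
print(spikes)  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

Because the membrane potential accumulates input over several steps before firing, the output depends on input history, which is the kind of temporal integration that a static top-k mask does not provide.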

2605.21330 2026-05-21 cs.RO

Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer

Senlan Yao, Chenyu Yang, Jaehoon Kim, Aristotelis Sympetheros, Robert K. Katzschmann

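AI summary: This paper studies in-hand manipulation from joint sensing alone and proposes the Proprioceptive Transformer (PT), which needs no external perception. A teacher policy trained with reinforcement learning is distilled into PT, enabling continuous cube rotation on a tendon-driven dexterous hand; experiments show that it outperforms baselines in rotation speed and position-estimation accuracy.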

Comments
8 pages, 6 figures, 3 tables
Abstract

In-hand object manipulation is a fundamental yet challenging capability for dexterous robots. Despite significant progress in dexterous manipulation, existing approaches rely heavily on vision or tactile sensing to track object states, while joint sensing -- the most readily available modality on any robotic hand -- remains largely overlooked, particularly for tendon-driven hands. In this paper, we study how far joint sensing alone can go by asking: (i) whether motor encoders or direct joint sensing provides better proprioceptive feedback, (ii) how to extract environment information from joint measurements, and (iii) whether joint-only control can achieve competitive real-world performance without external perception. We present the Proprioceptive Transformer (PT), an exteroceptive-free approach for continuous cube rotation on a tendon-driven dexterous hand that uses only joint sensing feedback. A teacher policy is first trained via reinforcement learning with privileged object information, then distilled into PT, which operates solely on joint position and velocity histories. The Transformer architecture effectively extracts implicit object state information from temporal patterns in joint sensor readings. Experiments on the real ORCA hand show that our approach achieves 3.1x higher rotation speed than baselines. We also demonstrate that our PT achieves a 23.4% lower RMSE for cube position estimation than the MLP baseline, indicating superior extraction of exteroceptive information from proprioceptive sources.

2605.21325 2026-05-21 cs.LG

Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers

Aleksandros Sobczyk, Gioele Gottardo, Christos K. Matzoros, Mirko De Vita, Filip Skogh, Anastasios Zouzias, Jiawei Zhuang

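AI summary: This paper studies fast and stable triangular matrix inversion for Delta-Rule linear Transformers. It analyzes direct and iterative algorithms that are rich in matrix products and can therefore use modern hardware efficiently, evaluates their performance and stability in low-precision floating point, and reports up to 4.3x faster triangular inversion, which improves layer-level performance while preserving end-to-end model accuracy.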

Comments
Preprint
Abstract

Linear attention has emerged as a cornerstone for efficient long-context architectures, as evidenced by its integration into state-of-the-art open-source models including Qwen3.5/3.6, Kimi Linear, and RWKV-7. Models that incorporate linear attention layers with the so-called Delta-Rule involve the inversion of triangular matrices as a core sub-routine. This operation often forms a performance bottleneck, and, due to its high sensitivity to numerical errors, it can significantly deteriorate end-to-end model accuracy if it is not carefully implemented. This work provides a systematic analysis of both direct and iterative triangular inversion algorithms, targeting methods that are rich in matrix products and therefore have the potential to efficiently utilize modern hardware. To that end, our analysis covers a broad spectrum of mathematical and practical aspects, with a heavy focus on numerical stability, computational complexity, and, ultimately, hardware efficiency and practical considerations. We provide a rigorous experimental evaluation to verify these properties in practical scenarios, and in low-precision floating-point representations, highlighting the strengths and limitations of each method. Performance benchmarks on NPUs reveal up to $4.3\times$ speed-up against the state-of-the-art implementations of SGLang for triangular matrix inversion, leading to significant performance improvements at the level of the entire layer, while maintaining full end-to-end model accuracy.
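
One well-known way to invert a unit triangular matrix using only matrix products is the product form of the Neumann series. For strictly lower-triangular A, the matrix M = -A is nilpotent, so (I + A)^-1 = (I + M)(I + M^2)(I + M^4)... terminates after about log2(n) factors. The sketch below illustrates that identity in plain Python. It is not claimed to be one of the algorithms analyzed in the paper, it ignores the low-precision stability questions the abstract raises, and all function names are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def invert_unit_lower(a_strict):
    """Invert I + A, with A strictly lower triangular, via matrix products only."""
    n = len(a_strict)
    m = [[-x for x in row] for row in a_strict]  # M = -A, so (I + A)^-1 = (I - M)^-1
    result = identity(n)
    power = 1
    while power < n:                                   # M^n = 0, so stop once 2^j >= n
        result = matmul(result, add(identity(n), m))   # multiply by (I + M^(2^j))
        m = matmul(m, m)                               # square: M^(2^j) -> M^(2^(j+1))
        power *= 2
    return result
```

For a 3x3 input this performs two squaring steps; the number of steps grows logarithmically with the matrix size, and each step is a dense matrix product.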

2605.21324 2026-05-21 q-bio.NC cs.LG

Stimulus symmetries can confound representational similarity analyses

Farhad Pashakhanloo, Jacob A. Zavatone-Veth

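AI summary: This study examines how symmetries in network inputs affect analyses based on representational similarity matrices (RSMs). It shows that functionally equivalent configurations can yield different RSMs, and that stochastic gradient descent or energetic regularization can generate sparse, drifting codes and hence drifting RSMs.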

Comments
40 pages
Abstract

What can representational similarity matrices (RSMs) tell us about a neural code? As the popularity of these summary statistics grows, so too does the need for a more complete characterization of their properties. Here, we show that symmetries in network inputs can confound RSM-based analyses. Stimulus symmetries render many representations functionally equivalent, but these different configurations can lead to different RSMs. These different RSMs reflect qualitatively different representational geometries. We show that stochastic gradient descent or energetic regularization can generate sparse, drifting codes, leading in turn to drifting RSMs. Moreover, we demonstrate that these phenomena are present in networks trained to encode image data, where the symmetry is latent. Our results illustrate the challenges inherent in comparing nonlinear neural codes, when functionally-equivalent representations are not related by a simple rotation.

2605.21322 2026-05-21 cs.LG

Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search

Chaimaa Medjadji, Sylvain Kubler, Yves Le Traon, Guilain Leduc, Sadi Alawadi, Feras M. Awaysheh

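AI summary: This paper proposes FedKDNAS, a framework that combines client-side neural architecture selection with server-coordinated knowledge distillation to address data heterogeneity, system heterogeneity, and communication efficiency in federated learning, improving the Pareto trade-off between accuracy and efficiency.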

Abstract

Federated Learning (FL) enables collaborative model training without centralizing data. However, real-world deployments must simultaneously address statistical heterogeneity across client data (non-IID), system heterogeneity in device capabilities, and communication efficiency. Existing FL approaches mitigate these challenges through improved aggregation, personalization, or knowledge distillation, but they almost universally assume a fixed client architecture, limiting adaptability to heterogeneous data complexity and hardware constraints. This architectural constraint often leads to suboptimal trade-offs between accuracy and efficiency in real-world FL systems. This work introduces FedKDNAS, a distillation-driven FL framework that combines client-side neural architecture selection with distillation of server-coordinated knowledge. Each client autonomously selects a lightweight model under accuracy-resource constraints. It then trains this model locally using a hybrid objective combining supervised learning and knowledge distillation and shares only predictions on a public reference set. The server then aggregates and smooths these predictions, optionally combining them with a teacher model, to produce stable distillation targets for the next round. Extensive evaluation on six datasets against six representative FL baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) demonstrates that FedKDNAS consistently achieves superior Pareto efficiency, improving accuracy by up to 15% under non-IID conditions, reducing client CPU usage by approximately 28%, and decreasing communication overhead by up to 44 times while maintaining lightweight logit-based communication.
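
The server step described above (aggregate client predictions on a public reference set, smooth them, optionally blend with a teacher) might look roughly like the following. The abstract does not specify the aggregation or smoothing rule, so the plain averaging, the blend toward a uniform distribution, and the parameters `smoothing` and `teacher_weight` are all assumptions made for this sketch.

```python
def aggregate_predictions(client_probs, teacher_probs=None,
                          smoothing=0.1, teacher_weight=0.5):
    """Build distillation targets from per-client class probabilities.

    client_probs[c][s][k] is client c's probability of class k on public sample s.
    """
    num_clients = len(client_probs)
    num_samples = len(client_probs[0])
    num_classes = len(client_probs[0][0])
    targets = []
    for s in range(num_samples):
        # Average the clients' predictions for this public sample.
        mean = [sum(client[s][k] for client in client_probs) / num_clients
                for k in range(num_classes)]
        # Smooth toward the uniform distribution (assumed smoothing rule).
        row = [(1.0 - smoothing) * p + smoothing / num_classes for p in mean]
        # Optionally blend with a teacher model's prediction.
        if teacher_probs is not None:
            row = [(1.0 - teacher_weight) * p + teacher_weight * t
                   for p, t in zip(row, teacher_probs[s])]
        targets.append(row)
    return targets
```

Only probability rows over the public set cross the network in this scheme, so the payload size is independent of each client's chosen architecture.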

2605.21318 2026-05-21 cs.CL cs.AI cs.LG

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang

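AI summary: This paper studies prompt distributional overfitting and proposes TextReg, a framework that realizes a soft-penalty objective through regularized textual gradients. It combines Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update to improve out-of-distribution (OOD) generalization.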

Comments
Code: https://github.com/luchengfu6/TextReg
Abstract

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

2605.21317 2026-05-21 cs.LG

CRAFT: Conflict-Resolved Aggregation for Federated Training

Ziqi Wang, Qiang Liu, Nils Thuerey

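AI summary: This paper proposes CRAFT, an aggregation framework that treats the global update as a geometric correction problem in order to resolve conflicting client updates in federated learning, improving global model accuracy and reducing performance disparity across clients.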

Abstract

The aggregation of conflicting client updates remains a fundamental bottleneck in federated learning (FL) under heterogeneous data distributions. Naive averaging can produce a global update that improves the global objective while conflicting with specific clients, causing degradation for those clients. In this work, we propose CRAFT (Conflict-Resolved Aggregation for Federated Training), a new aggregation framework that treats the global update as a geometric correction problem. We formulate aggregation as finding the update closest to a reference direction while satisfying conflict-free alignment constraints. We derive a closed-form expression for the constrained optimization problem, avoiding the computational overhead of iterative solvers. Furthermore, we use a layer-wise adaptation to address conflicts at varying feature granularities. We provide a theoretical analysis showing that CRAFT promotes a common-descent structure and mitigates conflicts through its projection geometry. Extensive experiments on heterogeneous benchmarks demonstrate that CRAFT improves the accuracy of the global model while reducing performance disparity across clients compared with state-of-the-art baselines. The source code for CRAFT is available at https://github.com/tum-pbs/CRAFT.
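
To make the idea of "the update closest to a reference direction while satisfying conflict-free alignment constraints" concrete, here is the simplest possible case: a single client constraint, where the closest vector with a non-negative inner product is the standard projection onto a half-space. This is a textbook result, not CRAFT's derived closed form (which the abstract does not state and which must handle many clients and layer-wise adaptation); the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_conflict_free(reference, client_update):
    """Closest vector to `reference` whose inner product with `client_update` is >= 0."""
    alignment = dot(reference, client_update)
    if alignment >= 0:
        return list(reference)  # no conflict with this client: keep the reference
    # Remove just enough of the client direction to make the inner product zero.
    scale = alignment / dot(client_update, client_update)
    return [r - scale * c for r, c in zip(reference, client_update)]


corrected = project_conflict_free([1.0, 0.0], [-1.0, 1.0])
print(corrected)  # [0.5, 0.5]
```

In the example, the reference direction conflicts with the client update (inner product -1), and the corrected update is orthogonal to it, so it no longer conflicts with that client to first order.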

2605.21313 2026-05-21 cs.LG

A New Framework to Analyse the Distributional Robustness of Deep Neural Networks

Divij Khaitan, Subhashis Banerjee

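AI summary: This paper proposes a framework that analyzes and quantifies the distributional robustness of deep neural networks by studying interactions between layer weights and activations. It demonstrates the framework on models trained on CIFAR-10 and ImageNet and shows that the proposed metrics distinguish networks that have memorised their training data from those that have not.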

Comments
9 pages, 6 figures, 3 tables
Abstract

Deep neural networks have achieved impressive performance on a variety of tasks, but their brittleness to distributional shifts remains a significant barrier to real-world deployment. In this paper, we propose a framework to analyse and quantify the distributional robustness of neural networks by studying the interactions between layer weights and activations. We model these interactions using Bernoulli distributions, using the separation between classes as a diagnostic proxy for robustness. We demonstrate the usefulness of this framework through models trained on CIFAR-10 and ImageNet. We show that our proposed metrics can distinguish between networks that have memorised their training data and those that have not. We also perform analogous experiments in the activation space and find that the same properties do not hold up. Additionally, we investigate the behaviour of our metrics under various distribution shifts and show that these shifts reduce separation under our path-based diagnostics. Our results suggest that this framework provides useful model-level diagnostics of representation structure and robustness.

2605.21312 2026-05-21 cs.DC cs.AI cs.LG

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Yicheng Feng, Xin Tan, Yangtao Deng, Yimin Jiang, Yibo Zhu, Hong Xu

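AI summary: This paper presents Frontier, a discrete-event simulator for modern LLM inference serving. Using a disaggregated abstraction and models of key runtime optimizations, it predicts computation, communication, and memory costs accurately across diverse serving scenarios and complex workloads.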

Abstract

Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On a testbed of 16 H800 GPUs, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration.
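
For readers unfamiliar with discrete-event simulation, the toy loop below shows the core mechanism: a priority queue of timestamped events drives requests through a prefill worker and then a separate decode worker, loosely echoing prefill-decode disaggregation. It is a deliberately minimal sketch and bears no relation to Frontier's actual cost models or scheduler. The fixed `prefill_time` and `decode_time`, the single worker per role, and the function name are all assumptions.

```python
import heapq


def simulate(arrivals, prefill_time=2.0, decode_time=3.0):
    """Return {request_id: completion_time} with one prefill and one decode worker."""
    events = [(t, req, "arrive") for req, t in enumerate(arrivals)]
    heapq.heapify(events)
    prefill_free = 0.0  # time at which the prefill worker is next idle
    decode_free = 0.0   # time at which the decode worker is next idle
    done = {}
    while events:
        now, req, kind = heapq.heappop(events)  # always process the earliest event
        if kind == "arrive":
            start = max(now, prefill_free)      # wait if the prefill worker is busy
            prefill_free = start + prefill_time
            heapq.heappush(events, (prefill_free, req, "prefilled"))
        elif kind == "prefilled":
            start = max(now, decode_free)       # hand off to the decode worker
            decode_free = start + decode_time
            heapq.heappush(events, (decode_free, req, "finished"))
        else:
            done[req] = now
    return done


print(simulate([0.0, 0.0, 1.0]))  # {0: 5.0, 1: 8.0, 2: 11.0}
```

Per-request latency is the completion time minus the arrival time, so queueing delay at either worker shows up directly in the result rather than being averaged away.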

2605.21311 2026-05-21 cs.LG cs.AI

DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

Bibek Poudel, Lei Zhu, Kevin Heaslip, Sai Swaminathan, Weizi Li

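AI summary: This paper proposes DeCoR, a reinforcement learning framework that co-optimizes crosswalk layout and network-level signal control for urban streets, reducing the time pedestrians need to reach their nearest crosswalk and substantially cutting pedestrian and vehicle wait times.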

Comments
22 pages, 8 figures
Abstract

Modern vision systems can detect, track, and forecast urban actors at scale, yet translating perception outputs to urban design remains limited. We introduce DeCoR, a two-stage reinforcement learning framework that leverages flow observations to co-optimize crosswalk layout and network-level signal control. The design stage encodes the pedestrian network as a graph and learns a generative policy that parameterizes a Gaussian mixture model over crosswalk location and width, from which new crosswalks are sampled. For each layout, a shared control policy learns adaptive signal timings to minimize joint pedestrian and vehicle delay. On a 750 m real-world urban corridor with demand sensed from video and Wi-Fi logs, DeCoR learns a layout that reduces pedestrian arrival time to their nearest crosswalk by 23% while using fewer crosswalks than existing configurations. On the control side, DeCoR reduces pedestrian and vehicle wait time by 79% and 65%, respectively, relative to fixed-time signalization. Further, the control policy generalizes to demands outside of training and is robust to layout changes without retraining.
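
Sampling new crosswalks from a Gaussian mixture over location and width, as the design stage describes, can be sketched as follows. In DeCoR the mixture parameters come from a learned generative policy; here they are passed in by hand. The diagonal covariance, the clipping to the corridor, the 1 m minimum width, and the function name are assumptions for illustration; only the 750 m corridor length comes from the abstract.

```python
import random


def sample_crosswalks(weights, means, stds, count, corridor_m=750.0, seed=0):
    """Draw (location_m, width_m) pairs from a diagonal-covariance Gaussian mixture."""
    rng = random.Random(seed)
    crosswalks = []
    for _ in range(count):
        # Pick a mixture component in proportion to its weight.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        location = rng.gauss(means[k][0], stds[k][0])
        width = rng.gauss(means[k][1], stds[k][1])
        location = min(max(location, 0.0), corridor_m)  # keep inside the corridor
        width = max(width, 1.0)                         # assumed minimum width
        crosswalks.append((location, width))
    return crosswalks


layout = sample_crosswalks(
    weights=[0.5, 0.5],
    means=[(100.0, 3.0), (600.0, 4.0)],
    stds=[(20.0, 0.5), (30.0, 0.5)],
    count=5,
)
```

Each sampled layout would then be evaluated by the control policy, and the resulting delay would feed back into the mixture parameters.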

2605.21309 2026-05-21 cs.CV cs.RO

Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

Abhishek Dinkar Jagtap, Sanath Tiptur Sadashivaiah, Andreas Festag

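AI summary: This paper proposes Hyper-V2X, a hypernetwork-based framework for estimating epistemic and aleatoric uncertainty in cooperative V2X perception. A partial weight generation scheme and a V2X context embedding module condition a Bayesian hypernetwork to generate weight distributions for stochastic bird's-eye-view segmentation, improving perception reliability.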

Comments
Accepted for IEEE Intelligent Vehicle Symposium (IV) 2026
Abstract

Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computational overhead. Our approach is architecture-agnostic and can be seamlessly integrated with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: https://github.com/abhishekjagtap1/Hyper-V2X

2605.21308 2026-05-21 cs.CV cs.AI

Deformba: Vision State Space Model with Adaptive State Fusion

Hongyu Ke, Jack Morris, Yongkang Liu, Satoshi Kitai, Kentaro Oguchi, Yi Ding, Haoxin Wang

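AI summary: This paper proposes Deformba, a context-adaptive method that dynamically augments spatial structural information while keeping the linear complexity of state space models. It also supports cross-attention-like multi-modal fusion and shows broad applicability across 2D and 3D vision tasks.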

Journal ref
Forty-Third International Conference on Machine Learning (ICML 2026)
Abstract

State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. Such a fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context-adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion in the manner of cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

2605.21303 2026-05-21 cs.LG cs.AI cs.LO

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

Nura Aljaafari, Danilo S. Carvalho, Andre Freitas

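AI summary: This paper proposes an inductive-logic approach that treats circuit interpretation as inductive theory construction, providing formal infrastructure for cumulative mechanistic science. Causal Functional Signatures and architectural signatures make mechanistic claims explicit and portable across model scales.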

Comments
27 pages, 10 Figures, 14 Tables
Abstract

Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature $\tau_{\mathrm{arch}}$, learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via $\theta$-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.

2605.21301 2026-05-21 cs.LG cs.CV

Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls

Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, Pietro Gori

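AI summary: This paper proposes a method that discovers interpretable, homogeneous disease subgroups by contrasting patients with healthy controls, and shows improved subgroup estimation quality on medical imaging datasets.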

Comments
Accepted to Data Mining and Knowledge Discovery, ECML-PKDD 2026 Journal Track
Abstract

In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation-Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease-specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on an MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: https://github.com/rlouiset/deep_ucsl.

2605.21300 2026-05-21 cs.CV

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

通过强调图像负样本token减少LVLMs中的物体幻觉

Meng Shen, Minghao Wu, Deepu Rajan

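AI summary: This paper reduces object hallucination in LVLMs by emphasizing image-negative tokens, proposing per-token training-weight adjustment and a data filtering strategy to control hallucination.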

详情
Comments
20 pages, 10 figures, 10 tables
AI中文摘要

物体幻觉是阻碍大型视觉-语言模型(LVLMs)在实践中应用的重要挑战。我们假设幻觉的一个可能来源是模型倾向于优先生成文本而非与图像进行有意义的交互。为此,我们研究了生成过程并将文本token分为三类:图像正样本、不变样本和负样本,基于它们对输入图像token的视觉依赖性。我们的分析发现,大多数生成的token对图像信息影响很小。这表明在模型训练阶段,更强调学习如何遵循文本指令,而非从图像中提取信息。基于此发现,我们提出根据token的视觉依赖性调整训练权重以控制幻觉。此外,我们移除一部分可能包含更多幻觉的训练数据作为数据过滤策略。这两种方法在不牺牲响应长度或引入额外计算成本的情况下减少了幻觉。我们验证了我们的方法在三个LVLM变体上的有效性,展示了其有效性和通用性。

英文摘要

Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.

2605.21299 2026-05-21 cs.CL cs.AI

Tracing the ongoing emergence of human-like reasoning in Large Language Models

追踪大型语言模型中类人推理的持续涌现

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

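AI summary: The study examines whether large language models reason like humans on conditional inference tasks. Humans enrich logical reasoning with pragmatic inferences, whereas model behavior is more variable: some models follow the semantics of conditionals but ignore pragmatic inferences, indicating that LLMs are semantically accurate but lack the pragmatic enrichment characteristic of human reasoning.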

详情
AI中文摘要

人类能够超越字面意义:如果你修剪草坪,我会给你五十美元,通常被理解为说话者只在草坪修剪时支付,而如果你饿了,烤箱里有披萨,意味着披萨无论听者是否饥饿都可用。大型语言模型(LLMs)在许多任务上表现出类人性能,但尚不清楚它们是否像人类一样推理。为此,我们进行了一项人口匹配实验,评估了25个LLMs在四种语言中计算条件推理的能力,并与每种语言中等数量的人类进行比较。我们发现,人类通过跨语言的语用推理丰富逻辑推理。模型行为更具变异性。一些LLMs完全遵循条件语义的真值表,但忽视语用推理,而另一些LLMs偏离真值表,坚持单一解释,从而反映准确的规则处理但不具有类人推理能力。总体而言,LLMs是准确的语义运算符,但未能捕捉到人类推理中特有的语用丰富性。关键的是,LLM的准确性既不被开放与封闭状态、训练方向或架构类型所预测或提升,表明语用推理仍然是人工系统认知工具包中正在兴起的能力。

英文摘要

Humans effortlessly go beyond literal meanings: "If you mow the lawn, I will give you fifty dollars" is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas "If you are hungry, there is pizza in the oven" implies that pizza is available regardless of the hearer's hunger. Large Language Models (LLMs) show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twenty-five LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

2605.21295 2026-05-21 cs.LG cs.AI cs.HC

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

TimeSRL: 通过语义RL调优的LLM实现通用的时间序列行为建模 -- 一项心理健康应用的案例研究

Yuang Fan, Lilin Xu, Millie Wu, Jingping Nie, Qingyu Chen, Yuzhe Yang, Zhuo Zhang, Xin Liu, Subigya Nepal, Xiaofan Jiang, Xuhai "Orson" Xu

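AI summary: This paper proposes TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck: raw signals are abstracted into high-level natural language, from which behavioral outcomes are predicted. The method achieves state-of-the-art cross-cohort generalization in mental-health prediction.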

详情
AI中文摘要

纵向被动传感能够实现连续健康预测,但模型在跨数据集分布偏移下往往失效。传统机器学习容易过拟合群体特异性特征,而大型语言模型(LLMs)在长且异质的时间序列上难以可靠推理。我们引入TimeSRL,一种两阶段LLM框架,通过显式的语义瓶颈路由预测。模型首先将原始信号抽象为高级自然语言,然后仅从这些抽象中预测行为结果。这迫使模型在我们认为泛化更好的语义概念上进行推理。我们通过组相对策略优化(GRPO)结合可验证奖励的强化学习(RLVR)端到端优化这一过程,学习与结果对齐的抽象,而无需金标准中间注释。在心理健康预测中,TimeSRL在设计用于在严格的一留一数据集-out(LOSO)协议下压力测试跨群体泛化能力的基准上实现了最先进的性能,将焦虑的均绝对误差(MAE)在强大的非LLM ML和LLM基线模型上分别降低了3.1-10.1%和9.5-44.1%,抑郁的MAE则降低了3.2-9.6%和27.4-57.6%(所有p值<0.05)。TimeSRL在不同传感管道上的跨基准迁移中显著优于先前方法,在不进行目标领域微调的情况下,其性能与自身在领域内性能相当。这些结果表明语义抽象具有可重用性,并指出了通过RL调优的LLM实现通用行为建模的新方向。

英文摘要

Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1-10.1% and 9.5-44.1% for anxiety, and 3.2-9.6% and 27.4-57.6% for depression (all $p$ < 0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.
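
Illustrative sketch (not from the paper): the "group relative" part of GRPO can be pictured as scoring several sampled outputs for the same input with a verifiable reward and normalizing those rewards within the group. The negative-absolute-error reward, the function names, and the sample values below are assumptions made for illustration; the abstract does not describe TimeSRL's actual reward design.

```python
from statistics import mean, pstdev

def absolute_error_reward(prediction: float, target: float) -> float:
    """Hypothetical verifiable reward: higher is better, 0 means a perfect prediction."""
    return -abs(prediction - target)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, roughly unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled predictions of a score for the same input (made-up numbers).
predictions = [4.0, 7.0, 10.0, 13.0]
target = 9.0
rewards = [absolute_error_reward(p, target) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Samples that land closer to the target than the group average receive a positive advantage, and the rest a negative one, so no separately trained value model is needed to rank them.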

2605.21292 2026-05-21 stat.ML cs.AI cs.LG math.DS

Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

双因子线性变换器模型的大步训练动态

Krishnakumar Balasubramanian

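AI summary: This paper studies the training dynamics of a two-factor linear transformer model at large learning rates. The analysis shows that large step sizes can change the training attractor rather than merely accelerate convergence; beyond stability thresholds, training may enter cycles, bounded chaos, or divergence.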

详情
AI中文摘要

梯度流分析显示,简化的线性变换器可以学习上下文线性回归算法,但无法解释大学习率下梯度下降的有限步行为。受高学习率变换器不稳定性实证研究和二次回归的立方图相图启发,我们研究了一个可以简化为单提示线性变换器训练问题的恰好可约问题。归一化后,动态减少为一个双因子乘积映射,具有有效步长参数μ。在平衡切片上,该映射恢复了已知的标量立方过渡,从单调收敛到飞弹收敛,周期性和有界非收敛,以及发散。我们随后分析了完整的二维系统,显示对于0<μ<2,它有一个显式不变的切比雪夫椭圆,将前向不变区域分开;该椭圆承载着不平衡的混沌动态,但横向排斥,而平衡标量吸引子可以横向吸引。这些结果表明,大常数学习率可以改变学习变换器的训练吸引子,而不仅仅是加速收敛:在稳定性阈值之外,有限步训练可能进入循环、有界混沌或发散,而不是单一的上下文线性回归解。我们还讨论了这对基于小批量梯度下降训练方法的影响。

英文摘要

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter \(μ\). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for \(0<μ<2\), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.
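
Illustrative sketch (not the paper's normalized map): a generic two-factor toy problem shows how a cubic scalar map arises. Gradient descent on the assumed loss 0.5 * (a*b - 1)^2 updates both factors by a product rule, and on the balanced slice a = b = x it reduces to x <- x - lr * (x^2 - 1) * x. For this toy map the fixed point x = 1 is linearly stable only when lr < 1, since the derivative of the map there is 1 - 2*lr. The loss, the step sizes, and the blow-up cutoff are all assumptions chosen for illustration.

```python
def gd_step(a: float, b: float, lr: float):
    """One gradient-descent step on the toy loss 0.5 * (a*b - 1)**2."""
    residual = a * b - 1.0
    return a - lr * residual * b, b - lr * residual * a

def run(a: float, b: float, lr: float, steps: int = 500, blowup: float = 1e6):
    """Iterate the map; stop early once either factor exceeds `blowup` in magnitude."""
    for _ in range(steps):
        a, b = gd_step(a, b, lr)
        if abs(a) > blowup or abs(b) > blowup:
            break
    return a, b

small = run(0.5, 0.5, lr=0.1)  # balanced start, small step: settles at a*b = 1
large = run(0.5, 0.5, lr=3.0)  # balanced start, large step: leaves every bounded region
```

Trying intermediate step sizes in `run` is a cheap way to see the non-monotone regimes that lie between these two extremes in the toy problem.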

2605.21288 2026-05-21 cs.LG

A Mechanistic Study of Tabular Foundation Models

表格基础模型的机理研究

Marin Biloš, James T. Wilson, Anderson Schneider, Yuriy Nevmyvaka

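AI summary: This paper studies tabular foundation models whose different architectures converge in accuracy on classification and regression tasks, giving a mechanistic account of their internal algorithms, the origin of their symmetries, and their robustness to perturbations; it finds that the representation collapse noted in prior work is not a practical concern.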

详情
AI中文摘要

表格基础模型在不同架构下在多种分类和回归任务中表现出准确性的收敛。这引发了排行榜无法回答的问题:(i)这些模型是否执行相同的上下文算法?(ii)行、列和类置换不变性来源在哪里?(iii)在针对推断机制设计的扰动下,它们的鲁棒性如何?我们对这三个问题进行了特征化。模型家族实现了质上不同的相似性基于读取:从加权投票上下文标签到类条件均值读取,每种都通过因果干预得到验证。我们发现先前工作中强调的表示崩溃并非这些模型的实际问题。每个模型的置换不变性可以追溯到特定的位置参数,移除这些参数可保持准确性并使近似不变性变为精确。针对每个读取设计的扰动复现了预测的失败模式;枢纽和排名攻击将它们与重训练基线隔离。这些结果共同提供了当前表格基础模型的机理解释,并识别了哪些归纳偏置同时决定了其准确性和特征性失败。

英文摘要

Tabular foundation models with different architectures converge in accuracy across a range of classification and regression tasks. This raises questions a leaderboard cannot answer: (i) whether the models execute the same in-context algorithm, (ii) where row, column, and class-permutation invariances originate, and (iii) how robust they are under perturbations engineered against the inferred mechanism. We characterize all three. The model families realize qualitatively distinct similarity-based readouts: from an attention-weighted vote over context labels to a class-conditional mean readout, each confirmed by causal intervention. We find that the representation collapse highlighted in prior work is not a practical concern for them. Each model's permutation invariances trace to specific positional parameters whose removal preserves accuracy and makes approximate invariance exact. Perturbations engineered against each readout reproduce predicted failure modes; hub and rank attacks isolate them from refit baselines. Together these results give a mechanistic account of contemporary tabular foundation models and identify which inductive biases govern both their accuracy and characteristic failures.
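
Illustrative sketch (not extracted from the studied models): the two readout types named in the abstract can be contrasted on a toy in-context classification problem. The dot-product similarity, the temperature parameter, the squared-distance rule, and the data are assumptions chosen to keep the example small.

```python
import math

def attention_vote(query, context_x, context_y, n_classes, temperature=1.0):
    """Softmax-weighted vote over context labels, using dot-product similarity."""
    scores = [sum(q * x for q, x in zip(query, row)) / temperature for row in context_x]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [0.0] * n_classes
    for w, label in zip(weights, context_y):
        probs[label] += w / total
    return probs

def class_mean_readout(query, context_x, context_y, n_classes):
    """Predict the class whose mean context vector is closest to the query."""
    dim = len(query)
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for row, label in zip(context_x, context_y):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value
    best_class, best_dist = None, math.inf
    for c in range(n_classes):
        if counts[c] == 0:
            continue  # class absent from the context
        dist = sum((q - s / counts[c]) ** 2 for q, s in zip(query, sums[c]))
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class

context_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
context_y = [0, 0, 1, 1]
query = [0.8, 0.2]
vote = attention_vote(query, context_x, context_y, n_classes=2)
nearest = class_mean_readout(query, context_x, context_y, n_classes=2)
```

Both functions are invariant to the order of the context rows, which is the kind of row-permutation invariance the abstract refers to; the vote depends on every individual context row, while the mean readout depends only on per-class averages.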

2605.21280 2026-05-21 cs.CV

Let EEG Models Learn EEG

让EEG模型学习EEG

Yifan Wang, Yijia Ma, Wen Li, Chenyu You

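AI summary: This paper proposes JET, a generative framework based on conditional flow matching that generates high-quality EEG signals by directly modeling the continuous evolution of neural signals. It addresses the shortcomings of discrete denoising methods in capturing long-range temporal dependencies and preserving spectral structure, and outperforms existing methods on multiple benchmarks.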

详情
Comments
Accepted by ICML 2026
AI中文摘要

高保真度的EEG生成对于缓解大规模神经建模中的数据稀缺和隐私约束至关重要。尽管近年来取得了进展,但大多数现有方法通过离散去噪目标来生成EEG,这无法充分反映神经活动本质上连续的时间动态和频谱结构。因此,这些方法往往难以保持长期时间依赖性,并且生成信号在频谱和时间结构上存在不匹配。在本文中,我们主张有效的EEG生成需要能够直接操作神经信号连续演化的模型。我们引入了Just EEG Transformer (JET),一种基于条件流匹配的生成框架,将EEG建模为沿着连续轨迹演变的原始序列。通过学习一个平滑的向量场,将噪声传输到EEG数据分布,JET在不依赖离散去噪方案或领域特定表示的情况下捕捉时间连续性和瞬态动态。为了确保学习到的动力学与EEG信号的关键属性保持一致,我们引入了保留频谱结构、时间平稳性和信号级统计的原理性约束。在三个大规模基准测试中,JET一致地实现了最先进的性能,相比强大的基线,将TS-FID降低了超过40%。广泛的分析显示,JET捕捉了神经动态的关键结构特性,提供了一种可扩展且原理性的EEG生成方法。项目页面:https://y-research-sbu.github.io/JET/

英文摘要

High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: https://y-research-sbu.github.io/JET/
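
Illustrative sketch (not JET's implementation): a flow-matching training example pairs a point on a path from noise to data with the velocity a model should predict there. The straight-line path, the omission of any conditioning input, the function names, and the toy signal are assumptions; the abstract does not state which path or constraints JET uses in detail.

```python
import random

def sample_training_pair(x1, rng):
    """Build one flow-matching example for a data sequence x1 (list of floats).

    Uses the straight-line path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1,
    whose velocity target is x1 - x0. The path choice is an assumption.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, t, target

def flow_matching_loss(predicted_velocity, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target)) / len(target)

rng = random.Random(0)
signal = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]  # stand-in for one EEG channel
xt, t, target = sample_training_pair(signal, rng)
perfect = flow_matching_loss(target, target)                 # a model that predicts the target
zero_model = flow_matching_loss([0.0] * len(target), target)  # a model that predicts no motion
```

Following the target velocity from `xt` for the remaining time `1 - t` lands exactly on the data sequence, which is why regressing onto this velocity teaches a model to transport noise toward data.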

2605.21272 2026-05-21 cs.CV cs.AI

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

MONET:一个大规模、开放、非冗余且增强的文本到图像数据集

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

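AI summary: This paper presents the MONET dataset, which provides high-quality text-to-image data through multi-stage filtering and enrichment, lowering the barrier to large-scale reproducible research.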

详情
AI中文摘要

训练大型文本到图像模型需要高质量、经过精心编纂的数据集,具有多样内容和详细的描述。然而,收集、过滤、去重和重新描述此类语料库的高昂成本和复杂性阻碍了该领域的开放和可重复研究。我们介绍了MONET,一个开放的Apache 2.0数据集,包含约1.049亿个图像-文本对,这些数据来自29亿个原始对,通过多阶段的安全过滤、领域过滤、精确和近似去重以及使用多种视觉-语言模型重新描述,覆盖短到长形式的描述,并进一步通过合成生成样本增强。每个图像都配有预计算的嵌入和注释,以加速下游使用。为了验证MONET的有效性,我们仅使用它训练了一个40亿参数的潜在扩散模型,并在GenEval和DPG评分中达到了具有竞争力的结果,证明我们的数据集降低了大规模、可重复文本到图像研究的门槛。

英文摘要

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image-text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

2605.21266 2026-05-21 cs.LG cs.AI

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

在线强化学习需要多少?用于RLVR中离线偏好优化的信息性回放

Richa Verma, Balaraman Ravindran

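AI summary: This paper proposes G2D, which combines a short GRPO warm-up, construction of a static preference dataset, and offline DPO fine-tuning to outperform GRPO at lower compute cost, emphasizing that the informativeness of preference data matters more than its quantity.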

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为语言模型推理的强大范式,GRPO是其主要例子。然而,GRPO需要连续在线回放生成,这使它计算成本高且难以扩展。尽管直接偏好优化(DPO)提供了稳定的离线替代方案,但通常在训练时表现不如在线RL方法如GRPO。我们引入G2D(GRPO到DPO),一个三阶段流程,进行短GRPO预热,构建静态偏好数据集,并使用DPO离线微调模型。在Qwen2.5-7B和Llama-3.1-8B上,我们发现离线DPO在适度预热下能以显著更低的计算成本匹配或超越GRPO。在Qwen2.5-7B上,G2D在K=150时在MATH-500上达到62.4%,比GRPO(51.6%)高出10.8%,计算成本低约4倍。在Llama-3.1-8B上,G2D在K=500时达到49.4%,在实验设置中超越GRPO。我们表明性能不取决于偏好对的数量,而取决于其信息性。适度预热产生校准的不确定性回放,产生更强的对比信号,而过度预热导致过于自信的策略和信息较少的数据。我们的结果将RLVR中的离线-在线差距重新定义为主要的数据信息性问题,并识别了适当难度校准的离线微调数据集的短在线RL预热作为计算高效的在线RL替代方案。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform w.r.t. online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO), a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a set of values of the number of online steps (K) in GRPO on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost in our setting. On Qwen2.5-7B, G2D at K=150 achieves 62.4% on MATH-500, outperforming GRPO (51.6%) by 10.8 percentage points at ~4x lower compute. On Llama-3.1-8B, G2D at K=500 achieves 49.4%, surpassing GRPO in our experimental setting. We show that performance is not governed by the number of preference pairs, which does not vary much w.r.t. K, but by their informativeness. Moderate warm-up produces rollouts with calibrated uncertainty, yielding stronger contrastive signal, while excessive warm-up leads to overconfident policies and less informative data. Our results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.
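
Illustrative sketch (not the authors' pipeline): the second and third stages can be pictured as turning verifier-scored rollouts into (chosen, rejected) pairs and then applying the standard DPO loss to each pair. The pairing rule used here (first correct response against first wrong response), the 0/1 rewards, and the example prompts are assumptions; the abstract does not say how G2D selects pairs.

```python
import math

def build_preference_pairs(rollouts):
    """rollouts: {prompt: [(response, reward), ...]} with verifiable 0/1 rewards.

    Returns (prompt, chosen, rejected) triples. Prompts whose rollouts are all
    correct or all wrong yield no pair, since they carry no contrast.
    """
    pairs = []
    for prompt, samples in rollouts.items():
        correct = [resp for resp, reward in samples if reward == 1]
        wrong = [resp for resp, reward in samples if reward == 0]
        if correct and wrong:
            pairs.append((prompt, correct[0], wrong[0]))
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

rollouts = {
    "2 + 2 = ?": [("4", 1), ("5", 0), ("4", 1)],
    "3 * 3 = ?": [("9", 1), ("9", 1)],  # all correct: no contrast
    "7 - 5 = ?": [("3", 0), ("1", 0)],  # all wrong: no contrast
}
pairs = build_preference_pairs(rollouts)
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)  # policy equal to the reference
```

Only prompts with mixed outcomes contribute pairs in this sketch, which gives one concrete reading of why a policy that is too weak or too confident on its training prompts produces less informative preference data.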

2605.20706 2026-05-21 cs.DC cs.AI cs.LG

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

网络上的Llamas:基于WebGPU的内存高效、性能可移植和多精度LLM推理

Reese Levine, Rithik Sharma, Nikhil Jain, Abhijit Ramesh, Zheyuan Chen, Neha Abbas, James Contini, Tyler Sorensen

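AI summary: This paper presents LlamaWeb, a WebGPU-based LLM inference framework that reduces memory overhead through static memory planning and efficient model loading and supports multiple model weight formats, enabling memory-efficient, performance-portable LLM inference.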

详情
Comments
19 pages, 11 figures, 5 tables
AI中文摘要

在浏览器中运行语言模型提供了一个独特的机会,可以构建高效、私有且可移植的AI应用,但需要应对受限的内存可用性和异构硬件目标。为了实现这一机会,我们提出了Llamas on the Web(LlamaWeb),一种针对llama.cpp的WebGPU后端,能够在浏览器中实现内存高效且性能可移植的LLM推理,适用于广泛范围的模型权重格式。我们的设计通过静态内存规划和高效的模型加载显著减少了内存开销,通过可调的内核库解决了跨设备的差异性,并引入了模板化的GPU内核,支持多种量化格式的高性能实现,从而实现了广泛模型支持和对新格式的扩展性。我们评估了LlamaWeb在16个设备上,收集了10个语言模型和四种模型权重格式的数据。我们比较了LlamaWeb与现有的浏览器LLM框架,发现LlamaWeb在多种设备、浏览器和操作系统组合下需要29-33%更少的内存。我们还评估了LlamaWeb的性能,发现其在四个不同供应商的GPU上解码吞吐量提高了45-69%。此外,我们还比较了LlamaWeb与其他llama.cpp后端的性能,发现其在某些设备上与甚至超越了供应商特定的后端性能。

英文摘要

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

2605.19362 2026-05-21 cs.HC cs.AI

Toward User Comprehension Supports for LLM Agent Skill Specifications

向LLM代理技能规范提供用户理解支持

Zikai Alex Wen

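AI summary: The study examines whether skill specifications help users form bounded expectations about what a skill consumes, produces, and covers. Analyzing textual cues in 878 cybersecurity skills, it finds that only a minority of specifications contain the needed cues, and argues that specifications should be treated as user-facing capability disclosures rather than mere containers for executable instructions.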

详情
Comments
To appear at ACM CAIS Workshop Agent Skill 2026
AI中文摘要

用户经常通过SKILL markdown规范来解释和选择代理技能。为了保护用户,现有审核主要关注恶意或不安全的技能。我们研究了互补问题:规范是否帮助用户形成对技能消耗、产生和覆盖范围的有限预期。在878个网络安全技能中,我们使用基于规则的编码来测量四个理解锚点的文本线索,即操作基础、输出合同、边界披露和示例能力演示。操作基础的线索较为常见,但仅有19.0%的规范包含示例任务、样本或预期结果的线索,仅2.3%的规范包含所有四个锚点的线索。我们进一步检查了一个小型DNS/C2遥测子集(n=6)以说明缺失示例可能带来的影响。示例似乎使首次本地检查更容易构建,而无示例的技能通常需要辅助代码检查来恢复命令参数或输出字段。我们主张代理技能评估应将规范视为面向用户的能劾示范,而非仅仅是执行指令的容器。

英文摘要

Users often interpret and select agent skills through their SKILL markdown specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule-based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset (n = 6) to illustrate why missing examples may matter. Examples appeared to make first local checks easier to construct, while no-example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent-skill evaluation should treat specifications as user-facing capability disclosures, not merely as containers for executable instructions.

2605.19269 2026-05-21 cs.LG

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

CODA:将Transformer块重写为GEMM-epilogue程序

Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, Tri Dao

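AI summary: This paper proposes CODA, a GPU kernel abstraction that rewrites the non-attention computation in Transformers as GEMM-epilogue programs to improve training efficiency and hardware utilization.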

详情
AI中文摘要

Transformer训练系统围绕密集线性代数构建,但端到端时间的非平凡分数耗费在周围的内存绑定操作上。归一化、激活、残差更新、减少和相关计算反复在全局内存中移动大型中间张量,同时进行很少的算术运算,使数据移动成为在高度优化的训练堆栈中越来越重要的瓶颈。我们引入CODA,一种GPU内核抽象,将这些计算表示为GEMM-plus-epilogue程序。CODA基于观察到许多Transformer运算作为独立框架内核暴露时,可以重新参数化为在GEMM输出瓷砖留在芯片上执行,然后再写入内存。该抽象固定了GEMM主循环,并暴露了一组小的可组合的epilogue原语用于缩放、减少、配对转换和累积。这种受限的接口保留了专家编写GEMM的性能结构,同时足够表达以覆盖标准Transformer块正向和反向传递中几乎所有非注意力计算。在代表性的Transformer工作负载上,无论是人工还是LLM编写的CODA内核均实现了高性能,表明GEMM-plus-epilogue编程为结合框架级生产力与硬件级效率提供了一条实际路径。

英文摘要

Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
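
Illustrative sketch (plain Python, not CODA's API and not GPU code): the GEMM-plus-epilogue idea is that an elementwise operation is applied to each output tile while it is still held locally, instead of writing the full product to memory and reading it back in a second kernel. The tile size, the `epilogue(value, i, j)` callback signature, and the bias-plus-ReLU example are assumptions chosen for illustration; reductions and the other primitives named in the abstract are not shown.

```python
def matmul(a, b):
    """Plain (unfused) matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Compute epilogue(a @ b) one output tile at a time.

    Each tile is accumulated in a local buffer (standing in for on-chip storage),
    transformed by `epilogue`, and only then written to the output matrix.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            acc = {(i, j): sum(a[i][p] * b[p][j] for p in range(k)) for i in rows for j in cols}
            for (i, j), value in acc.items():
                out[i][j] = epilogue(value, i, j)  # fused: no intermediate matrix is stored
    return out

bias = [0.5, -4.0, 0.0]

def bias_relu(value, i, j):
    """Example elementwise epilogue: add a per-column bias, then apply ReLU."""
    return max(value + bias[j], 0.0)

a = [[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]]
b = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]
fused = gemm_with_epilogue(a, b, bias_relu)
# Unfused reference: materialize a @ b first, then apply the same operation.
unfused = [[bias_relu(v, i, j) for j, v in enumerate(row)] for i, row in enumerate(matmul(a, b))]
```

The fused and unfused paths produce the same matrix; the difference on real hardware is that the fused path avoids a round trip of the intermediate product through global memory.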

2605.18991 2026-05-21 cs.CR cs.AI

Agent Security is a Systems Problem

智能体安全是系统问题

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, Kamalika Chaudhuri, Xiaohan Fu, Khawaja Shams, Guy Amir, Jihye Choi, Sarthak Choudhary, Nils Palumbo, Andrey Labunets, Nishit V. Pandya

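AI summary: This paper argues that agent security should be addressed as a systems problem, with safety ensured through security invariants enforced at the system level rather than through model robustness alone. Drawing on techniques from systems security, it proposes core principles for designing agentic systems with predictable security guarantees and analyzes real-world attacks and the challenges of realizing these principles.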

详情
AI中文摘要

我们主张智能体安全必须作为系统问题来处理:驱动智能体的AI模型必须被视为不可信的组件,系统层面必须强制实施安全不变量。通过这一视角,单纯提高模型鲁棒性(社区中的主流观点)是不够的。相反,我们必须将现有努力与系统安全领域的技术相结合。基于我们在操作系统、网络、形式化方法和对抗机器学习领域的经验,我们提出了一套基于数十年系统安全研究的核心原则,为设计具有可预测安全保证的智能体系统提供基础。作为证据,我们分析了十一个代表性的现实世界攻击案例,并讨论了如果系统原则得以实现,这些攻击将如何被防止。我们还识别了在智能体中实现这些原则所面临的科研挑战。

英文摘要

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

2605.17618 2026-05-21 cs.AI

Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors

使用可穿戴传感器预测课堂环境中与深度自闭症相关的行为问题

Yadhu Kartha, Conor Anderson, Jenny Foster, Theresa Hamlin, Johanna Lantz, Ryan Lay, Juergen Hahn, Gari D. Clifford, Hyeokhyen Kwon

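AI summary: This study uses wearable sensors and machine learning to predict challenging behaviors of individuals with profound autism in a real classroom, demonstrating that such behaviors can be predicted 10 minutes in advance in an educational setting, with an AUC-ROC of 0.78.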

详情
AI中文摘要

自闭症谱系障碍(ASD)以社交互动和沟通困难以及思维和行为的限制或重复模式为特征,表现具有显著变化性。大约四分之一的ASD儿童被归类为深度自闭症,这些患者常常表现出自我伤害行为、攻击性、逃跑或口欲症等具有严重安全风险的行为,这些行为会干扰教育环境中的学习。先前的工作已应用可穿戴传感器和机器学习来检测这些行为,但大多局限于受控的实验室环境。本工作证明了在真实世界特殊教育课堂中预测这些行为事件是可行的。我们收集了约110.7小时的标记多模态可穿戴数据,包括加速度计、电导活动(EDA)和皮肤温度数据,来自10至21岁的9名儿童和年轻人,在标准课堂会话中。我们微调了最先进的多模态可穿戴时间序列分析基础模型,并展示了可以提前10分钟预测行为事件,AUC-ROC为0.78。这些结果为开发能够帮助教师减少特殊教育课堂中行为问题安全风险的主动干预系统奠定了坚实的基础。

英文摘要

Autism Spectrum Disorder (ASD) is characterized by challenges with social interaction and communication and by restricted or repetitive patterns of thought and behavior, with significant variability in presentation. Approximately a quarter of children with ASD are classified as having profound autism; they often exhibit challenging behaviors, such as self-injurious behavior, aggression, elopement, or pica, that pose serious safety risks and disrupt learning in educational settings. Prior work has applied wearable sensors and machine learning to detect challenging behaviors, but has been largely confined to controlled laboratory environments. This work demonstrates that predicting challenging behavior episodes is feasible in a real-world special education classroom. We collected approximately 110.7 hours of labeled multimodal wearable data comprising accelerometry, electrodermal activity (EDA), and skin temperature from 9 children and young adults aged 10 to 21 years across standard classroom sessions. We fine-tuned state-of-the-art foundation models for multimodal wearable time-series analysis and show that challenging behavior episodes can be predicted up to 10 minutes in advance with an AUC-ROC of 0.78. These results establish a concrete foundation for developing proactive in-class intervention systems that enable teachers to minimize the safety risks of challenging behaviors in special education classrooms.

2605.15156 2026-05-21 cs.CL cs.AI cs.LG

MeMo: Memory as a Model

MeMo:记忆作为模型

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

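AI summary: This paper proposes MeMo, a framework that encodes new knowledge into a dedicated memory model without changing the LLM's parameters, addressing applications that need timely, domain-specific information. It handles complex cross-document relationships, is robust to retrieval noise, avoids catastrophic forgetting, needs no access to LLM weights or output logits, and has a retrieval cost independent of corpus size.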

详情
Comments
MeMo augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise
AI中文摘要

大型语言模型(LLMs)在广泛的任务上表现出色,但预训练后保持冻结状态,直到后续更新。许多现实应用需要及时、领域特定的信息,这促使需要高效的机制来整合新知识。在本文中,我们介绍MeMo(Memory as a Model),一个模块化框架,能够将新知识编码到专用的记忆模型中,同时保持LLM参数不变。与现有方法相比,MeMo具有几个优势:(a)它能够捕捉复杂的跨文档关系;(b)它对检索噪声具有鲁棒性;(c)它避免了LLM中的灾难性遗忘;(d)它不需要访问LLM的权重或输出logits,从而能够与开源和专有闭源LLM进行即插即用式集成;(e)其检索成本在推理时间与语料库大小无关。我们在三个基准测试集BrowseComp-Plus、NarrativeQA和MuSiQue上的实验结果表明,MeMo在多种设置中相比现有方法表现优异。

英文摘要

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

2605.12597 2026-05-21 cond-mat.dis-nn cond-mat.stat-mech cs.AI cs.LG physics.comp-ph

The critical slowing down in diffusion models

扩散模型中的临界减慢现象

Luca Maria Del Bono, Giulio Biroli, Patrick Charbonneau, Marylou Gabrié

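AI summary: This paper studies diffusion models applied to the O(n) model of statistical field theory, revealing critical slowing down in parameter learning during training. By introducing a local score approximation, it shows that appropriate architectural design can overcome this slowdown, providing a controlled framework for improving sampling methods in statistical physics.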

详情
Comments
17 pages, 8 figures
AI中文摘要

计算采样自20世纪中叶以来一直是科学的核心。尽管基于机器学习的方法最近取得了重大进展,但其行为仍缺乏深入理解,理论上对何时以及为何成功控制有限。本文通过分析扩散模型在统计场理论O(n)模型的高斯极限n→∞下的应用,提供了对扩散模型的深入见解。在这一可分析的设置中,我们展示了训练一个具有单层网络架构的得分模型时,参数学习会出现临界减慢现象。这种减慢也影响生成过程,表明即使对于学习生成模型,接近临界点的采样困难仍然存在。为克服这一瓶颈,我们展示了通过结合架构深度与物理局部性可以提升性能。我们发现使用双层架构可以显著减少临界减慢,训练时间与系统规模的关系从二次方变为对数。通过引入局部得分近似,我们证明这种训练时间的加速可以在不增加神经网络参数数量的情况下实现。总体而言,这些结果表明扩散模型可以通过适当的架构设计克服临界减慢现象,并为统计物理及其他领域中的学习采样方法建立了可控的改进框架。

英文摘要

Computational sampling has been central to the sciences since the mid-20th century. While machine-learning-based approaches have recently enabled major advances, their behavior remains poorly understood, with limited theoretical control over when and why they succeed. Here we provide such insight for diffusion models, a class of generative schemes highly effective in practice, by analyzing their application to the $O(n)$ model of statistical field theory in the Gaussian limit $n \to \infty$. In this analytically tractable setting, we show that training a score model with a one-layer network architecture matching the exact solution exhibits a form of critical slowing down in parameter learning. This slowing down also impacts the generation process, indicating that the well-known difficulties of sampling near criticality persist even for learned generative models. To overcome this bottleneck, we demonstrate the power of combining architectural depth with physical locality. We find that using a two-layer architecture drastically reduces the critical slowing down, with the training time scaling logarithmically rather than quadratically with system size. By introducing a local score approximation, we show that this acceleration in training time can be achieved without increasing the number of neural network parameters. Taken together, these results demonstrate that diffusion models can overcome the critical slowing down through appropriate architectural design, and establish a controlled framework for understanding and improving learned sampling methods in statistical physics and beyond.