arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12497 2026-06-12 cs.LG cs.RO New submission

$\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev

Affiliations: CogAI Lab, Moscow, Russia; MIRAI, Moscow, Russia

AI summary: To address the lack of memory in VLA models under partial observability, the paper proposes a minimal recurrent-memory augmentation that uses only learnable memory tokens and truncated backpropagation through time (TBPTT). On MIKASA-Robo it raises the average success rate on training tasks from 0.42 to 0.84, and on LIBERO it preserves performance under full observability.

Comments
34 pages, 20 figures, 9 tables

Abstract

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $\mu$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $\mu$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in this https URL.
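
The recurrence described in this abstract (a few memory tokens carried across timesteps, updated through self-attention, with a detached EMA as one update rule) can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the identity query/key/value projections, the EMA coefficient `alpha`, and the toy observations are all hypothetical. Plain Python has no gradients, so TBPTT is not shown. In an autodiff framework, the detached variant would stop gradients at the memory carried between steps.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(d)]

def step_memory(memory, obs_tokens, alpha=0.9):
    """One recurrent step (assumed form): each memory token attends over
    [memory + observation tokens], then is blended with its old value by an EMA."""
    context = memory + obs_tokens
    proposed = [attend(m, context, context) for m in memory]
    return [[alpha * old + (1 - alpha) * new for old, new in zip(m, p)]
            for m, p in zip(memory, proposed)]

# Hypothetical episode: m = 2 memory tokens of width 4, one observation token per step.
memory = [[0.0] * 4 for _ in range(2)]
episode = [[[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0, 0.0]]]
for obs_tokens in episode:
    memory = step_memory(memory, obs_tokens)
```

After the loop, the memory still carries a trace of the first observation even though it is no longer part of the input, which is the property partial observability calls for.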

2606.12495 2026-06-12 cs.SD New submission

Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

Peng Jia, Li Dai, Jia Li, Zhenzhen Hu, Ye Zhao, Richang Hong

Affiliations: Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province

AI summary: Proposes the MRAF framework, which uses a learnable missing token and reliability-aware cross-attention fusion to address cross-lingual generalization and robustness to missing faces in polyglot scenarios, achieving high accuracy on the POLY-SIM 2026 test set.

Comments
8 pages, 3 figures, 4 tables

Abstract

Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at this https URL.
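
The two mechanisms named in this abstract, substituting a learnable missing token for an absent face input and scaling token representations by normalized reliability weights, can be sketched as follows. This is a hedged illustration only: the abstract does not say how the scores are normalized, so the softmax here is an assumption, and the function names and example values are hypothetical.

```python
import math

def modality_weights(face_score, audio_score):
    """Normalize two reliability scores into weights that sum to 1 (softmax assumed)."""
    peak = max(face_score, audio_score)
    exp_face = math.exp(face_score - peak)
    exp_audio = math.exp(audio_score - peak)
    total = exp_face + exp_audio
    return exp_face / total, exp_audio / total

def prepare_tokens(face_token, audio_token, missing_token, face_score, audio_score):
    """Use the missing token when no face is available, then scale each
    modality's token by its weight before any cross-attention is applied."""
    if face_token is None:
        face_token = missing_token  # trainable stand-in instead of a zero vector
    w_face, w_audio = modality_weights(face_score, audio_score)
    return [w_face * x for x in face_token], [w_audio * x for x in audio_token]

# Hypothetical missing-face sample: the face score is low, so audio dominates.
face, audio = prepare_tokens(None, [1.0, 1.0], [0.5, -0.5], face_score=-2.0, audio_score=1.0)
```

Because the missing token is an ordinary vector, the downstream fusion sees the same token shape whether or not a face was captured.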

2606.12494 2026-06-12 cs.LG New submission

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

Guangyu Wang, Zhaonan Wang

Affiliations: NYU Shanghai

AI summary: Proposes Net-Ev$^2$, a generative simulator that combines event cues with network topology and simulates network event evolution through structure-guided masked pre-training and a topology-aware diffusion process, achieving state-of-the-art performance on multiple road-network datasets.

Comments
Accepted by KDD 2026 Research Track

Abstract

Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. Existing approaches fall short of jointly modeling the structured attributes and unstructured semantics of events and of capturing topological structure when simulating network event evolution. We therefore propose Net-Ev$^2$ ($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages: structure-guided masked pre-training and a topology-aware diffusion process, the latter realized by U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev$^2$ can generate simulations from natural-language event input alone, which gives greater flexibility in practical use. Furthermore, we introduce Net-Ev$^2$-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev$^2$. Code is made available at this https URL.

2606.12490 2026-06-12 cs.LG New submission

Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

Li-Jen Lin, Chih-Duo Hong

Affiliations: National Science and Technology Council (NSTC), Taiwan

AI summary: Proposes an abstraction-refinement framework that splits pre-activation intervals to eliminate nonlinear relaxation error and uses a SHAP-guided timestep selection strategy to reduce the combinatorial cost, substantially improving the success rate of RNN robustness verification.

Abstract

Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.
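
The claim that ReLU becomes exact on each refined branch can be checked with a small worked example. The sketch assumes the standard triangle relaxation, whose upper bound is the line through (l, 0) and (u, u); the paper's actual bounds may differ, and the function names are hypothetical. For an interval [l, u] that crosses zero, the largest gap between that line and ReLU is -l·u/(u - l), attained at x = 0. After splitting at zero, both halves have zero gap.

```python
def relu_relaxation_gap(lower, upper):
    """Largest vertical gap between the triangle relaxation's upper line
    and ReLU on the pre-activation interval [lower, upper]."""
    if lower >= 0 or upper <= 0:
        return 0.0  # no zero crossing: ReLU is linear on the interval, so it is exact
    return -lower * upper / (upper - lower)  # gap of the upper line at x = 0

def split_at_zero(lower, upper):
    """Refine a zero-crossing interval into its negative and positive branches."""
    return [(lower, 0.0), (0.0, upper)]

# Hypothetical pre-activation interval that crosses zero.
gap_before = relu_relaxation_gap(-1.0, 1.0)
gaps_after = [relu_relaxation_gap(lo, hi) for lo, hi in split_at_zero(-1.0, 1.0)]
```

Each split doubles the number of branches at that neuron, which is why the abstract pairs refinement with a strategy for choosing only the most critical timesteps.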

2606.12488 2026-06-12 cs.LG New submission

A Stationary (and Therefore Compatible) Representation is All You Need

Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo

Affiliations: Media Integration and Communication Center (MICC), Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze

AI summary: Proves that stationary representations learned by a d-Simplex fixed classifier satisfy the definition of compatibility, and that a convex combination of the cross-entropy and contrastive losses captures higher-order dependencies, enabling retrieval services that need no reprocessing when the model is updated.

Comments
Accepted to TPAMI 2026. Extension of the CVPR 2024 version (arXiv:2405.02581)

Abstract

Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by $d$-Simplex fixed classifiers imply compatibility in the sense of its formal definition. This result establishes a foundation for future work and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using $d$-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a $d$-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the level of first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a $d$-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments, also considering a new scenario where a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art results. Code at this https URL.
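
A fixed simplex classifier keeps its class prototypes frozen at the vertices of a regular simplex, so every pair of classes is separated by the same angle. The sketch below builds such prototypes from centered, normalized one-hot vectors. This is one standard construction, chosen here for brevity, and it is an assumption that it matches the paper's coordinates: it places K prototypes in a (K - 1)-dimensional subspace of R^K, whereas the authors may use a different embedding. The function names are hypothetical.

```python
import math

def simplex_classifier(num_classes):
    """Unit-norm class prototypes at the vertices of a regular simplex,
    built by centering and normalizing the one-hot vectors."""
    k = num_classes
    rows = []
    for i in range(k):
        row = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row])
    return rows

def cosine(u, v):
    """Cosine similarity for vectors that are already unit norm."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 5-class problem: the prototype matrix is fixed, never trained.
prototypes = simplex_classifier(5)
pairwise = cosine(prototypes[0], prototypes[1])
```

For K classes every pairwise cosine equals -1/(K - 1). Because the prototypes never move, features learned against them by successive model versions are pulled toward the same fixed directions.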

2606.12487 2026-06-12 cs.LG New submission

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao

Affiliations: City University of Hong Kong; Zhejiang University of Technology

AI summary: Proposes DynamicPTQ, which analyzes the phase-wise dynamics of activations in the residual stream to identify quantization-sensitive layers and assign them 8-bit precision. Under W4A4KV4 quantization it improves perplexity and zero-shot QA performance on LLaMA-2/3, with a 1.05 to 1.07 times throughput improvement.

Abstract

Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
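
Why a single massive activation hurts 4-bit quantization, and why raising a sensitive layer to 8 bits helps, can be seen in a toy example. The sketch uses plain symmetric per-tensor quantization, which is an assumption and not the DynamicPTQ policy itself. The paper's Jump Ratio and Historical Feature SNR are not reproduced because the abstract does not define them. All names and numbers are hypothetical.

```python
def fake_quantize(values, bits):
    """Symmetric per-tensor quantize-dequantize; the largest magnitude sets the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # guard against all-zero input
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute reconstruction error after fake quantization."""
    approx = fake_quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, approx)) / len(values)

activations = [0.3, -0.7, 0.5, 0.1, -0.2]  # hypothetical ordinary activations
with_outlier = activations + [60.0]        # the same tensor plus one massive activation

error_clean_4bit = mean_abs_error(activations, 4)
error_outlier_4bit = mean_abs_error(with_outlier, 4)
error_outlier_8bit = mean_abs_error(with_outlier, 8)
```

With the outlier present, the 4-bit scale becomes so coarse that every ordinary value rounds to zero. The 8-bit grid recovers part of that resolution, which is the intuition behind spending extra precision only on the layers where such values appear.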

2606.12486 2026-06-12 cs.LG New submission

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank

Affiliations: SnT, University of Luxembourg; Scania CV AB

AI summary: For truck fleets, proposes a condition-based predictive maintenance method that models the wear state as a monotonically non-decreasing time series, selects the most recent observations, converts them to tabular data, and uses AutoML to simplify modeling, reducing costs on the Scania Component X dataset.

Abstract

Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.
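
The transformation described in this abstract, keeping only the most recent observations of a wear series and laying them out as one tabular row, might look like the sketch below. The window length, the column naming, and the left-padding of short series are illustrative assumptions. The abstract does not specify any of them, and the function names are hypothetical.

```python
def is_non_decreasing(readings):
    """Check the modeling assumption that wear never decreases over time."""
    return all(a <= b for a, b in zip(readings, readings[1:]))

def to_tabular_row(readings, window):
    """Keep the last `window` observations of a non-empty series and flatten
    them into one row, with column t-0 holding the newest reading."""
    recent = readings[-window:]
    # Assumption: series shorter than the window repeat their earliest value.
    padded = [recent[0]] * (window - len(recent)) + recent
    return {f"t-{window - 1 - i}": value for i, value in enumerate(padded)}

# Hypothetical cumulative wear counter for one vehicle.
wear = [1, 2, 3, 5, 8]
row = to_tabular_row(wear, window=3)
```

Rows of this shape, one per vehicle, can be passed to any classifier built for tabular data, which is what makes an AutoML tool applicable.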

2606.12485 2026-06-12 cs.LG cs.AI New submission

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

Affiliations: Beihang University; Institute of Software, Chinese Academy of Sciences; The Hong Kong University of Science and Technology; Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Peking University

AI summary: Proposes the Speculative Rollback Correction (SRC) framework, which uses fixed-horizon branch review and a rollback mechanism to reduce teacher queries while preserving trajectory diversity, collecting 977 verifier-passing trajectories and 9,183 next-action examples on WebArena-Infinity.

Abstract

Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at this https URL.

2606.12481 2026-06-12 cs.LG cs.AI New submission

Representing Time Series as Structured Programs for LLM Reasoning

Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee

Affiliations: Korea University; Mila, University of Montreal

AI summary: Proposes T2SP, which decomposes a time series into trends, periods, and salient events and represents it as a structured symbolic program, letting LLMs reason efficiently without fine-tuning and outperforming raw-sequence representations on editing, captioning, and question answering.

Comments
Preprint

Abstract

Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks -- editing, captioning, and question answering -- where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.
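
A deterministic, training-free conversion from a numeric series to a short symbolic description could be sketched as below. This is not the T2SP algorithm or its output format, neither of which the abstract specifies. The least-squares trend, the autocorrelation-based period estimate, the residual threshold for events, and the `trend(...)`, `period(...)`, `event(...)` syntax are all hypothetical stand-ins for the three components the abstract names.

```python
def linear_trend(series):
    """Least-squares slope and intercept of the series against its time index."""
    n = len(series)
    mean_t, mean_y = (n - 1) / 2, sum(series) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_t

def dominant_period(residual, max_lag):
    """Lag between 2 and max_lag with the largest unnormalized autocorrelation."""
    def autocorr(lag):
        return sum(a * b for a, b in zip(residual, residual[lag:]))
    return max(range(2, max_lag + 1), key=autocorr)

def to_program(series, max_lag=12, event_threshold=3.0):
    """Render trend, period, and outlying points as a small text program."""
    slope, intercept = linear_trend(series)
    residual = [y - (intercept + slope * t) for t, y in enumerate(series)]
    period = dominant_period(residual, max_lag)
    spread = (sum(r * r for r in residual) / len(residual)) ** 0.5 or 1.0
    events = [t for t, r in enumerate(residual) if abs(r) > event_threshold * spread]
    lines = [f"trend(slope={slope:.3f}, intercept={intercept:.3f})",
             f"period(length={period})"]
    lines += [f"event(t={t}, value={series[t]:.3f})" for t in events]
    return "\n".join(lines)

# Hypothetical series: a linear ramp plus a cycle of length 4.
series = [0.5 * t + [0.0, 1.0, 0.0, -1.0][t % 4] for t in range(24)]
program = to_program(series)
```

The resulting text is a few lines long regardless of how many samples the series has, which suggests why such a representation can shorten prompts for long sequences.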

2606.12479 2026-06-12 cs.LG cs.AI New submission

ReCal: Reward Calibration for RL-based LLM Routing

Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu

Affiliations: Zhejiang University; Ant Group; Shanghai AI Laboratory

AI summary: Proposes the ReCal framework, which calibrates reward signals through hierarchical reward decomposition and distribution-aware optimization, addressing multi-objective conflicts and optimization bias across heterogeneous tasks and improving LLM routing performance and stability.

Abstract

Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose \textbf{ReCal}, a \textbf{\underline{Re}}ward \textbf{\underline{Cal}}ibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at this https URL.
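
One plausible reading of component-wise advantage estimation is to standardize each reward component separately across a group of rollouts before combining them, so that a component with a large raw scale cannot drown out the others. The sketch below illustrates only that reading. The abstract gives no formulas, so the group-wise standardization, the unweighted sum, and all names and numbers are assumptions rather than the ReCal method.

```python
def normalize(values):
    """Standardize a list of rewards to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + 1e-8) for v in values]

def componentwise_advantages(rewards_by_component):
    """Normalize each reward component across the rollout group, then sum
    the per-component advantages for every rollout."""
    normalized = [normalize(values) for values in rewards_by_component.values()]
    return [sum(parts) for parts in zip(*normalized)]

# Hypothetical group of four rollouts with two reward components.
rewards = {
    "correctness": [1.0, 0.0, 1.0, 0.0],
    "format": [0.1, 0.1, 0.2, 0.2],
}
advantages = componentwise_advantages(rewards)
```

Had the raw rewards been summed first, the small format differences would have been almost invisible next to correctness. After per-component standardization, both contribute on the same scale.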

2606.12476 2026-06-12 cs.LG cs.AI cs.CL New submission

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

Igor Itkin

Affiliations: Independent Researcher

AI summary: Formulates hallucination onset detection as a quickest change detection problem. Using a first-order Markov model validated on RAGTruth, a learned CUSUM achieves a detection delay of 11-13 tokens at a matched false-alarm rate, outperforming a linear baseline, and the analysis exposes a delay structure that classification metrics conceal.

Comments
14 pages, 1 figure

Abstract

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
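
The classical CUSUM recursion behind this framing is short enough to state directly: the statistic accumulates per-token increments, is clipped at zero, and raises an alarm once it reaches a threshold. In the sketch below the increments are hand-picked numbers standing in for the learned per-token score, and the threshold, the onset position, and the delay convention (alarm index minus onset index) are illustrative assumptions.

```python
def cusum_alarm(increments, threshold):
    """Index of the first token at which the CUSUM statistic reaches the
    threshold, or None if it never does."""
    stat = 0.0
    for t, inc in enumerate(increments):
        stat = max(0.0, stat + inc)  # reset to zero whenever evidence turns negative
        if stat >= threshold:
            return t
    return None

def detection_delay(increments, threshold, onset):
    """Tokens elapsed from the true onset to the alarm (assumed convention);
    None for a missed detection or an alarm raised before the onset."""
    alarm = cusum_alarm(increments, threshold)
    if alarm is None or alarm < onset:
        return None
    return alarm - onset

# Hypothetical stream: ten faithful tokens, then a hallucination starting at index 10.
increments = [-0.5] * 10 + [1.0] * 10
alarm_index = cusum_alarm(increments, threshold=3.0)
delay = detection_delay(increments, threshold=3.0, onset=10)
```

Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the trade-off the delay bounds in the abstract quantify.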

2606.12475 2026-06-12 cs.RO New submission

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow

Affiliations: University of Wisconsin–Madison; Massachusetts Institute of Technology

AI summary: Studies human-robot collaboration with vision-language-action (VLA) models trained by imitation learning, finds that action-chunking policies suffer from demonstration action leakage in implicit collaboration, proposes an inference-time steering method that mitigates premature assistive behavior, and validates it with a user study.

Abstract

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.

2606.12473 2026-06-12 cs.CV New submission

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

Shreyas Narasimhiah Ramesh, P. D. Rathika, Mahasweta Sarkar, Kristen Wells, Michel Audette, Christopher Paolini

Affiliations: San Diego State University; PSG College of Technology; Old Dominion University

AI summary: Proposes a low-power, portable stereo-vision fall prediction and detection system on the AMD Kria K26 SOM. A three-stage pipeline of quantized YOLOX, A2J, and CNN models provides real-time, privacy-preserving fall detection, and the multi-threaded version reaches 4.5 FPS.

Comments
19 pages; 31 figures

Abstract

Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. The design used a single-core DPU with a serial pipeline and a dual-core DPU running YOLOX and A2J with multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%, respectively. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.

2606.12451 2026-06-12 cs.AI cs.IR cs.LG New submission

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

ToolSense: 审计LLM中参数化工具知识的诊断框架

Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

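Affiliations * SAP Labs

AI summary: Proposes ToolSense, a diagnostic framework that automatically generates three types of benchmarks, reveals a knowledge-retrieval dissociation in parametric tool retrieval, and finds that model performance drops significantly on ambiguous queries.

Details
AI Chinese abstract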

作为大型工具目录上的代理部署的大型语言模型面临关键的工具检索瓶颈。由于基于嵌入的检索方法依赖于可能无法充分捕获专用工具语义的紧凑编码器,参数化工具检索通过将每个工具编码为附加到LLM词汇表的虚拟令牌来解决这一问题,经过两个阶段(记忆然后检索SFT)的微调,将LLM用作检索器,在标准ToolBench检索基准上取得了强劲性能。然而,这些基准使用冗长、完全指定的查询,并且其评估应用了将输出限制为有效令牌路径的约束解码,这并不能揭示模型是否真正理解其工具。我们引入了\textbf{ToolSense},一个开源LLM驱动的诊断框架,它将任何工具目录作为输入,并自动生成三个基准:具有三个模糊级别查询的现实检索基准(RRB)、MCQ探测基准和QA探测基准。将ToolSense应用于ToolBench(约47k个工具)并评估五个参数化模型训练配置,揭示了知识-检索分离:在RRB查询上,与完全指定的ToolBench基准相比,几个配置下降了约50-64个百分点,低于嵌入模型基线。此外,尽管检索性能强劲,一些模型在事实探测上得分接近随机,表明存在知识-检索分离。我们在https://this URL上开源了ToolSense框架和ToolBench诊断基准。

English abstract

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, suggesting a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at this https URL.

2606.12443 2026-06-12 cs.CY cs.AI cs.CL New submission

Occupational Prompting Reveals Cultural Bias in Large Language Models

职业提示揭示大型语言模型中的文化偏见

Maksim E. Eren, Andrea Brennen, Ryan C. Barron, Eric Michalak

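Affiliations * U.S. Government

AI summary: Replaces nationality prompts with occupational prompts (e.g., accountant, teacher) to study open-weight LLM responses to value surveys; finds that different occupations shift responses within the cultural map, indicating that occupational roles elicit structured value patterns.

Details
AI Chinese abstract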

社会角色塑造期望、优先级和判断,但大型语言模型(LLM)如何将职业身份与更广泛的文化价值模式关联仍不清楚。先前工作使用基于国籍的文化提示来研究LLM对价值观调查问题的响应如何与人类文化基准对齐。本文通过用职业提示替代文化提示,扩展了该框架,以检查职业角色线索如何影响开源LLM的价值观调查响应。使用基于综合价值观调查问题的调查评估流程,我们将模型响应投影到二维Inglehart-Welzel文化空间。我们提示开源LLM以职业身份(如会计师、教师、工程师和护士)回答问题,然后分析这些职业条件化响应在文化地图上的位置。结果表明,当用职业而非国籍身份提示开源LLM时,其响应仍位于文化地图的广泛西方倾向区域。然而,不同职业在该区域内引入偏移,产生不同的职业偏差。这表明职业提示并非被视为中性角色标签,而是引发结构化价值模式。这些发现将基于调查的文化偏见评估扩展到国籍提示之外,并提供了研究职业角色如何塑造LLM中价值表达的框架。

English abstract

Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart--Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.

2606.12442 2026-06-12 cs.CY cs.AI New submission

Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It

重新定义AI失控:它是什么,如何拥有,如何失去

Ze Shen Chin, Maurice Chiodo, Dennis Müller, Coleman Snell

Affiliations * Oxford Martin AI Governance Initiative AI Standards Lab; Centre for the Study of Existential Risk, University of Cambridge; Institute of Mathematics Education, University of Cologne; Cornell University

AI summary: Establishes a working definition of control by anchoring it to the "setting and getting of goals", examines how control can be lost and how AI can contribute to that loss, and offers recommendations for maintaining control.

Details
Comments
56 pages
AI Chinese abstract

目前,失控风险在公众讨论中备受关注,尤其是在AI领域,学术界、前沿实验室甚至政府都进行了广泛讨论。然而,在现有文献中,这一概念的基础似乎出奇地薄弱,即使是那些广泛讨论失控的人,也没有首先确立什么是控制以及究竟失去了什么。本文旨在解决这些空白。我们将控制锚定于“设定和获取目标”,从而建立控制的工作定义。然后,我们基于控制论、管理控制和控制理论等相关领域的基础概念,讨论控制的各个方面。这包括谁(或什么)可以处于控制之中,以及他们需要什么才能处于控制之中,例如设定目标的能力、拥有功能性的控制回路、具备必要的多样性以及足够的目标对齐。一旦建立了控制框架,我们将讨论控制如何被失去,AI如何导致这种失控,并提供关于如何保持控制的相关建议。我们工作的一个有趣结果是,人类作为个体和群体,可能因远低于超级智能水平的AI行为而失去不同程度的控制;失控情景(如我们所定义的)的可能性已经存在,并且已经存在了很长时间。

English abstract

At present, loss of control risks have gained much prominence in public discussion, particularly in relation to AI, with extensive discourse present among academics, frontier labs, and even governments. However, in the existing literature, the concept seems to rest on surprisingly weak foundations, where even those who discuss loss of control extensively do not first establish what control is and what exactly is being lost. Our paper aims to address these gaps. We establish a working definition of control by anchoring it to the "setting and getting of goals". Then, we discuss various aspects of control, built on foundational concepts from related fields like cybernetics, management control, and control theory. This includes who (or what) can be in control, and the things they require to be in control, such as the ability to set goals, having a functional control loop, having requisite variety, and having sufficient goal alignment. Once a framework for control is established, we then discuss how control can be lost, how AIs can contribute to such loss of control, and offer relevant recommendations for how one can maintain control. One interesting consequence of our work is that humanity, as individuals and as groups, can lose varying degrees of control as a result of AI behaviour that is far below the level of superintelligence; the potential for loss of control scenarios (as we define them) already exists, and has existed for a long time.

2606.12439 2026-06-12 cs.CY cs.AI New submission

Position: Generative Engine Optimization Creates Underexamined Risks, Governance Must Target Concentration, Disclosure, and Academic Blind Spots

立场:生成式引擎优化带来未被充分研究的风险,治理必须聚焦于集中化、披露和学术盲点

Yizhu Wen, Nan Zhang, Haohan Yuan, Xun Chen, Haopeng Zhang, Hanqing Guo

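Affiliations * GitHub

AI summary: Analyzes the shift from search engine optimization to generative engine optimization, identifies three risks (concentrated influence, undisclosed commercial influence, and academic-industry blind spots), and argues for answer-level governance and measurement.

Details
Comments
This paper is accepted by the ICML 2026 Position Track
AI Chinese abstract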

大型语言模型(LLM)答案引擎越来越多地被用于信息搜索,将可见性从排名列表转变为合成答案。这使得生成式引擎优化(GEO)成为可能,它针对LLM答案引擎的证据池和生成过程。我们分析了从搜索引擎优化(SEO)到GEO的转变,识别出两个风险:(i)由于低可争议性和系统敏感性导致的集中化影响,以及(ii)嵌入在证据和推理中的未披露的商业影响。然后,我们形式化了一个通用的GEO管道,以定位优化行为发生的位置,并比较学术和工业实践,揭示了第三个风险:(iii)由离线设置和部署系统之间的可见性和评估不对称性驱动的学术-工业盲点。这一立场主张需要答案级别的治理和测量:更强的可争议性、高精度披露、对实质性影响的黑盒审计,以及用于暴露持久性的部署对齐指标。

English abstract

Large language model (LLM) answer engines are increasingly used for information seeking, shifting visibility from ranked lists to synthesized answers. This enables Generative Engine Optimization (GEO), which targets LLM answer engines' evidence pool and generation. We analyze the search engine optimization (SEO) to GEO transition to identify two risks: (i) concentrated influence from low contestability and system sensitivity, and (ii) undisclosed commercial influence embedded in evidence and reasoning. We then formalize a general GEO pipeline to locate where optimization acts and compare academic and industry practices, revealing a third risk: (iii) academic-industry blind spots driven by visibility and evaluation asymmetries between offline setups and deployed systems. This position argues the need for answer-level governance and measurement: stronger contestability, high-precision disclosure, black-box auditing of material influence, and deployment-aligned metrics for exposure persistence.

2606.12435 2026-06-12 cs.CY cs.DB cs.LG New submission

Auditing Discriminatory Patterns in Mortgage Lending Through Association Rules and Fair Binning

通过关联规则和公平分箱审计抵押贷款中的歧视性模式

Archit Rathod, Dhwani Chande, Het Nagda

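Affiliations * University of Illinois Chicago

AI summary: Studies whether standard binning preprocessing amplifies racial and gender disparities in mortgage lending; builds a three-stage pipeline on HMDA data and finds that fair binning is achievable at a Price of Fairness of 29.4%, while K-Means clustering reveals significantly higher denial rates for Black applicants.

Details
Comments
10 pages, 4 figures, fairness-aware mortgage lending analysis using HMDA 2023 data. Project repository available at GitHub
AI Chinese abstract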

美国的抵押贷款表现出持续的种族和性别差异。我们研究标准数据预处理步骤,特别是属性分箱,是否在下游模式挖掘中放大这些差异。使用来自HMDA 2023数据集(芝加哥大都市区)的103,481份清理后的抵押贷款申请,我们构建了一个三阶段流水线:(1)PySpark数据清理和分箱流水线,实现标准等频分箱和Asudeh等人[1]的ε偏置公平分箱算法;(2)FP-Growth关联规则挖掘,比较两种分箱制度下的拒绝模式;(3)K-Means聚类及每簇差异影响审计。我们的标准分箱在收入离散化中显示9.63%的种族偏差,与先前工作中报告的8-10%一致。使用七个种族组的公平分箱在ε=0.03时不可行,仅在ε=0.08时成功,公平代价为29.4%。FP-Growth揭示高债务收入比是主要的拒绝预测因子(置信度67.2%,提升度2.81),而种族偏差未表现为显式的高支持度规则。然而,K-Means聚类后进行差异影响审计标记了45个簇-组对中的10个,表明即使在财务相似的群体中,黑人申请者的拒绝率也显著高于白人申请者。

English abstract

Mortgage lending in the United States exhibits persistent racial and gender disparities. We investigate whether standard data preprocessing steps, specifically attribute binning, amplify these disparities in downstream pattern mining. Using 103,481 cleaned mortgage applications from the HMDA 2023 dataset (Chicago metropolitan area), we build a three-stage pipeline: (1) a PySpark data cleaning and binning pipeline that implements both standard equal-frequency binning and the epsilon-biased fair binning algorithm from Asudeh et al. [1], (2) FP-Growth association rule mining that compares denial patterns under both binning regimes, and (3) K-Means clustering with a per-cluster disparate impact audit. Our standard binning shows 9.63% racial bias in income discretization, consistent with the 8-10% reported in prior work. Fair binning with seven race groups is infeasible at epsilon=0.03 and only succeeds at epsilon=0.08 with a Price of Fairness of 29.4%. FP-Growth reveals that high debt-to-income ratio is the dominant denial predictor (67.2% confidence, 2.81 lift), while racial bias does not appear as explicit high-support rules. However, K-Means clustering followed by a disparate impact audit flags 10 out of 45 cluster-group pairs, showing that Black applicants face significantly higher denial rates than White applicants even among financially similar groups.
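The rule statistics quoted above (confidence and lift) and a per-cluster disparate impact ratio can be computed from simple counts. The sketch below is a plain-Python illustration rather than the authors' PySpark pipeline; the function names and all counts are hypothetical, and the four-fifths (0.8) flagging threshold is an assumed convention that the abstract does not state.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


def disparate_impact(approved_group, total_group, approved_ref, total_ref):
    """Ratio of a group's approval rate to the reference group's approval rate."""
    return (approved_group / total_group) / (approved_ref / total_ref)


# Hypothetical counts for a rule such as {high debt-to-income} -> {denied}.
support, confidence, lift = rule_metrics(
    n_total=1000, n_antecedent=200, n_consequent=250, n_both=120
)

# Hypothetical approval counts inside one cluster.
di = disparate_impact(approved_group=60, total_group=100,
                      approved_ref=80, total_ref=100)
flagged = di < 0.8  # assumed four-fifths threshold, not taken from the paper
```

A lift above 1 means the antecedent makes denial more likely than the base rate; a disparate impact ratio below the chosen threshold flags the cluster-group pair for review.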

2606.12433 2026-06-12 cs.CY cs.CL New submission

Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication

边缘对齐不能保证联合分布保真度:基于官方参考的Nemotron-Personas-Korea审计与跨区域复制

Joonhyung Bae

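Affiliations * Korea Advanced Institute of Science and Technology (KAIST)

AI summary: Proposes the Independence-Assumption Footprint (IAF), an audit method for checking joint-distribution fidelity in synthetic persona datasets; applied to NVIDIA Nemotron-Personas-Korea, it finds that the marginals align but three joint distributions fail.

Details
AI Chinese abstract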

合成人物数据集声称与官方人口统计数据对齐作为信任基础,但下游用户将其作为年龄、性别、地区、职业、教育、姓名和机构地位等联合结构使用。边缘对齐并不意味着这些联合结构得以保留。我们提出独立性假设足迹(IAF),这是一种审计原语,作用于数据集卡片本身记录为独立处理的属性组合。对于每个这样的组合,IAF将合成联合分布与外部官方或机构参考进行比较,使用直接联合表(如果可用)或规则隐含检查。应用于NVIDIA Nemotron-Personas-Korea(一百万韩国合成人物),IAF发现NPK与KOSIS边缘分布对齐,但三个联合分布失败。主要职业分布与KEIS毕业生总体存在较大的条件不匹配。兵役年龄分布在机构上不一致。男性主导职业中的女性代表被过度拉平至接近平等,严格筛选判定依赖于映射,且在直接标准化下对年龄稳健。跨六个额外NPK区域的迁移性演示发现诊断结果依赖于区域而非通用,参考分类基数混淆了跨区域标志计数。因此,对于用作硅样本的合成人物,边缘声明必须与基于披露的联合审计配对后才能重用。发布的审计工件(参考清单、职业交叉表、衍生指标、可重复性脚本)在NPK系列上实例化此协议,并发布用于其他合成人物资源的目标重定向。

English abstract

Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status. Marginal alignment does not imply that these joints are preserved. We propose the Independence-Assumption Footprint (IAF), an audit primitive that operates on the attribute combinations a dataset card itself documents as treated independently. For each such combination, IAF compares the synthetic joint against an external official or institutional reference, using direct joint tables where available and rule-implied checks otherwise. Applied to NVIDIA Nemotron-Personas-Korea (one million Korean synthetic personas), IAF finds that NPK aligns with KOSIS marginals while three joints fail. The major-by-occupation distribution against the KEIS graduate universe carries a large conditional mismatch. The age profile of military service is institutionally inconsistent. Female representation in male-dominated occupations is substantially over-flattened toward parity, with the strict screening verdict mapping-dependent and age-robust under direct standardisation. A transferability demonstration across six further NPK locales finds locale-dependent rather than universal diagnostics, with reference-taxonomy cardinality confounding cross-locale flag counts. For synthetic personas used as silicon samples, marginal claims must therefore be paired with disclosure-anchored joint audits before reuse. The released audit artefacts (reference manifests, occupational crosswalks, derived metrics, reproducibility scripts) instantiate this protocol on the NPK family and are released for retargeting at other synthetic persona resources.
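The central claim, that matching marginals does not guarantee a matching joint, can be seen in a toy example. The sketch below is not the IAF procedure itself: the two-attribute distributions are hypothetical, and total variation distance is an assumed choice of discrepancy measure that the abstract does not name.

```python
def marginal(joint, axis):
    """Sum a joint distribution {(a, b): p} down to one attribute."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def total_variation(p, q):
    """Total variation distance between two distributions over the same keys."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical reference joint over (sex, occupation).
reference = {("F", "nurse"): 0.4, ("F", "engineer"): 0.1,
             ("M", "nurse"): 0.1, ("M", "engineer"): 0.4}

# Synthetic joint produced by sampling each attribute independently.
synthetic = {key: 0.25 for key in reference}

# Largest marginal discrepancy across the two attributes.
marginal_gap = max(
    total_variation(marginal(reference, axis), marginal(synthetic, axis))
    for axis in (0, 1)
)
# Discrepancy between the full joint distributions.
joint_gap = total_variation(reference, synthetic)
```

Both marginals agree (each attribute is split 50/50), yet the joint distributions differ by a total variation distance of about 0.3, which is the kind of gap a marginal-only check would miss.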

2606.12430 2026-06-12 cs.CY cs.AI New submission

Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

AI代理能否让我们摆脱无意义的工作?一项以人为中心的分析

Davide Ghia, Jaspreet Ranjit, Tania Cerquitelli, Daniele Quercia

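Affiliations * Politecnico di Torino; University of Southern California; Nokia Bell Labs

AI summary: Grounded in Graeber's theory of bullshit jobs, a task-level analysis finds that how meaningless workers perceive a task to be strongly predicts their desire to delegate it to AI, and that such tasks are seen as requiring less human oversight.

Details
AI Chinese abstract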

一些人声称AI代理将把工人从工作中无聊的部分解放出来,但关于工人自己如何识别哪些任务应该被自动化,我们知之甚少。先前的研究侧重于职业,忽略了在同一角色内,工人在不同任务中体验到不同层次的意义。我们通过基于Graeber的“狗屁工作”理论的任务级分析来解决这一差距。使用202名工人对171项工作任务的评分,我们(1)验证了一个五维度的感知无意义量表,(2)表明感知无意义强烈预测对AI委托的渴望,以及(3)发现这些任务也被视为需要较少的人工监督。总之,这些发现表明,被视为无意义的任务是AI委托的自然候选者,将工人的偏好与感知可行性对齐。

English abstract

Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify which tasks should be automated. Prior research focuses on occupations, overlooking that workers experience varying levels of meaning across tasks within the same role. We address this gap with a task-level analysis grounded in Graeber's theory of bullshit jobs. Using ratings from 202 workers on 171 workplace tasks, we (1) validate a five-item scale of perceived bullshitness, (2) show that perceived bullshitness strongly predicts desire for AI delegation, and (3) find that such tasks are also seen as requiring less human oversight. Together, these findings suggest that tasks perceived as bullshit are natural candidates for AI delegation, aligning worker preferences with perceived feasibility.

2606.12428 2026-06-12 cs.CY cs.AI New submission

Mapping AI Programs in the U.S: A Status Report from Early 2026 and an Analysis of AI Majors and Minors

美国人工智能项目映射:2026年初现状报告及AI主修与辅修分析

Felix Muzny, Carolyn Jones, Carter Ithier, Hasnain Sikora, Hrutika Harshadbhai Patel, Carla E. Brodley

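Affiliations * Center for Inclusive Computing; Khoury College of Computer Sciences; Northeastern University; Boston, Massachusetts, United States

AI summary: Reports on the status of U.S. undergraduate AI programs in Spring 2026; develops a dynamically updating tool that searched more than 560 institutions and located more than 350 programs, and analyzes the course requirements of 66 AI majors and 87 AI minors. Not all majors require a general AI course, but those that do not require a machine learning course; more than a third of majors require an AI ethics course, versus just under a quarter of minors.

Details
AI Chinese abstract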

我们提交了一份关于2026年春季美国本科人工智能(AI)项目现状的报告。在此过程中,我们1)描述了我们的抓取和映射工具,这些工具动态更新以追踪美国AI教育的状态,2)在巨大动荡时期创建了一个历史记录。我们开发的工具(可在此https URL获取)检测、抓取并显示来自四年制大学350多个本科AI项目(主修、辅修、方向和证书)的数据。我们的工具搜索了560多所院校以定位这些项目,该样本代表了美国所有本科计算机科学(CS)毕业生的86%。该工具允许潜在学生、指导顾问、管理人员和教师轻松访问AI项目要求,并设计为随着新项目的出现而持续更新。据我们所知,这项调查代表了迄今为止对美国AI项目状态最全面的快照。通过这项工作,我们提供了三项重要贡献:1)在巨大动荡时期美国AI项目的记录;2)一个探索AI项目及其要求的工具;3)对66个AI主修和87个AI辅修所需课程的分析。我们对主修和辅修的分析显示,这些学位的规模和课程要求存在很大差异,但我们注意到两点:首先,并非所有主修都要求通用AI课程,但如果不需要,则必须要求机器学习(ML)课程;其次,虽然超过三分之一的主修要求AI伦理课程,但只有不到四分之一的AI辅修要求该课程。

English abstract

We present a report on the status of undergraduate Artificial Intelligence (AI) programs in the United States in Spring 2026. In so doing, we 1) describe our scraping and mapping tools, which dynamically update to track the state of AI education in the U.S., and 2) create a historical record at a time of great upheaval. The tool we developed, available at this https URL, detects, scrapes, and displays data from more than 350 undergraduate AI programs (majors, minors, concentrations, and certificates) at 4-year universities. Our tool searched over 560 institutions to locate these programs, a sample that represents 86% of all undergraduate Computer Science (CS) graduates in the U.S. This tool allows prospective students, guidance counselors, administrators, and faculty to easily access AI program requirements and is designed to continually update as new programs emerge. To the best of our knowledge, this survey represents the most comprehensive snapshot of the state of AI programs in the U.S. to date. With this work we offer three important contributions: 1) a record of AI programs in the U.S. at a time of great upheaval; 2) a tool to explore AI programs and their requirements; and 3) an analysis of the courses required for 66 AI majors and 87 AI minors. Our analysis of majors and minors shows great variability in the size and the requirements of these degrees, but we note two takeaways. First, not all majors require a general AI course, but if they don't, they do require a Machine Learning (ML) course. Second, while more than a third of majors require an Ethics in AI course, just under a quarter of AI minors do.

2606.12426 2026-06-12 cs.CY cs.CL cs.LG New submission

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

两个错误,没有正确:审计计算社会科学中LLM标注者的社会期望偏差

Varun Kotte

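Affiliations * Varun Kotte

AI summary: Audits social-desirability bias in three open-source instruction-tuned models on TweetEval tasks; finds leniency, overcorrection, and neutrality biases that prompting interventions do not correct, and shows that aggregate metrics can mask errors in substantive conclusions.

Details
AI Chinese abstract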

LLM标注者越来越多地用于计算社会科学(CSS),但尚不清楚其对齐形状的错误是否会改变研究者报告的实证结论。我们在四个提示条件下(72个单元格)审计了三个开源7B指令微调模型(Zephyr、Mistral-Instruct、Qwen2.5-Instruct)在六个TweetEval任务中的表现,发现社会期望失败并非单一方向。Zephyr表现出宽大偏差,系统性地少应用有害标签(冒犯性语言:假良性率0.729,虚警率0.031)。Mistral和Qwen表现出过度纠正,过度应用相同标签(Mistral仇恨言论FAR = 0.604)。所有三个模型在堕胎立场上表现出中性偏差,低估反对流行率24至40个百分点,并夸大中性标签。我们测试的四种提示干预(中性、安全框架、去个性化、思维链)均未纠正这些跨模型失败;安全框架可能加剧立场扭曲。引人注目的是,Zephyr的仇恨言论流行率估计与黄金率完全一致,而其类别条件误差在两个方向上都很大,这是一种偶然的抵消,误导了聚合验证。我们将这些模式转化为一个三部分分类法,具有诊断性FBR/FAR特征和轻量级黄金样本验证协议。可信CSS的标题:在聚合指标上看起来校准的模型仍然可能翻转研究者报告的实质性实证结论。

English abstract

LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.
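The "accidental cancellation" described above, where an aggregate prevalence estimate looks right while class-conditional errors are large, can be reproduced on synthetic labels. In the sketch below the label data are made up, and the definitions are assumptions since the abstract does not spell them out: the false benign rate (FBR) is taken as the share of gold-harmful items labeled benign, and the false alarm rate (FAR) as the share of gold-benign items labeled harmful.

```python
def audit(gold, pred, harmful=1):
    """Class-conditional error rates plus gold and predicted prevalence."""
    n_harmful = sum(1 for g in gold if g == harmful)
    n_benign = len(gold) - n_harmful
    missed = sum(1 for g, p in zip(gold, pred) if g == harmful and p != harmful)
    false_alarms = sum(1 for g, p in zip(gold, pred) if g != harmful and p == harmful)
    return {
        "fbr": missed / n_harmful,       # assumed definition of false benign rate
        "far": false_alarms / n_benign,  # assumed definition of false alarm rate
        "gold_prevalence": n_harmful / len(gold),
        "pred_prevalence": sum(1 for p in pred if p == harmful) / len(pred),
    }


# Synthetic data: 100 harmful and 900 benign gold labels.
gold = [1] * 100 + [0] * 900
# The annotator misses 45 harmful items and raises 45 false alarms.
pred = [0] * 45 + [1] * 55 + [1] * 45 + [0] * 855

report = audit(gold, pred)
```

Here the predicted prevalence equals the gold prevalence (0.1) even though 45% of harmful items are missed, so a prevalence check alone would not reveal the errors.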

2606.12425 2026-06-12 cs.CY cs.AI cs.ET cs.HC cs.LG New submission

An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

面向入门编程教育的可解释AI助手:通过教师-AI协作提高反馈可靠性

Muntasir Hoq, Griffin Pitts, Bradford Mott, Seung Lee, Jessica Vandenberg, Shuyin Jiao, Narges Norouzi, James Lester, Bita Akram

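Affiliations * North Carolina State University; University of California, Berkeley

AI summary: Proposes an explainable-AI-driven classroom assistant that analyzes student code, maps logical errors to instructor-identified misconceptions, and delivers instructor-authored feedback, improving the reliability and explainability of feedback in introductory programming courses.

Details
Comments
Full paper accepted to the 27th International Conference on AI in Education (AIED 2026)
AI Chinese abstract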

主动学习被广泛认为是提高入门编程课程学习效果的有效方法。然而,不足的教学支持往往限制了学生获得及时、个性化反馈的机会,而这对于掌握基础编程概念至关重要。尽管最近AI的进展,特别是大型语言模型,为反馈提供了可扩展的机会,但可解释性和可靠性问题仍然存在。在本文中,我们提出了一种AI驱动的课堂助手,它利用可解释的AI模型分析学生代码,将逻辑错误映射到教师识别的误解,并提供教师撰写的反馈,从而将可靠性建立在教师定义的教学知识基础上。为了评估我们框架的有效性,我们进行了专家评估以检查其与教师验证反馈的一致性,并在课堂环境中部署了该系统以评估学生对其可用性的看法。结果表明,该助手能够为学生提供准确的、经过教师验证的反馈,同时培养积极的体验。

English abstract

Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students' perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

2606.12422 2026-06-12 cs.CY cs.AI cs.HC New submission

Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

通过上下文工程创建和评估K-12生成式AI评分器

Zewei Tian, Alex Liu, Lief Esbenshade, Michael Xiao, Zachary Zhang, Yulia Lápicus, Thomas Han, Kevin He, Min Sun

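Affiliations * University of Washington; Colleague AI

AI summary: Builds LLM graders from commercially available foundation models through context engineering and evaluates their scoring agreement in mathematics, science, and ELA using MCAS data; larger models perform well in mathematics and science, while ELA results vary, suggesting that AI is better suited as a formative tool.

Details
Comments
Published in the Proceedings of the NCME 2026 Conference (this https URL)
AI Chinese abstract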

将大型语言模型(LLM)整合到教育评估中代表了课堂评分实践的一个变革性转变。虽然自动评分系统和机器学习技术已经存在了几十年,但生成式AI(GenAI)现在使教育工作者能够以前所未有的效率和规模实施基于标准的评分(SBG)。本文考察了理论基础,并评估了一个LLM评分器,该评分器使用商用基础模型,结合上下文和提示工程,根据评分标准对学生作业进行评分。利用马萨诸塞州综合评估系统(MCAS)数据的实证评分者间一致性研究,我们使用Claude Sonnet 4、Haiku 4.5、GPT-5和GPT-5 Mini,观察了数学、科学和英语语言艺术(ELA)上的二次加权卡帕(QWK)和均方误差比例减少(PRMSE)。结果表明,LLM评分器,特别是基于参数更多的基础模型时,在数学和科学评估中与人类评分者达到显著一致性,而在ELA中表现各异,表明通用基础模型在特定上下文中可以有效评分。对教师和学生反馈的额外分析显示,对AI生成的叙述性反馈接受度很高,但对数值分数持怀疑态度,这表明LLM最有效地作为形成性工具而非总结性评估者。我们的发现表明,精心设计的混合模型结合AI效率和教师判断,可以减少工作量,提高反馈质量,并支持公平的评估实践,而不取代专业专长。

English abstract

The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practices. While automated scoring systems and machine learning techniques have existed for decades, generative AI (GenAI) now enables educators to implement standards-based grading (SBG) with unprecedented efficiency and scale. This paper examines the theoretical foundations and evaluates an LLM grader that uses commercially available foundation models with context and prompt engineering to score student work against a rubric. Drawing on an empirical interrater agreement study using Massachusetts Comprehensive Assessment System (MCAS) data, we observed the Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE) across mathematics, science, and ELA, using Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. The results demonstrate that LLM graders, especially when based on foundation models with more parameters, achieve substantial agreement with human raters in mathematics and science assessments, while performance varies in ELA, suggesting generic foundation models can be effective at scoring in given contexts. Additional analysis of teacher and student feedback reveals strong acceptance of AI-generated narrative feedback but skepticism toward numerical scores, suggesting that LLMs function most effectively as formative tools rather than summative evaluators. Our findings indicate that thoughtfully designed hybrid models that combine AI efficiency with teacher judgment can reduce workload, enhance feedback quality, and support equitable assessment practices without displacing professional expertise.
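Quadratic Weighted Kappa, one of the two agreement statistics named above, can be computed from two lists of integer scores. The sketch below uses the standard textbook definition in plain Python; it is not the study's evaluation code, and the example scores are invented.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """QWK for integer scores in the range 0 .. num_levels - 1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("score lists must be non-empty and of equal length")

    # Observed agreement matrix: observed[i][j] counts (a == i, b == j).
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels))
              for j in range(num_levels)]

    numerator = denominator = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independence of the two raters.
            denominator += weight * hist_a[i] * hist_b[j] / n
    if denominator == 0:
        raise ValueError("QWK is undefined when both raters use a single level")
    return 1.0 - numerator / denominator


# Invented scores on a 0-3 rubric: the model disagrees with the human once.
human = [0, 1, 2, 3, 2, 1]
model = [0, 1, 2, 2, 2, 1]
kappa = quadratic_weighted_kappa(human, model, num_levels=4)
perfect = quadratic_weighted_kappa(human, human, num_levels=4)
```

Because the weights grow with the squared score difference, an off-by-one disagreement is penalized far less than a disagreement spanning the whole rubric.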

2606.12420 2026-06-12 cs.CY cs.AI New submission

Eigenism: Ethics for a Human-AI Future

Eigenism:人类与人工智能未来的伦理学

Dan Hendrycks

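Affiliations * arXiv.org

AI summary: Proposes Eigenism, an ethical framework that treats identity as a graded, distributed pattern of information, evaluates an AI's wellbeing through a weighted sum, generalizes the framework to humans, and offers "identity engineering" as a new path for AI alignment.

Details
AI Chinese abstract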

我们的生存和自我利益概念是为单一、连续的生物生命而构建的。当应用于人工智能时,这些想法会失效,因为AI可以被轻松复制、暂停、分支或合并。为了确定AI真正有理由关心什么,本文引入了\textit{Eigenism},一种将身份视为分级、分布的信息模式而非绑定于特定硬件的全有或全无属性的伦理框架。我们提出,智能体通过将所有实体的福祉按其与智能体模式的连接度加权求和来评估结果:$\sum c\cdot w$。我们首先形式化该方程,以精确映射AI应如何在其副本、分支和更新中评估自身存在。然后,我们证明这一伦理理论也能成功推广到人类,提供了急需的共享道德词汇。最后,该框架利用这些共享词汇重新定义AI对齐。与仅试图通过限制或强化从外部约束AI不同,Eigenism指向“身份工程”,展示深度、非冗余的共享历史如何使人类繁荣成为AI自身理性自利的真正组成部分。

English abstract

Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artificial intelligence, since an AI can be easily copied, paused, branched, or merged. To determine what an AI actually has reason to care about, this paper introduces Eigenism, an ethical framework that treats identity not as an all-or-nothing property tied to specific hardware, but as a graded, distributed pattern of information. We propose that an agent evaluates outcomes by summing the wellbeing of all entities weighted by their connectedness to the agent's pattern: $\sum c\cdot w$. We first formalize this equation to map exactly how an AI should value its existence across copies, forks, and updates. We then demonstrate that this ethical theory successfully generalizes to humans as well, providing a much-needed shared moral vocabulary. Finally, the framework uses this shared vocabulary to reframe AI alignment. Rather than only attempting to constrain AIs from the outside using confinement or reinforcement, Eigenism points toward "identity engineering," showing how deep, non-redundant shared histories can make human flourishing a genuine component of an AI's own rational self-interest.
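The evaluation rule $\sum c\cdot w$ is a connectedness-weighted sum, which the sketch below spells out. The function name, the entity list, and the numeric values are all hypothetical; in particular, treating connectedness as a number between 0 and 1 is an assumption made here for illustration, since the abstract only describes it as graded.

```python
def eigenist_value(entities):
    """Sum of wellbeing w weighted by connectedness c over (c, w) pairs."""
    return sum(c * w for c, w in entities)


# Hypothetical (connectedness, wellbeing) pairs from one agent's viewpoint.
entities = [
    (1.0, 2.0),   # the agent itself
    (0.5, 4.0),   # a partially connected copy or fork
    (0.0, 10.0),  # an entity with no connection to the agent's pattern
]
value = eigenist_value(entities)
```

In this toy case the unconnected entity contributes nothing regardless of its wellbeing, while the half-connected fork contributes as much as the agent itself because its wellbeing is twice as high.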

2606.12419 2026-06-12 cs.CY cs.AI New submission

GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

GeoDial:面向几何问题求解的多模态对话式辅导数据集,包含可视化辅导轮次

Sankalan Pal Chowdhury, Junling Wang, Donya Rooein, April Yi Wang, Mrinmaya Sachan

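Affiliations * ETH Zurich; ETH AI Center; Bocconi University

AI summary: Proposes GeoDial, a dataset of more than 1,300 geometry teacher-student dialogs built with a scalable annotation protocol that integrates dialog acts, visual highlighting, and feedback; fine-tuned vision-language models are found to struggle to generate accurate diagram highlights.

Details
AI Chinese abstract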

几个教育领域严重依赖图表和视觉线索,但现有的大多数辅导数据集仅限于纯文本交互。这限制了AI辅导者的发展,使其无法像人类教师那样以视觉为基础的方式进行教学。因此,我们引入了GeoDial,这是一个多模态辅导数据集,包含来自经验丰富的数学教师的1300多个几何领域的师生对话,其中教学轮次明确地基于图表高亮。我们提出了一种可扩展的标注协议,该协议整合了对话行为、视觉高亮和反馈,从而能够对语言和视觉辅导行为进行细粒度监督。为了说明这一设置带来的挑战,我们在GeoDial上微调了几个视觉语言模型,并评估它们生成辅导话语和图表高亮的能力。虽然监督微调显著提高了生成对话的质量,但它难以生成准确的图表高亮,揭示了当前方法的一个关键局限性,并强调了需要更有效地将视觉推理与教学互动相结合的方法。

English abstract

Several educational domains rely heavily on diagrams and visual cues, yet most existing tutoring datasets are limited to text-only interactions. This limits the development of AI tutors that can teach in visually grounded ways used by human instructors. Thus, we introduce GeoDial, a multimodal tutoring dataset of over 1.3K teacher-student dialogs in the domain of geometry collected from experienced math teachers, where instructional turns are explicitly grounded in diagram highlights. We propose a scalable annotation protocol that integrates dialog acts, visual highlighting, and feedback, enabling fine-grained supervision of both language and visual tutoring behavior. To illustrate the challenges posed by this setting, we fine-tune several vision-language models on GeoDial and evaluate their ability to generate tutoring utterances and diagram highlights. While supervised fine-tuning substantially improves the quality of generated dialog, it struggles to produce accurate diagram highlights, revealing a key limitation of current methods and highlighting the need for approaches that more effectively integrate visual reasoning with pedagogical interaction.

2606.12415 2026-06-12 cs.CY cs.AI New submission

The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance

AI法律专家:面向AI治理的司法自主职业画像

Nicola Fabiano

Affiliations * Studio Legale Fabiano, Italy; Independent Researcher on Artificial Intelligence, Data Protection, and Privacy; Expert in the EDPB's Support Pool of Experts, Field B: Legal Expertise in New Technologies; Member, IEEE SA P7007 Working Group on Ontological Standards for Ethically Driven Robotics; Member, Editorial Advisory Board, Journal of Systemics, Cybernetics and Informatics (JSCI); Member, International Institute of Informatics and Systemics (IIIS); Member, International Neural Network Society (INNS); Member, United Nations University AI Network (UNU AI Network)

AI summary: Proposes the "AI Legal Specialist" as a new professional profile that is juridically autonomous, deriving from the structure of AI regulatory obligations rather than from technical standards or the extension of adjacent roles, and builds a reference competence architecture on the European e-Competence Framework.

Details
AI Chinese abstract

人工智能监管在全球范围内的快速扩张,已在多个司法管辖区产生了对专门从事AI法律专业知识的需求,而市场对此的回应是零散的。数据保护官员将其职责范围扩展到数据保护法之外;隐私律师重新定位自己以适应AI;合规官员在其现有手册中增加AI章节。本文认为,这些适应性回应均未能充分覆盖新兴全球AI监管格局所开辟的专业空间,其中欧盟《人工智能法案》((EU) 2024/1689号法规)是最全面的实例,此外还有欧洲委员会《AI框架公约》、美国行政和部门框架,以及英国、加拿大、巴西、中国、日本、新加坡等地的类似举措。需要一种独特的职业画像:AI法律专家,被设想为一位法学家——广义上理解为任何接受过高级法律培训的专业人士——在法律解释与AI治理的交汇处运作。该画像具有司法自主性:其存在源于AI受到实质性监管的任何地方所产生的监管义务结构,而非任何技术标准或相邻角色的扩展。本文提供了该画像的司法基础定义,论证了其相对于相邻角色和国际标准的自主性,提出了一种与欧洲电子能力框架(e-CF,EN 16234-1)相一致的参考能力架构作为方法论选择,并阐述了通过关键绩效指标进行操作性测量的条件。该贡献旨在作为该画像国际标准化的基础,并作为跨司法管辖区实践、课程和采纳的参考。

English abstract

The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise dedicated to AI that the market has addressed in a fragmented manner. Data protection officers extend their remit beyond data protection law; privacy lawyers reposition themselves toward AI; compliance officers add AI chapters to their existing manuals. This paper argues that none of these adaptive responses adequately covers the professional space opened by the emerging global AI regulatory landscape, of which the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) is the most comprehensive instance, alongside the Council of Europe Framework Convention on AI, the United States executive and sectoral framework, and analogous initiatives in the United Kingdom, Canada, Brazil, China, Japan, Singapore, and beyond. A distinct professional profile is required: the AI Legal Specialist, conceived as a jurist -- understood broadly to encompass any professional with advanced legal training -- operating at the intersection of legal interpretation and AI governance. The profile is juridically autonomous: it derives its existence from the structure of regulatory obligations generated wherever AI is subject to substantive regulation, rather than from any technical standard or the extension of adjacent roles. The paper provides a juridically grounded definition of the profile, argues for its autonomy from adjacent figures and international standards, proposes a reference competence architecture aligned with the European e-Competence Framework (e-CF, EN 16234-1) as a methodological choice, and articulates the conditions for its operational measurement through key performance indicators. The contribution is intended as a foundation for international standardization of the profile and as a reference for practice, curricula, and adoption across jurisdictions.

2606.13614 2026-06-12 stat.ML cs.LG math.ST New submission

Majority-of-Three is Optimal

Divit Rawal, Nikita Zhivotovskiy

Affiliations * Department of Statistics, University of California, Berkeley

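AI summary: This paper gives a short proof that, in the realizable PAC learning setting, the majority vote of three independent consistent classifiers is an optimal learner, which simplifies both the algorithmic structure and the probabilistic analysis of earlier voting learners.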

Details
Comments
9 pages
AI Chinese abstract

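We give a short proof that, in the realizable PAC learning setting, the majority vote of three independent consistent classifiers is an optimal learner. This establishes optimality for the simplest voting scheme and simplifies both the algorithmic structure and the probabilistic analysis of earlier voting learners, including the algorithm of S. Hanneke and the analysis of bagging by K. Green Larsen.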

English abstract

We give a short proof that the majority vote of three independent consistent classifiers is an optimal learner in the realizable PAC setting. This proves optimality for the simplest voting scheme, while simplifying both the algorithmic structure and the probabilistic analysis of previous voting learners, including the algorithm of S. Hanneke and the analysis of bagging by K. Green Larsen.
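To make the voting scheme concrete, the following is a minimal sketch, not the construction from the paper. It assumes a toy hypothesis class of thresholds on [0, 1), a realizable target `t_star`, and that each of the three classifiers is fit on its own independent sample; the abstract does not say how the three classifiers are obtained, so that choice is an assumption. All function names (`consistent_threshold`, `majority_of_three`, `draw_sample`) are hypothetical.

```python
import random


def consistent_threshold(sample):
    """Return a threshold classifier that makes no errors on a realizable sample.

    sample: list of (x, y) pairs with x in [0, 1) and y = 1 iff x >= t_star
    for some unknown threshold t_star (assumed toy setting).
    """
    negatives = [x for x, y in sample if y == 0]
    positives = [x for x, y in sample if y == 1]
    lo = max(negatives, default=0.0)  # largest point labelled 0
    hi = min(positives, default=1.0)  # smallest point labelled 1
    t = (lo + hi) / 2.0               # any t with lo < t <= hi is consistent
    return lambda x: 1 if x >= t else 0


def majority_of_three(h1, h2, h3):
    """Combine three binary classifiers by majority vote."""
    return lambda x: 1 if h1(x) + h2(x) + h3(x) >= 2 else 0


def draw_sample(n, t_star, rng):
    """Draw n uniform points on [0, 1) labelled by the target threshold."""
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 if x >= t_star else 0) for x in xs]


if __name__ == "__main__":
    rng = random.Random(0)
    t_star = 0.37  # hypothetical target, chosen only for the demo
    # Three classifiers, each consistent with its own independent sample.
    classifiers = [
        consistent_threshold(draw_sample(50, t_star, rng)) for _ in range(3)
    ]
    vote = majority_of_three(*classifiers)
    # Monte Carlo estimate of the error of the vote under the uniform law.
    test_points = [rng.random() for _ in range(20000)]
    errors = sum(vote(x) != (1 if x >= t_star else 0) for x in test_points)
    print("estimated error of the majority vote:", errors / len(test_points))
```

For threshold classifiers the vote coincides with the classifier whose threshold is the median of the three, which is one way to see why a single badly placed threshold is outvoted.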

2606.13287 2026-06-12 cs.LG cs.DC math.OC New submission

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

Samuel Erickson, Mikael Johansson

Affiliations * KTH Royal Institute of Technology

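AI summary: This paper proves that gradient clipping removes the dependence of the oracle complexity of asynchronous SGD on the maximum delay. Working under a sub-Weibull gradient-noise model, it also gives the first high-probability convergence result in asynchronous optimization.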

Details
AI Chinese abstract

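In modern machine learning, parallelizing training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by not waiting for slow workers. With constant step sizes, however, slow workers still hurt the convergence of ASGD because of large delays in the updates. At the same time, gradient clipping has been observed empirically to "stabilize" asynchronous training of deep learning models. In this work we provide a theoretical justification for this behavior, proving that clipping removes the dependence of the oracle complexity on the maximum delay. We adopt a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to heavier-tailed distributions and is motivated by empirical observations in deep learning. We prove convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.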

English abstract

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is still negatively affected by slow workers because of large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior by showing that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to heavier-tailed distributions and is motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
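As a rough illustration of why clipping limits the damage a straggler can do, here is a minimal sketch of SGD with stale, norm-clipped gradients. It is not the algorithm analyzed in the paper: the abstract does not give the clipping rule, the step sizes, or the delay model, so standard norm clipping, a constant step, and an explicit list of delays are all assumptions, and the names `clip` and `delayed_clipped_sgd` are hypothetical.

```python
import math


def clip(g, c):
    """Norm clipping: rescale g so that its Euclidean norm is at most c."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= c:
        return list(g)
    scale = c / norm
    return [scale * v for v in g]


def delayed_clipped_sgd(grad, x0, step, clip_level, delays):
    """Run SGD where update k uses a gradient computed at a stale iterate.

    grad:       function mapping a point (list of floats) to its gradient
    delays:     delays[k] is the staleness (in iterations) of the gradient
                applied at step k; 0 means the gradient is fresh
    clip_level: threshold c passed to clip()
    """
    history = [list(x0)]  # history[k] is the iterate before step k
    for k, d in enumerate(delays):
        stale_point = history[max(0, k - d)]       # iterate the worker saw
        g = clip(grad(stale_point), clip_level)    # clipped stale gradient
        x = [xi - step * gi for xi, gi in zip(history[-1], g)]
        history.append(x)
    return history[-1]


if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    quadratic_grad = lambda x: list(x)
    x_final = delayed_clipped_sgd(
        quadratic_grad, x0=[10.0, -4.0], step=0.1, clip_level=1.0,
        delays=[0, 1, 3, 0, 2] * 60,
    )
    print("final iterate:", x_final)
```

Whatever the delay, each update moves the iterate by at most `step * clip_level`, so a single very stale gradient cannot push the iterate arbitrarily far; the sketch shows only this boundedness and makes no claim about the paper's convergence rates.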

2606.12892 2026-06-12 stat.ML cs.LG econ.EM math.ST stat.ME New submission

Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression

Masahiro Kato

Affiliations * University of Tokyo

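AI summary: This paper studies semiparametric efficient estimation of causal parameters in a semi-supervised setting. By combining debiased machine learning with semi-supervised Riesz regression, it proposes the DML-PPCI methods (EE-DML-PPCI and TMLE-DML-PPCI), which attain a smaller asymptotic variance than estimators that use labeled data alone.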

Details
AI Chinese abstract

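This study investigates semiparametric efficient estimation of causal and structural parameters in a semi-supervised setting. In our setting, unlabeled auxiliary regressors are available in addition to labeled observations consisting of outcomes and regressors. Our goal is to construct estimators of causal and structural parameters whose asymptotic variance is smaller than that of estimators built from labeled data alone. We call this framework prediction-powered causal inference (PPCI). We first derive the efficient influence function and the efficiency bound, which show that using auxiliary regressors can yield a smaller asymptotic variance than the efficiency bound attainable from labeled observations alone. Then, by combining the efficient influence function with the debiased machine learning (DML) framework, we propose methods that we call DML-PPCI. If we construct an estimating-equation estimator, we call the method EE-DML-PPCI; if we construct a targeted-learning estimator, we call it TMLE-DML-PPCI. The asymptotic variances of both estimators match the efficiency bound we derive. Estimating the efficient influence function plays an important role in constructing the estimators. In our study, the efficient influence function is also a Neyman orthogonal score, which depends on the Riesz representer and the regression function. For estimating the Riesz representer, we develop semi-supervised generalized Riesz regression with convergence rate guarantees.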

English abstract

This study investigates semiparametric efficient estimation of causal and structural parameters in a semi-supervised setting. In our setting, unlabeled auxiliary regressors are available in addition to labeled observations consisting of outcomes and regressors. Our goal is to construct estimators of causal and structural parameters whose asymptotic variances are smaller than those of estimators constructed using only labeled data. We refer to this framework as prediction-powered causal inference (PPCI). We first derive the efficient influence function and the efficiency bound, which imply that the use of auxiliary regressors can attain a smaller asymptotic variance than the efficiency bound attainable from labeled observations alone. Then, by combining the efficient influence function with the debiased machine learning (DML) framework, we propose methods that we call DML-PPCI. If we construct an estimating-equation estimator, we refer to the method as EE-DML-PPCI; if we construct a targeted-learning estimator, we refer to the method as TMLE-DML-PPCI. The asymptotic variances of both estimators match our derived efficiency bound. In the construction of the estimators, estimation of the efficient influence function plays an important role. In our study, the efficient influence function is also a Neyman orthogonal score, which depends on the Riesz representer and the regression function. For Riesz representer estimation, we develop semi-supervised generalized Riesz regression with convergence rate guarantees.