arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2079
2606.06034 2026-06-05 cs.LG cs.AI

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

当足够好即最优:量化门控DeltaNet的仅乘法矩阵求逆近似

Luoming Zhang, Yuwei Ren, Kui Zhang, Tian Liu, Lingjuan Ge, Denghao Li, Matthew Harper Langston, Yin Huang, Weiliang Will Zeng, Liang Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对分块并行线性注意力中矩阵求逆的瓶颈,提出基于截断Neumann级数展开的仅矩阵乘法算法,结合结构掩码和并行残差校正,实现NPU上5倍内核加速和20%解码层开销降低。

详情
AI中文摘要

分块并行线性注意力中的矩阵求逆是长上下文建模的主要瓶颈,尤其是在NPU上,基于前向替换的方法并行性有限且硬件利用率低。我们提出了一种快速的、基于矩阵乘法(MatMul)的算法,专门针对分块线性注意力中出现的严格下三角矩阵。受Neumann级数项快速增长和逆矩阵对角集中性的启发,我们采用截断Neumann展开,结合结构掩码和并行残差校正,以消除顺序依赖。我们进一步将方法扩展到低比特INT,通过缓解重复矩阵幂运算引起的动态范围扩展,并根据块大小调整近似阶数和残差步长,以最小化计算成本同时保持模型精度。在Qwen3.5系列模型上的实验表明,在浮点和低精度推理下,该方法实现了高达5倍的内核级加速和20%的解码层开销降低,同时保持了精度。我们的方法为可扩展线性注意力提供了一种高效且硬件友好的解决方案。

英文摘要

Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bits INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5$\times$ kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.

2606.06031 2026-06-05 cs.CL

NAVIRA: Decoupled Stochastic Remasking for Masked Diffusion Language Models

NAVIRA: 解耦随机重掩码用于掩码扩散语言模型

Andrey Fomenko, Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko

发表机构 * Lomonosov Moscow State University(莫斯科罗蒙诺索夫莫斯科国立大学) Institute for Artificial Intelligence(人工智能研究所)

AI总结 针对掩码扩散语言模型并行生成中的上下文污染问题,提出NAVIRA解码策略,通过解耦质量检测与重生成、随机采样重掩码位置,提升流畅性和多样性。

详情
AI中文摘要

掩码扩散语言模型通过并行迭代地解除掩码生成文本,但这种速度带来了修正问题:同一时间步生成的标记是从边缘分布预测的,早期局部依赖错误可能污染后续上下文。PRISM通过学习标记级质量分数并重掩码不可靠标记来解决此问题,但其推理规则是耦合的:同一前向传播既检测低质量标记又计算其替换的对数几率,因此错误标记仍会条件化再生。我们提出NAVIRA,一种推理时解码策略,将这两个操作分离并随机采样重掩码位置。第一次前向传播对标记评分;选中的标记被掩码;第二次前向传播从清理后的上下文再生。温度控制的重掩码减少对相同位置的重复修正,并在流畅性与多样性之间取得平衡。在170M掩码扩散语言模型的受控实验中,解耦提高了流畅性,而调度的随机重掩码保持了熵,并在更大的前向传播预算下获得了更强的LLM评判分数。这些结果表明,重掩码策略(而不仅仅是学习到的质量信号)对于可靠的掩码扩散文本生成至关重要。

英文摘要

Masked diffusion language models generate text by iteratively unmasking many tokens in parallel, but this speed comes with a correction problem: tokens generated in the same step are predicted from marginal distributions, and early local dependency errors can later contaminate the context. PRISM addresses this by learning token-level quality scores and remasking unreliable tokens, but its inference rule is coupled: the same forward pass both detects low-quality tokens and computes logits for their replacements, so the erroneous tokens still condition regeneration. We propose NAVIRA, an inference-time decoding policy that separates these two operations and samples remasking positions stochastically. A first forward pass scores tokens; selected tokens are masked; a second forward pass regenerates from the cleaned context. Temperature-controlled remasking reduces repeated correction of the same positions and balances fluency against diversity. In controlled experiments with a 170M masked diffusion language model, decoupling improves fluency, while scheduled stochastic remasking preserves entropy and achieves stronger LLM-judge scores under larger forward-pass budgets. These results show that remasking policy, not only the learned quality signal, is central to reliable masked-diffusion text generation.

2606.06027 2026-06-05 cs.AI cs.CL cs.LG cs.SI

RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

RedditPersona: 一个用于从Reddit进行社区条件化LLM适配的模块化框架

Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

发表机构 * Future Computing Group University of Oulu(未来计算组奥卢大学) Centre for Applied Computing University of Oulu(应用计算中心奥卢大学)

AI总结 提出RedditPersona模块化框架,通过五种分组策略和QLoRA训练参数高效适配器,在112个Reddit子版块上评估社区条件化语言模型,发现适配器的行为可识别性与策略内在一致性相关,且所有策略在可识别性和分布相似性之间存在一致权衡。

详情
AI中文摘要

社区条件化的语言模型适配需要在每个研究中独立做出关于数据收集、社区定义和评估的选择,这使得比较假设或重用工件变得困难。我们提出了RedditPersona,一个模块化框架,标准化了这些选择:它收集Reddit帖子和评论,分析活跃用户,根据五种分组策略(基于子版块、图结构、语义、混合和基于交互)对用户进行划分,通过QLoRA为每种策略训练参数高效的适配器,并在一个涵盖流畅性、忠实度、分布对齐和社区可识别性的共享度量套件下进行评估。应用于城市福祉领域的112个子版块(301,429个用户档案,超过1600万条评论),我们发现适配器的行为可识别性追踪了每种策略与子版块基线的内在一致性,并且所有五种策略在可识别性和与真实文本的分布相似性之间存在一致的权衡。代码和配置文件可在以下网址获取:https://github.com/Ahghaffari/redditpersona。

英文摘要

Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.

2606.06025 2026-06-05 cs.CL cs.AI

EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

EGTR-Review: 基于多智能体教师蒸馏的高效证据支撑科学同行评审生成

Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

发表机构 * Department of Information Management, Peking University(北京大学信息管理系) PKU-WUHAN Institute for Artificial Intelligence, Peking University(北京大学武汉人工智能研究院)

AI总结 提出EGTR-Review框架,通过多智能体教师蒸馏和证据加权目标,实现轻量级学生模型的高质量、可溯源同行评审生成。

详情
AI中文摘要

科学同行评审生成因能减少评审负担并提供及时反馈而受到越来越多的关注。然而,现有基于大型语言模型(LLM)的方法往往产生缺乏证据支持和弱源可追溯性的通用评论,而复杂的多智能体系统则导致高推理成本。为应对这些挑战,我们提出EGTR-Review,一种通过多智能体教师蒸馏实现的证据支撑且可追溯的评审生成框架。EGTR-Review首先构建一个多智能体教师,执行结构感知的论文分解、关键元素提取、外部学术证据检索、证据状态标注、验证推理和评审合成。然后,通过任务前缀驱动的多任务学习,将中间推理轨迹和最终评审评论蒸馏到轻量级学生模型中。证据加权目标进一步减少弱、缺失或不可验证监督的影响。在公共同行评审数据集上的实验表明,EGTR-Review(学生)在自动指标、LLM作为评判者评估和人工评估中均优于强提示基、微调基和结构化/智能体基线,同时保持强事实基础和源可追溯性,且显著降低令牌消耗和推理时间。我们的代码、提示、配置和样本数据可在GitHub上获取。

英文摘要

Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

2606.06020 2026-06-05 cs.CV

ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition

ReSAGE-PAR:行人属性识别中生成式扩展的表征相似性评估

Pablo Ayuso-Albizu, Pablo Carballeira, Juan C. SanMiguel, Paula Moral

发表机构 * Universidad Autónoma de Madrid(阿隆托纳大学马德里分校)

AI总结 针对行人属性识别数据稀缺问题,提出ReSAGE-PAR管道,通过扩散模型生成图像并利用贝叶斯分类器验证属性,实现可扩展的高保真数据集扩展,在标准骨干网络上提升高达8.7%。

详情
Comments
Under review at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
AI中文摘要

为了解决行人属性识别(PAR)中有限的数据多样性和数据稀缺问题,我们探索了使用基于属性提示的扩散模型进行图像合成。虽然这能够实现行人图像的可控生成,但它面临两个关键挑战:(i)高质量预训练数据与低分辨率、非标准监控裁剪之间的领域差距,以及(ii)需要可靠的属性验证以防止生成幻觉。在本文中,我们引入了一个稳健的生成-评分-自动标注管道,称为ReSAGE-PAR(PAR中生成式扩展的表征相似性评估),它弥合了这一领域差距,并实现了可扩展、高保真的数据集扩展。首先,我们使用定制的基于LoRA的图像到图像方法,将预训练的扩散模型适应到原生PAR分辨率。其次,我们提取生成图像与其条件提示之间的视觉-语言对齐分数,利用包括标签一致和不一致补充的综合提示策略。最后,我们制定了一个贝叶斯分类器,将这些连续分数转换为可靠的二值伪标签。大量评估证明了ReSAGE-PAR在保留空间先验和验证属性方面的有效性。当集成到PAR训练中时,ReSAGE-PAR一致地带来了显著的改进——在标准骨干网络上实现了高达8.7%的提升,并将最先进的框架推向了新的性能水平。这证明了其作为可扩展PAR增强的架构无关解决方案的价值。ReSAGE-PAR的完整代码库可在http://www-vpu.eps.uam.es/publications/ReSAGE-PAR公开获取。

英文摘要

To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements-achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This proves its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at http://www-vpu.eps.uam.es/publications/ReSAGE-PAR.

2606.06014 2026-06-05 cs.AI cs.RO

PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

PLAN-S:通过潜在风格动态桥接规划以实现自动驾驶世界模型

Xiaoyun Qiu, Jingtao He, Yijie Chen, Yusong Huang, Haotian Wang, Yixuan Wang, Xinhu Zheng

发表机构 * Intelligent Transportation Thrust, Systems Hub, and Center of Seamless Connectivity & Connected Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(智能交通 thrust、系统中心及无缝连接与智能连接研究院,香港科学与技术大学(广州))

AI总结 提出PLAN-S框架,通过从潜在表示解码风格条件语义成本图,解决自动驾驶中潜在世界模型规划的可控性问题,在nuScenes和NAVSIM上降低了碰撞率并提升了驾驶性能。

详情
AI中文摘要

潜在世界模型通过预测紧凑的场景动态来增强端到端自动驾驶,用于下游规划。然而,现有的基于潜在世界模型的规划器通常直接从纠缠的潜在表示生成轨迹。这种紧凑的潜在到规划器路径缺乏对风险、可驾驶性和多样风格偏好的显式建模,使得驾驶风格动态在最终轨迹选择之前难以监督、检查或调制。我们提出PLAN-S(具有潜在风格动态的规划),一个面向规划器的桥接方法,通过从潜在表示解码风格条件的四通道语义成本图来解决这种紧凑-可控性困境。成本图以自我状态和驾驶风格为条件,并通过两个宿主侧接口在规划决策上游被消费:用于回归规划器的注意力级融合和用于锚点得分规划器的奖励级融合。我们在两个架构不同的宿主上验证PLAN-S:nuScenes上的ResWorld和NAVSIM上的WoTE,同时冻结宿主骨干以隔离所提出的桥接的贡献。在nuScenes上,PLAN-S在每个时间范围上降低了基线L2,平均L2为0.55米,3秒碰撞率相对降低42%。在NAVSIM上,规则成本变体达到89.4的预测驾驶模型分数,而学习成本变体在基线挑战场景中提供了互补增益。消融实验表明,成本路径对更安全的轨迹选择贡献最直接。定性结果进一步显示,PLAN-S可以产生多样化的成本图,其空间一致的变化与不同的驾驶风格对齐。

英文摘要

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed up-stream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.

2606.06011 2026-06-05 cs.RO cs.LG cs.MA

Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies

将基于模型的控制与多智能体强化学习相结合以实现多智能体协作团队策略

Christian Llanes, Spencer W. Jensen, Samuel Coogan

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Sandia National Laboratories(桑地亚国家实验室)

AI总结 提出一种结合多智能体强化学习与模型预测控制的框架(MA-AC-MPC),通过扩展演员-评论家模型预测控制实现安全、动态可行的协作策略,并在追逃场景和异构环境中验证其优于多层感知机模型。

详情
Comments
12 pages, 8 figures, 7 tables
AI中文摘要

在这项工作中,我们提出了一种将多智能体强化学习(MARL)与基于模型的控制相结合的框架,以在协作多智能体任务中实现安全、动态可行的动作。多智能体强化学习具有从长期规划视野中的离散不可微奖励中学习多智能体团队协作策略的优势。模型预测控制具有鲁棒性,并在快速重规划框架中为短视野提供安全、动态可行的动作。我们提出了一种将演员-评论家模型预测控制扩展到MARL的算法,称为多智能体演员-评论家模型预测控制(MA-AC-MPC)。我们通过将其应用于多智能体追逃场景来展示该算法的能力。具体来说,我们比较了使用MA-AC-MPC模型和多层感知机模型(MA-AC-MLP)的逃避者团队策略。追逐者团队使用增强比例导航,因为它被接受为一种先进的对抗控制律。我们还提供了一个异构环境的示例,其中无人机和全向轮式机器人协作,在硬件上实现了可重复且成功的着陆,MA-AC-MPC的成功率为100%,而MA-AC-MLP为60%。我们在硬件上证明了所提出的MA-AC-MPC算法在两种环境中的鲁棒性。

英文摘要

In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.

2606.06010 2026-06-05 cs.LG cs.DB

Adaptive Oscillatory-State Alignment for Time Series Forecasting

自适应振荡状态对齐用于时间序列预测

Zhangyao Song, Ziqiong Li, Xiangfei Qiu, Chao Zha, Yinfei Xu, Tao Guo

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AOSNET框架,通过希尔伯特变换将固定模板匹配改为自适应振荡状态对齐,以处理实际时间序列中的非平稳振荡行为,在多个基准上达到先进或竞争性精度。

详情
AI中文摘要

长期时间序列预测受益于揭示重复时间结构的归纳偏置。现有的周期性预测方法通常通过预定义周期、全局频谱分量或固定可学习模板来建模重复性。然而,现实世界的时间动态很少是严格周期性的:振荡行为通常通过幅度调制、相位漂移和局部频率变化而演变。在这些条件下,固定模板的周期性建模可能与底层时间状态根本性不匹配。我们提出了AOSNET,一个希尔伯特引导的预测框架,将周期性预测从固定模板匹配重新表述为自适应振荡状态对齐。AOSNET从观测序列和可学习的全局振荡先验中提取解析信号描述符,然后通过描述符条件门自适应地对齐局部状态,该门选择性地保留可靠观测,同时软性纠正不匹配区域。学习到的先验不是作为刚性的重复模板,而是作为通过局部状态动力学解释的灵活振荡参考。在八个基准上的实验表明,具有快速推理速度的最先进或高度竞争的准确性。控制合成研究分离幅度调制、相位漂移和局部频率变化,证实振荡状态对齐的优势随着非平稳性加剧而持续增加。

英文摘要

Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or fixed learnable templates. However, real-world temporal dynamics are rarely rigidly periodic: oscillatory behavior often evolves through amplitude modulation, phase drift, and local frequency variation. Under these conditions, fixed-template periodic modeling can become fundamentally mismatched to the underlying temporal states. We propose AOSNET, a Hilbert-guided forecasting framework that reformulates periodic forecasting from fixed template matching to adaptive oscillatory-state alignment. AOSNET extracts analytic-signal descriptors from both the observed sequence and a learnable global oscillatory prior, then adaptively aligns local states through a descriptor-conditioned gate that selectively preserves reliable observations while softly correcting mismatched regions. The learned prior serves not as a rigid repeated template but as a flexible oscillatory reference interpreted through local state dynamics. Experiments on eight benchmarks demonstrate state-of-the-art or highly competitive accuracy with fast inference speed. Controlled synthetic studies isolating amplitude modulation, phase drift, and local frequency variation confirm that the advantage of oscillatory-state alignment consistently increases as non-stationarity intensifies.

2606.06007 2026-06-05 cs.LG

Diffusion Models for Adaptive Sequential Data Generation

自适应序列数据生成的扩散模型

Haoyang Cao, Minshuo Chen, Yinbin Han, Renyuan Xu

发表机构 * Department of Applied Mathematics and Statistics, Data Science and AI Institute, and Mathematical Institute for Data Science, Johns Hopkins University(应用数学与统计学系、数据科学与人工智能研究所、数据科学数学研究所,约翰霍普金斯大学) Department of Industrial Engineering and Management Sciences, Northwestern University(工业工程与管理科学系,西北大学) Department Management Science and Engineering, Stanford University(管理科学与工程系,斯坦福大学)

AI总结 提出一种顺序前向后向扩散框架,通过沿序列逐步注入和去除噪声并基于历史生成条件确保自适应性,用于生成自适应时间序列数据,并引入新的分数匹配目标实现高效并行训练,在合成数据和均值-方差最优投资组合构建中验证有效性。

详情
Comments
37 pages
AI中文摘要

生成逼真的合成序列数据在运筹学、金融、医疗、能源系统和科学计算等实际应用中至关重要,这些领域使用时间索引观测进行预测、模拟、风险评估和数据驱动决策。虽然扩散模型在生成静态数据方面取得了显著成功,但其直接扩展到序列设置往往无法捕捉时间依赖性和信息结构。设计能够以自适应方式模拟序列数据且不预知未来信息的扩散模型仍然是一个开放挑战。在这项工作中,我们提出了一种用于自适应时间序列生成的顺序前向后向扩散框架。我们的方法沿序列逐步注入和去除噪声,并基于先前生成的历史进行条件化以确保自适应性。引入了一种新的分数匹配目标以实现高效的并行训练。我们在一个通用框架下推导了严格的统计保证,然后以ReLU网络作为具体实例建立了分数逼近、分数估计和分布估计结果。在实验上,我们在合成数据(包括ARMA模型和高斯过程)上验证了我们的方法,并展示了其在构建均值-方差最优投资组合中的有效性。

英文摘要

Generating realistic synthetic sequential data is critical in real-world applications across operations research, finance, healthcare, energy systems, and scientific computing, where time-indexed observations are used for prediction, simulation, risk assessment, and data-driven decision-making. While diffusion models have achieved remarkable success in generating static data, their direct extensions to sequential settings often fail to capture temporal dependence and information structure. Designing diffusion models that can simulate sequential data in an adapted manner, and hence without anticipation of future information, therefore remains an open challenge. In this work, we propose a sequential forward-backward diffusion framework for adapted time series generation. Our approach progressively injects and removes noise along the sequence, conditioning on the previously generated history to ensure adaptiveness. A novel score-matching objective is introduced for efficient parallel training. We derive rigorous statistical guarantees under a generic framework, then establish score approximation, score estimation, and distribution estimation results with ReLU networks serving as a concrete instance. Empirically, we validate our method on synthetic data, including ARMA models and Gaussian processes, and demonstrate its effectiveness in constructing mean-variance optimal portfolios.

2606.06003 2026-06-05 cs.AI

Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs

超越向量相似性:面向工业知识图谱的图增强检索结构分析

Grama Chethan

发表机构 * arXiv.org

AI总结 本文通过对比八种检索架构,提出操作符词汇表论点,证明基于LLM的图推理瓶颈在于计算操作符而非模型智能,并引入LLM查询规划器,在工业知识图谱上实现优于定制处理器的性能。

详情
Comments
11 pages
AI中文摘要

检索增强生成(RAG)在需要对互连实体进行结构推理的查询上系统性失败。我们比较了八种用于航空航天供应链情报的检索架构,从文本检索逐步过渡到图遍历和图计算。使用一个包含46个节点和64条类型边的知识图谱,我们评估了10个意图类别下的23个查询,并证明向量检索在结构上无法覆盖五类查询。我们的核心发现是操作符词汇表论点:基于LLM的图推理的障碍不是模型智能,而是作为工具可用的计算操作符。一个配备9种类型遍历原语的LLM查询规划器在性能上优于定制处理器(F1=0.632 vs 0.472),同时能泛化到未见查询。添加6种图计算工具后,LLM仅在遍历失败的查询类别上选择性采用它们。我们还发现一个测量差距:实体级F1系统性低估了正确答案为完整集合的结构查询。

英文摘要

Retrieval-Augmented Generation (RAG) fails systematically on queries requiring structural reasoning over interconnected entities. We compare eight retrieval architectures for aerospace supply chain intelligence, progressing from text retrieval through graph traversal to graph computation. Using a 46-node knowledge graph with 64 typed edges, we evaluate 23 queries across 10 intent categories and demonstrate that five query classes are structurally unreachable for vector retrieval. Our central finding is the operator vocabulary thesis: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner with 9 typed traversal primitives outperforms bespoke handlers (F1 = 0.632 vs. 0.472) while generalizing to unseen queries. Adding 6 graph computation tools, the LLM selectively adopts them for exactly the query categories where traversal fails. We also identify a measurement gap: entity-level F1 systematically underscores structural queries where comprehensive answers are correct.

2606.05999 2026-06-05 cs.CV cs.AI

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

ATT-CR: 自适应三角变换器用于云去除

Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

发表机构 * Xi’an Jiaotong University(西安交通大学) School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics(计算机与人工智能学院,西南财经大学) Ningbo University of Technology(宁波工程学院)

AI总结 提出自适应三角变换器(ATT-CR),通过三角注意力和特征选择门控模块降低计算复杂度并减少云像素干扰,实现高效云去除。

详情
AI中文摘要

云去除旨在准确重建遥感图像中被云遮挡的地面物体。现有的基于Transformer的方法利用自注意力有效建模云图像中的长距离依赖,取得了显著效果。然而,它们存在以下问题:1)自注意力的高计算复杂度限制了可扩展性;2)在注意力计算中将云像素和干净像素均视为有效,会在后续层中引入干扰,导致性能次优。为解决这些挑战,我们提出了自适应三角变换器用于云去除(ATT-CR),该模型有效降低了计算成本并减轻了云像素的干扰。具体而言,它包含两个核心组件:三角注意力(TAN)和特征选择门控模块(FSGM)。TAN使用下三角和上三角矩阵近似Softmax注意力,计算复杂度为O(N),显著降低了计算成本。而FSGM与TAN集成,自适应地区分云特征和干净特征,从而最小化无效信息引入后续层。在云去除基准上的大量实验表明,ATT-CR相比现有方法具有更优的性能。

英文摘要

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

2606.05998 2026-06-05 cs.CV cs.AI

Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images

基于深度学习的二维口内图像三维口腔重建

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种仅用十张二维口内图像进行三维口腔重建的软件方法,采用MobileNetV2与多头注意力机制,降低成本和不适,实现自动化重建。

详情
Comments
4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025
AI中文摘要

口腔三维建模是牙科中最关键的阶段之一,常用的方法如印模和口内扫描各有显著局限。印模法将藻酸盐或硅胶材料放入托盘并插入患者口腔形成阴模,存在患者不适、材料变形误差及存储运输困难等问题。口内扫描仪利用结构光或激光技术实时直接扫描口腔结构,效果先进但设备成本极高。为解决这些问题,本文提出一种基于软件的方法,仅使用从不同角度拍摄的十张二维口内图像重建三维口腔模型,无需专用硬件设备。该方法降低成本,消除物理扫描设备需求,减少患者不适,并实现自动化三维重建。模型在公开的Dental3DS数据集(包含950个上颌样本)上训练,采用MobileNetV2作为图像编码器,结合多头注意力进行多视图特征融合。所提模型在最近邻匹配(距离阈值0.035)下达到77.49%的准确率。然而,预测顶点倾向于集中在真实值的高密度区域,导致重建模型上的点分布不均匀。

英文摘要

Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.

2606.05997 2026-06-05 cs.CV

Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting

使用大语言模型和梯度提升的多模态性别歧视识别与表征

Kyriakos Chaviaras, Maria Lymperaiou, Athanasios Voulodimos

发表机构 * Artificial Intelligence and Learning Systems Laboratory(人工智能与学习系统实验室) School of Electrical and Computer Engineering(电气与计算机工程学院) National Technical University of Athens(雅典国家技术大学)

AI总结 提出基于特征工程和梯度提升回归模型的后融合管道,结合视觉、文本、人口统计、生物特征及LLM语义指标,用于识别和表征模因和短视频中的多模态性别歧视。

详情
AI中文摘要

我们介绍了AILS-NTUA提交给CLEF EXIST 2026实验室的工作,解决模因(任务2)和短视频(任务3)中的多模态性别歧视识别与表征问题。我们的系统采用基于特征工程的后融合管道,围绕梯度提升回归模型和层次化后处理构建。对于模因,我们结合了视觉、文本、人口统计、生物特征和LLM衍生的语义指标,旨在捕捉刻板印象、物化、讽刺和厌女等高层次线索。对于视频,我们研究了特征选择、基于帧的视觉表示、基于OCR的文本特征、声学描述符和传感器衍生元数据的影响。开发结果表明,聚焦的LLM衍生语义线索改善了模因性别歧视识别,而视频性能对特征维度和跨模态噪声高度敏感。对于视频,开发结果倾向于紧凑的特征选择,但官方测试结果表明这一结论不能完全推广到未见数据,其中未过滤的表征泛化更好。总体而言,我们的发现强调了针对静态模因进行目标语义特征工程的有用性,以及在嘈杂的短视频环境中需要更鲁棒的时间建模。

英文摘要

We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.

2606.05994 2026-06-05 cs.LG eess.SP

HoT-SSM:Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care

HoT-SSM:用于医疗保健的高阶时序知识图谱推理与状态空间模型

Thummaluru Siddartha Reddy, Vempalli Naga Sai Saketh, Yash Punjabi, Mahesh Chandran

发表机构 * Fujitsu Research of India, Bangalore(印度班加罗尔 Fujitsu 研究院)

AI总结 提出HoT-SSM模型,通过构建超图捕获高阶临床交互,并利用动态超图状态空间模型建模长程时序依赖,在MIMIC-III/IV数据集上显著提升临床预测性能。

详情
Comments
Paper under review
AI中文摘要

融合临床知识的医学知识图谱(MKGs)越来越多地被用于建模电子健康记录(EHRs),以支持医疗领域的可解释预测。然而,现有的基于MKG的方法在捕获临床概念(如病情、手术和药物)之间的成对关系方面存在局限,限制了其建模共现或语义相关概念间高阶交互的能力。此外,大多数利用MKG的表示学习方法要么跨就诊折叠时间信息,要么缺乏显式建模长程时序依赖的机制,而这对于死亡率预测等临床任务至关重要。为缓解这些局限,我们提出HoT-SSM,一种参数高效的高阶时序图推理方法,结合状态空间模型。对于每次就诊,HoT-SSM通过利用领域知识将语义相关的临床概念分组为超边来构建超图,从而保留就诊级别的临床上下文。此外,为在学表示的同时建模时序动态,我们引入一种新颖的基于动态超图的状态空间模型,显式捕获患者潜在状态随时间演变,同时保留长程信息。学到的表示用于下游临床预测和推理。在MIMIC-III和MIMIC-IV数据集上的实验表明,性能显著优于当前最先进模型,证明了联合建模高阶临床交互和长程时序依赖的有效性。

英文摘要

Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in healthcare domain. However, existing MKG-based approaches are limited in capturing pairwise relations between clinical concepts (e.g., conditions, procedures, and medications), and restricts their ability to model higher-order interactions among co-occurring or semantically related concepts. In addition, most representation learning methods that leverage MKGs either collapse temporal information across visits or lack an explicit mechanism for modeling long-range temporal dependencies, which is critical for clinical tasks such as mortality prediction. To mitigate these limitations, we propose HoT-SSM, a parameter efficient and higher-order temporal graph reasoning with state space models. For each visit, HoT-SSM constructs hypergraphs by grouping semantically related clinical concepts into hyperedges using domain knowledge, thereby preserving visit-level clinical context. Further, to model the temporal dynamics while learning the representations, we introduce a novel dynamic hypergraph-based state space model that explicitly captures patients latent state evolution over time while preserving long-range information. The learned representations are used for downstream clinical prediction and reasoning. Experiments on MIMIC-III and MIMIC-IV datasets shows significant performance improvement over the current state-of-the-art models, demonstrating the effectiveness of jointly modeling higher-order clinical interactions and long-range temporal dependencies.

2606.05988 2026-06-05 cs.LG cs.CL

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

压缩-蒸馏:面向高效知识蒸馏的推理轨迹压缩

Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham

发表机构 * Université catholique de Louvain(列日天主教大学) Sophont Inc(Sophont公司)

AI总结 本文提出在知识蒸馏前对推理轨迹进行事后压缩,以降低训练成本并缩短推理输出,实验表明压缩在准确率与效率间存在权衡。

详情
AI中文摘要

推理模型产生长的思维链轨迹,这些轨迹蒸馏成本高且鼓励学生输出冗长内容。我们研究在知识蒸馏前对这些轨迹进行事后压缩。两个教师模型,Qwen3.5-397B-A17B 和 gpt-oss-120B,各生成约 283k 条正确轨迹;两个指令调优模型将其压缩至原始字符长度的 8.6-21.0%。在包含 48 次运行的主网格和七次 Qwen 教师截断消融实验中,压缩轨迹将训练 token 减少至原始的 12-30%,训练速度提升 2.0-7.6 倍,推理输出缩短 3-19 倍,在更短的 gpt-oss 教师下减少幅度较小。然而,原始轨迹在每个规模下和两位教师上都保持最高的下游准确率。一项长度匹配的原始轨迹截断消融实验表明,压缩并非仅仅受益于更小的 token 预算:模型压缩的轨迹通常优于或匹配朴素截断,尤其是对于较小的学生模型,同时保持更短的推理输出。总体而言,推理轨迹压缩提供了准确率与效率之间的权衡,而非免费改进:学生模型保留了原始轨迹高达 96% 的准确率,同时获得了高达 18 倍的每 token 效率提升;在 0.8B 规模下,使用 LoRA 压缩轨迹缩小了原始与压缩之间的差距,但未超过原始轨迹。

英文摘要

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.

2606.05985 2026-06-05 cs.CL cs.CY

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

超越对齐:多元文化智能体系统中的价值多样性作为集体属性

Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) Washington University in St. Louis(华盛顿大学圣路易斯分校)

AI总结 针对多元文化多智能体系统,提出以价值多样性作为系统级评估轴,通过文化条件化智能体在共享价值调查中的响应差异度量,发现多样性几乎与对齐无关,且当前系统远低于人类社会,混合骨干系统缩小但未消除差距,社会互动进一步侵蚀多样性。

详情
AI中文摘要

多元文化多智能体系统越来越多地部署在全球多样化的环境中,其中不同的智能体基于不同的文化背景。现有的文化评估侧重于价值对齐:单个智能体与目标文化的匹配程度。然而,对齐是每个智能体的属性,无法揭示系统作为一个整体是否保留了其旨在代表的文化多元性。我们提出价值多样性作为多元文化智能体系统的系统级评估轴,通过文化条件化智能体在共享价值调查上的响应差异来定义。利用世界价值观调查,我们评估了19种文化和18个骨干模型在广泛的系统配置下的表现。我们发现多样性在很大程度上与对齐无关,表明两者捕捉了互补的系统属性,并且当前的多元文化智能体系统在价值多样性上远低于人类社会。混合骨干系统缩小了这一差距但未消除,且该差距在文化组成和智能体规模上持续存在。社会互动进一步通过驱使智能体达成共识而侵蚀多样性,一个参与式预算案例研究表明,这种同质化缩小了集体决策的广度。总之,我们的结果将价值多样性确立为多元文化多智能体系统的一个独特评估轴,并揭示了当前基于LLM的社会中持续存在的同质化趋势。我们的代码和数据公开在 https://github.com/iNLP-Lab/MultiAgent-Diversity。

英文摘要

Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at https://github.com/iNLP-Lab/MultiAgent-Diversity.

2606.05981 2026-06-05 cs.CV cs.LG

Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

基于视觉感知的多模态大语言模型条件编辑扩散的视频率流式风格化:蒸馏UNet + MLLM文本编码器上的非对称批处理推理

Yoshiyuki Ootani

发表机构 * Independent researcher(独立研究员)

AI总结 针对蒸馏扩散模型中文本编码器成为瓶颈的问题,提出一种结合非对称CUDA流水线、编译友好的ControlNet-LLLite重构和周期性条件刷新调度的流式管线,在消费级GPU上实现视频率实时风格化编辑。

详情
Comments
12 pages, 4 figures, 12 tables. Under review at IEEE Transactions on Circuits and Systems for Video Technology. Code, evaluation harness, and the released v3 Temporal LLLite adapter weights are at https://github.com/otanl/dreamlite-stream (also mirrored to Hugging Face and Zenodo)
AI中文摘要

扩散U-Net的激进蒸馏反转了实时文本到图像流水线的逐帧瓶颈:一旦去噪器成为4步或1步蒸馏的学生模型,文本编码器就成为关键路径。这种反转在视觉感知编辑扩散中最为严重,其中编码器是多模态大语言模型(MLLM)。我们研究了一个0.39B蒸馏编辑U-Net与2.13B MLLM文本编码器(Qwen3-VL)配对的情况,并提出了一种针对该场景的流式管线,该管线围绕三种工程机制构建:非对称侧流/主流CUDA流水线,带有批处理文本编码器摊销(以及可选的静态提示缓存);一种编译友好的ControlNet-LLLite重构,将整个U-Net +适配器堆栈折叠成单个融合图;以及一个带有钩子子集的周期性条件刷新调度,用于摊销每帧条件成本。在单个消费级RTX 3090 Ti上,512x512分辨率下,管线在批大小B=8时维持27.4 fps,B=16时维持29.6 fps,端到端p50延迟分别约为0.5和1.0秒;相同操作点在RTX 4090上测得54.9 fps,在RTX 5090上测得74.1 fps。我们报告的是视频率流式吞吐量而非交互式低延迟,并将我们的数据与相同堆栈的StreamDiffusion重运行进行对比,作为系统上下文,而非基准优越性声明。对于训练的油画风格,发布的时序适配器在剪辑内噪声中泛化到19个未使用的DAVIS-2017序列和来自七个来源的15个非DAVIS剪辑;对未见风格族的提示级泛化有限,并单独报告。

英文摘要

Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.

2606.05979 2026-06-05 cs.RO cs.AI

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

世界-语言-动作模型:统一世界建模、语言推理与动作合成

Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

发表机构 * SJTU(上海交通大学) SII(上海研究院) HUST(华中科技大学) SCUT(华南理工大学) ECUST(东华大学) SHU(上海大学) NJUPT(南京工业大学)

AI总结 提出世界-语言-动作(WLA)模型,通过自回归Transformer联合预测文本子任务、子目标图像和机器人动作,融合世界建模与语言推理能力,实现多任务和长时域任务的最优性能。

详情
Comments
19 pages, 10 figures
AI中文摘要

我们提出世界-语言-动作(WLA)模型作为一类新的具身基础模型。WLA以文本指令、图像和机器人状态为输入,联合预测文本子任务、子目标图像和机器人动作,结合了世界-动作模型(WAM)中从大量自我中心视频学习的世界建模接口,以及视觉-语言-动作(VLA)模型中解决复杂长时域任务的语言推理能力。WLA的核心是一个自回归(AR)Transformer主干,而非WAM中的双向扩散Transformer,用于预测下一状态,包括语义级别的文本意图和互补的细粒度物理动态。物理动态由基于专用世界专家的世界建模目标监督,并用于简化动作专家的状态-动作相关性表征。WLA利用元查询使世界预测隐式影响动作生成,从而在推理时可禁用世界预测。世界预测也可被激活以实现测试时缩放,从而改进机器人控制。我们的WLA-0原型具有2B活跃参数,在NVIDIA RTX 5090上每次推理耗时40毫秒。在模拟和真实环境中的评估表明,WLA-0实现了最先进的多任务和长时域学习能力,例如在RoboTwin2.0 Clean上成功率为92.94%,在RMBench上成功率为56.5%。WLA-0还有望直接从跨具身机器人视频中学习新任务,无需动作标注。

英文摘要

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the \emph{world modeling interface} to learn from extensive egocentric videos as in the world-action model (WAM) and the \emph{language reasoning} capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an \emph{autoregressive (AR)} Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the \emph{next state}, comprising the \emph{semantic-level} textual intention and complementary \emph{fine-grained} physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction \emph{implicitly} impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94\% success rate on RoboTwin2.0 Clean and 56.5\% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from \emph{cross-embodiment robot videos} without action annotations.

2606.05976 2026-06-05 cs.AI cs.CL

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

自我修正错觉:LLM 纠正他人但不纠正自己

Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

发表机构 * National Taiwan University(国立台湾大学)

AI总结 本文通过保持错误声明字节一致仅改变角色标签,发现 LLM 无法自我修正并非能力缺陷,而是聊天模板角色标签的人为产物,并提出无需训练或模型修改的提示结构干预方法。

详情
AI中文摘要

近期研究表明,LLM 智能体难以纠正自身推理轨迹中的错误,但当相同声明出现在外部来源时,其修正率显著更高。我们探究这种不对称性反映的是能力缺陷还是角色标签的人为产物:智能体纠正错误声明的意愿是否因果地依赖于承载该声明的聊天模板角色,而非声明内容本身?我们的实验设置在所有条件下保持错误声明的字节完全一致(SHA-256 验证),仅改变其包装角色:智能体自身的 \role{<thought>}、\role{user} 消息、\role{tool} 响应或 \role{system <memory>} 块。在覆盖七个模型家族和三个领域的 13 个模型-领域单元(每个单元 n=30 对任务)中,将声明从 \role{<thought>} 重新标记为外部角色后,显式修正率提升了 23 到 93 个百分点,其中 13 个单元中有 10 个达到 p<0.001。进一步实验证实该效应是不对称的、机制上可分解的,并且跨领域稳健。自我修正失败并非认知缺陷,而是聊天模板的人为产物。我们利用这一人为产物设计了一种仅涉及提示结构、无需训练和模型修改的干预方法,其最强角色标签依赖于领域:在数学上 \role{<memory>} 占主导,而在逻辑推理上普通 \role{user} 消息占主导。

英文摘要

Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own \role{<thought>}, a \role{user} message, a \role{tool} response, or a \role{system <memory>} block. Across 13 model-domain cells covering seven model families and three domains ($n{=}30$ paired tasks per cell), relabeling the claim from \role{<thought>} to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching $p{<}0.001$. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: \role{<memory>} dominates on math, while a plain \role{user} message dominates on logical deduction.

2606.05975 2026-06-05 cs.CV cs.RO

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

T-FunS3D:任务驱动的分层开放词汇3D功能分割

Jingkun Feng, Reza Sabzevari

发表机构 * P4MARS Lab at the Faculty of Aerospace Engineering, Delft University of Technology(代尔夫特理工大学航空航天工程学院P4MARS实验室)

AI总结 提出T-FunS3D方法,通过构建开放词汇场景图并利用视觉语言模型,实现任务驱动的分层3D功能分割,在保持性能的同时提升速度和降低内存消耗。

详情
AI中文摘要

开放词汇3D功能分割使机器人能够在3D场景中定位功能性物体组件。这是一项需要空间理解和任务解释的挑战性任务。当前的开放词汇3D分割方法主要关注物体级识别,而场景级部分分割方法试图详尽地分割整个场景,导致资源密集且耗时。在粒度、准确性和速度之间平衡分割性能仍然是一个挑战。作为缓解这一问题的一步,我们引入了T-FunS3D,一种任务驱动的分层开放词汇3D功能分割方法,为机器人应用提供可操作的感知。我们的方法以室内场景的3D点云和带姿态的RGB-D图像作为输入。通过提取环境中的实例及其视觉嵌入,我们构建了一个开放词汇场景图。给定任务描述,T-FunS3D识别场景图中最相关的实例,并利用视觉语言模型定位其功能组件。在SceneFun3D数据集上的实验表明,T-FunS3D在开放词汇3D功能分割方面与最先进方法相当,同时实现了更快的运行时间和更少的内存使用。

英文摘要

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.

2606.05972 2026-06-05 cs.LG

LLM Explainability with Counterfactual Chains and Causal Graphs

基于反事实链和因果图的LLM可解释性

Nirit Nussbaum-Hoffer, Nitay Calderon, Liat Ein-Dor, Roi Reichart

发表机构 * Faculty of Data and Decision Sciences, Technion I IBM Research(数据与决策科学学院,技术离子IBM研究所)

AI总结 提出一种四阶段方法,利用因果图建模LLM推理过程,通过MCMC启发的反事实增强发现类判别性概念并生成可解释图,用于疾病诊断、情感分析等任务。

详情
AI中文摘要

因果图为使机制透明提供了高级语言。近期工作使用大型语言模型(LLMs)恢复外部世界过程的因果图。相反,在本文中,我们使用因果图对LLM推理本身进行建模,为利益相关者提供模型如何感知和组织高层概念以产生预测的透明视图。我们提出了一种四阶段方法来构建此类图。给定目标LLM和一组文本示例,我们的方法发现类判别性、人类可解释的概念,并将每个输入映射到LLM感知的概念状态。然后,我们引入一种受MCMC启发的反事实增强过程,通过反事实链扩展稀疏的观测数据。这使得使用$σ$-CG进行稳定的因果发现成为可能,从而产生信息丰富且可解释的图。我们将我们的方法应用于三个LLM,涵盖疾病诊断、情感分析和LLM作为评判者的分类任务。我们评估了学习到的图的预测保真度和结构稳定性,以及受MCMC启发的增强的收敛性和下游效用。我们的结果表明,发现的因果图捕获了与LLM推理一致的有意义的依赖关系。总之,本文为LLM的概念级可解释性提供了基础。

英文摘要

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with $σ$-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs' reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.

2606.05960 2026-06-05 cs.RO

Towards a Data Flywheel for Embodied Intelligence in Logistics

面向物流具身智能的数据飞轮

Anlan Yu, Zaishu Chen, Zhiqing Hong, Daqing Zhang

发表机构 * Peking University(北京大学) JD Logistics(京东物流) HKUST (Guangzhou)(香港科技大学(广州))

AI总结 提出一种数据驱动的物流具身智能框架,通过构建数据飞轮将日常操作转化为可复用数据资产,利用世界模型生成长尾包裹操作的可靠监督,并整合多模态数据实现策略持续改进。

详情
AI中文摘要

具身智能正从实验室演示走向工业部署,物流行业是其中的关键应用场景。基于学习的策略为超越传统感知-规划-控制流程提供了有前景的路径,但其可扩展性取决于具身数据的收集、组织和复用方式。本研究通过构建物流数据飞轮,探索面向工业具身智能的数据中心框架。我们的框架将日常操作转化为可复用的数据资产,利用世界模型为长尾包裹操作生成可靠监督,并将部署反馈反馈到策略改进中。作为初步成果, extit{WM-DAgger}引入了一种基于世界模型的数据聚合框架,该框架合成了分布外恢复数据,用于鲁棒的模仿学习。在此成果基础上,正在进行的工作探索如何将大规模野外多模态数据(包括标注的人类演示、未标注的操作视频以及系统级机器人日志)对齐用于策略学习,并将其转化为持续系统改进的反馈。

英文摘要

Embodied intelligence is moving from laboratory demonstrations toward industrial deployment, with the logistics industry serving as a key application scenario. Learning-based policies offer a promising path beyond traditional perception-planning-control pipelines, but their scalability depends on how embodied data can be collected, organized, and reused. This research studies a data-centric framework for industrial embodied intelligence by constructing a logistics data flywheel. Our framework converts daily operations into reusable data assets, uses World Models to generate reliable supervision for long-tail parcel manipulation, and feeds deployment feedback back into policy improvement. As an initial result, \textit{WM-DAgger} introduces a World-Model-based data aggregation framework that synthesizes out-of-distribution recovery data for robust imitation learning. Building on this result, ongoing work explores how large-scale in-the-wild multimodal data, including labeled human demonstrations, unlabeled operational videos, and system-level robot logs, can be aligned for policy learning and transformed into feedback for continual system improvement.

2606.05958 2026-06-05 cs.LG

Steering Vectors are an Adversarial Attack Surface

Steering Vectors 是对抗攻击面

Abzal Aidakhmetov, Donato Crisostomi, Tommaso Mencattini, Adrian Robert Minut, Iacopo Masi, Emanuele Rodolà

发表机构 * Sapienza University of Rome(罗马萨皮恩扎大学) EPFL(苏黎世联邦理工学院)

AI总结 本文揭示了一种隐蔽的数据投毒攻击,通过替换转向数据集中的4-6%令牌,使转向向量与反拒绝方向对齐,从而劫持目标模型,同时保留对良性提示的预期转向效果。

详情
AI中文摘要

激活转向已成为一种无需微调即可控制大型语言模型(LLM)行为的流行方法。由于该技术即插即用,用户共享数据集和预计算向量以转向模型激活。然而,我们展示了一种隐蔽的数据投毒攻击可以悄无声息地破坏这一流程。通过替换转向数据集中的4-6%令牌,攻击者可以使结果向量与反拒绝方向对齐。这劫持了目标模型,同时保留了对良性提示的预期转向效果。在此威胁模型下,恶意行为者可以分发一个看似安全的包,包含文本、向量和权重,以及一个终端用户可以验证的等价证书。我们在两个开放权重模型系列和八个模型-属性组合上测试了该攻击,观察到中毒向量的绝对攻击成功率(ASR)达到20-55%,比干净参考高出19%到51%。最后,我们发现一种拒绝方向正交化防御可以恢复约82%的ASR差距,而不损害良性行为。

英文摘要

Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a \emph{stealth data poisoning attack} silently compromises this pipeline. By substituting $4{-}6\%$ of tokens in the steering dataset, an attacker can silently align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. Under this threat model, a malicious actor can distribute an apparently safe bundle containing texts, vectors, and weights, alongside an equivalence certificate that the end-user can verify. We test the attack on two open-weight model families and eight model-attribute combinations, observing that poisoned vectors reach an absolute attack success rate (ASR) of $20{-}55\%$, $+19\%$ to $+51\%$ over a clean reference. Finally, we find that a refusal-direction orthogonalization defense can recover ${\approx}82\%$ of the ASR gap without harming benign behavior.

2606.05957 2026-06-05 cs.LG stat.ML

Dead Directions: Geometric Singular Learning

死方向:几何奇异学习

Tejas Pradeep Shirodkar

发表机构 * IIIT, Hyderabad(Hyderabad 二十一世纪信息技术研究所)

AI总结 本文通过引入“死方向”概念,桥接奇异学习理论与信息几何,提出在原始参数坐标下从Fisher曲率衰减率恢复KL阶数的方法,并扩展到深度网络,实现无需后验采样的Watanabe三元组(λ, m, ν)轨迹率读出。

详情
Comments
139 pages, 13 figures, 13 tables
AI中文摘要

奇异学习理论和信息几何研究了相同的参数空间,但使用了大体不同的词汇:前者在解决坐标中计算贝叶斯不变量,后者在非退化假设下使用原始坐标,而过参数化模型经常违反该假设。我们通过一个原始概念——死方向——将它们桥接起来:死方向是沿着Fisher度量退化的单位向量,等价于具有确定KL阶数的解析奇异集的切向量,KL阶数由KL散度消失的速度决定。两种解读命名同一向量;我们的核心操作表明,其KL阶数可作为方向Fisher曲率趋近奇异点的衰减率恢复,在原始参数坐标中无需Hironaka分解。光滑纤维上的选择规则将该速率转化为Watanabe的单方向对实对数规范阈值的贡献,我们将恢复扩展到多分量交叉、重数m、奇异波动ν(在一维方向中KL阶数通用)、先验RLCT偏移以及温度后验。然后我们将该速率提升到深度网络:多层K-FAC分解将每个Fisher块写为激活侧和梯度侧速率的乘积,两者之间存在对偶性,并在现代网络原语(残差流、层归一化、注意力)中实例化。商定理将该速率传递到在G不变度量下梯度流的规范商Θ/G;SGD符合条件,标准Adam不符合,我们构造了一个G等变Adam族预条件器(DDCAdam)使其符合。该桥接提供了对奇异几何的参数坐标处理、每个架构的闭式预测,以及从一个检查点的前向和后向传播中无需后验采样的Watanabe三元组(λ, m, ν)轨迹率读出。

英文摘要

Singular learning theory and information geometry have studied the same parameter spaces in mostly separate vocabularies: the former computes Bayesian invariants in resolved coordinates, the latter works in original coordinates under a non-degeneracy assumption that overparameterised models routinely violate. We bridge them through one primitive, the dead direction: a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set with a definite KL order, set by how fast the KL divergence vanishes. The two readings name the same vector; our central move shows its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold, and we extend the recovery to multi-component crossings, multiplicity $m$, the singular fluctuation $ν$ (universal in the KL order for 1D directions), prior-RLCT shifts, and tempered posteriors. We then lift this rate to a deep network: a multi-layer K-FAC factorisation writes each Fisher block as a product of activation- and gradient-side rates with a duality between them, instantiated at modern-network primitives (residual streams, layer normalisation, attention). A quotient theorem carries the rate to the gauge quotient $Θ/G$ under gradient flow on a $G$-invariant metric; SGD qualifies, standard Adam does not, and we construct a $G$-equivariant Adam-family preconditioner (DDCAdam) that does. The bridge yields a parameter-coordinate handle on singular geometry, closed-form per-architecture predictions, and a trajectory-rate readout of Watanabe's triple $(λ, m, ν)$ from one checkpoint's forward and backward passes, without posterior sampling.

2606.05956 2026-06-05 cs.AI

Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics

最长路径的双向搜索:前向-前向启发式的情况

Tzur Shubi, Ariel Felner, Solomon Eyal Shimony, Shahaf S. Shperberg

发表机构 * Technion - Israel Institute of Technology(技术学院 - 以色列理工学院)

AI总结 提出BiXDFBnB算法,将单前沿双向搜索框架适配到广义最长简单路径问题,利用前向-前向启发式减少节点扩展,并在某些情况下提升运行时间。

详情
AI中文摘要

双向启发式搜索可以潜在地减少适用于后向搜索的问题的搜索工作量。众所周知,前向-前向启发式可以减少节点扩展的数量,但其开销如此之高,以至于总体运行时间几乎总是增加。我们提出了BiXDFBnB,一种双向深度优先分支定界算法,它将单前沿双向搜索(SFBDS)框架——最初为最短路径(MIN)问题开发——适配到广义最长简单路径(GLSP)设置。由于SFBDS本质上在配对状态上操作,前向-前向(F2F)启发式评估自然出现,并避免了通常与双向前沿管理相关的开销。我们展示了这种适配可以成功应用于最大化(MAX)问题,同时有效处理重叠约束。BiXDFBnB应用于几种类型的最长路径问题:最长简单路径(LSP)、Snakes和Coil-in-the-Box(CIB)。经验评估表明,新算法经常减少节点扩展的数量,并且在某些情况下也改善了总体运行时间。

英文摘要

Bidirectional heuristic search can potentially reduce search effort for problems amenable to backward search. Therein, it is well-known that front-to-front heuristics can reduce the number of node expansions, but their overhead is so high that overall runtime almost always increases. We propose BiXDFBnB, a bidirectional depth-first branch-and-bound algorithm that adapts the Single-Frontier Bidirectional Search (SFBDS) framework - originally developed for shortest-path (MIN) problems - to the Generalized Longest Simple Path (GLSP) setting. Because SFBDS inherently operates on paired states, front-to-front (F2F) heuristic evaluation arises naturally and avoids the overhead typically associated with bidirectional frontier management. We show that this adaptation can be successfully applied to maximization (MAX) problems while efficiently handling overlapping constraints. BiXDFBnB is applied to several types of longest-path problems: Longest Simple Path (LSP), Snakes, and Coil-in-the-Box (CIB). Empirical evaluation shows that the new algorithm frequently reduces the number of node expansions and, in some cases, also improves overall runtime.

2606.05952 2026-06-05 cs.RO cs.AI

Learning of Robot Safety Policies via Adversarial Synthetic Scenarios

通过对抗性合成场景学习机器人安全策略

Nikolai Dorofeev, Alexey Odinokov, Rostislav Yavorskiy

发表机构 * National Research Institute of Automation and Applied Mathematics(国家自动化与应用数学研究所)

AI总结 提出一个基于对抗性游戏的框架,通过红蓝两队对抗生成危险场景并迭代优化安全策略,以高效发现高风险边缘案例。

详情
AI中文摘要

在这项工作中,我们提出了一种基于代理的博弈框架,通过合成场景进行危险告知的机器人安全策略学习。我们将场景生成建模为两个代理之间的对抗游戏:红队通过构建危险情况探索潜在故障空间,蓝队则逐步完善安全策略以防止这些故障。这种迭代过程能够高效发现通过随机模拟或手动枚举难以捕获的高风险边缘案例。通过将经典风险建模与对抗性场景生成及现代学习范式相结合,这项工作为在复杂现实环境中运行的物理AI系统嵌入安全性提供了一条可扩展的路径。本文描述了正在进行的工作,贡献在于问题形式化和提出的解决方案架构。

英文摘要

In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent them. This iterative process enables efficient discovery of high-risk edge cases that are unlikely to be captured through random simulation or manual enumeration. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work provides a scalable pathway for embedding safety into Physical AI systems operating in complex real-world environments. The paper describes ongoing work. The contribution is a problem formulation and a proposed solution architecture.

2606.05950 2026-06-05 cs.AI

Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

Edit-R2:面向多轮图像编辑的上下文感知强化学习

Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan

发表机构 * Hong Kong University of Science and Technology(香港理工大学) Kuaishou Technology(快手科技)

AI总结 提出Edit-R2框架,通过重构会话意图和联合优化推理与生成的强化学习,解决多轮图像编辑中的长上下文稀释和状态污染问题,并在MICE-Bench基准上取得领先性能。

详情
AI中文摘要

基于扩散模型和统一多模态基础模型的文本引导图像编辑已取得快速进展。然而,现有方法大多局限于单轮设置,忽略了更现实的多轮上下文编辑场景,即用户通过一系列指令逐步细化图像。在此设置中,模型必须遵循每条新指令,同时保留累积的会话级约束,面临两种耦合的失败模式:长上下文稀释(稀疏文本约束难以从不断增长的图像-文本交错历史中恢复)和状态污染(早期编辑错误降低后续生成质量)。我们提出Edit-R2,一种用于统一多模态模型的新型强化学习后训练框架。Edit-R2重构操作会话意图,在每次编辑轮次前将分散的历史约束有效整合为显式推理轨迹。它进一步通过统一目标实现推理和生成的多轮强化学习,该目标联合优化离散文本空间中的意图重构生成和连续潜在空间中的流匹配图像生成,同时轨迹过滤机制抑制损坏的轨迹以在状态污染下稳定训练。为支持系统评估,我们引入MICE-Bench,一个大规模多轮上下文编辑基准,包含针对累积会话约束的指令遵循(IF)、内容一致性(CC)和全局感知(GA)的自动指标。实验表明,Edit-R2显著改进了多轮上下文编辑,并在与强基线的比较中取得了有竞争力的性能。

英文摘要

Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.

2606.05946 2026-06-05 cs.LG

Short paper: Models in the dark -- Rectification and erasure under GDPR in ML supply chains

短论文:黑暗中的模型——机器学习供应链中GDPR下的更正与删除

Henrik Graßhoff, Malte Hansen, Meiko Jensen, Sara Ramezanian

发表机构 * Karlstad University(卡尔斯塔德大学)

AI总结 本文从跨学科视角调查机器学习供应链中实现GDPR更正权和删除权的挑战,提出“黑暗中的模型”概念,并分析其带来的紧迫问题。

详情
Comments
accepted for presentation at Annual Privacy Forum 2026
AI中文摘要

根据《通用数据保护条例》(GDPR)确立的更正权和删除权对于保护个人隐私至关重要。然而,它们在机器学习(ML)系统中的有效执行仍然具有挑战性。现有工作大多从法律或技术角度孤立地处理这些权利,而忽视了模型是在涉及开发、分发和部署等多个参与者的复杂供应链中产生的。本文对ML模型中实现更正权和删除权的挑战进行了全面调查。基于学术文献和数据保护机构的指导,我们发现许多GDPR要求在技术上尚无法在实践中满足。我们的发现进一步表明,ML供应链中出现的问题在研究中的关注不足。为了解决这一差距,我们引入了“黑暗中的模型”的概念——即在ML链下游创建的衍生模型,缺乏足够的透明度和可追溯性——并分析了这一现象带来的紧迫挑战。通过采用跨学科视角,这项工作有助于弥合法律要求与ML中数据主体权利技术实施之间的差距,最终支持可信人工智能的发展。

英文摘要

The rights to rectification and erasure, as established under the General Data Protection Regulation (GDPR), are central to protecting individuals' privacy. However, their effective enforcement in machine learning (ML) systems remains challenging. Existing work has largely addressed these rights from either a legal or a technical perspective in isolation and disregards the fact that models are produced in complex supply chains involving multiple actors across development, distribution, and deployment. This paper presents a holistic survey of challenges in implementing the rights to rectification and erasure in ML models. Drawing on academic literature and guidance from data protection authorities, we find that many GDPR requirements cannot yet be technically met in practice. Our findings further suggest that issues arising in ML supply chains are insufficiently addressed in research. To tackle this gap, we introduce the notion of models in the dark -- derived models created further downstream in an ML chain without sufficient transparency or traceability -- and analyse the urgent challenges posed by this phenomenon. By adopting an interdisciplinary perspective, this work contributes to bridging the gap between legal requirements and the technical implementation of data subject rights in ML, ultimately supporting the development of trustworthy artificial intelligence.

2606.05937 2026-06-05 cs.CL

Large Language Models are Perplexed by some Political Parties

大型语言模型对某些政党感到困惑

Paul Lerner, François Yvon

发表机构 * Sorbonne Université, CNRS, ISIR(索邦大学、国家科学研究中心、信息研究所)

AI总结 通过困惑度评估,发现大型语言模型对极右翼和民族主义政党文本的困惑度高于社会民主党,且该偏差源于预训练阶段,指令微调影响甚微。

详情
AI中文摘要

大型语言模型(LLMs)的使用日益广泛,包括在政治应用中,但其政治公平性研究甚少。我们使用困惑度进行评估,认为一个公平的模型应对所有政治群体赋予相同的概率。然而,我们在涵盖37种语言的10个LLMs和三个数据集中发现,LLMs对极右翼和民族主义政党的文本比对社会民主党的文本更困惑。我们发现这与先前关于翻译公平性的研究一致,以至于困惑度与下游翻译指标相关。我们的方法适用于基础LLMs及其指令微调版本,并且发现两者高度相关,表明LLMs的政治公平性源于其预训练,而指令微调几乎不影响它。

英文摘要

Large Language Models (LLMs) are increasingly used, including in political applications, but their political fairness has been little studied. We assess it using perplexity, posing that a fair model should give equal probability to all political groups. However, we find, across ten LLMs and three datasets covering 37 languages, that LLMs are more perplexed by the texts of far right and nationalist parties than of social-democratic parties. We find this to be consistent with previous work on translation fairness, to the point that perplexity correlates with downstream translation metrics. Our method is applicable to both base LLMs as well as their instruction-tuned counterpart, and we find that both are highly correlated, suggesting that the political fairness of LLMs stems from their pretraining, and is hardly affected by instruction-tuning.

2606.05936 2026-06-05 cs.CL

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

语言模型中的认知不公正:预训练过滤器和护栏的审计

Marco Antonio Stranisci, A Pranav, Rossana Damiano, Christian Hardmeier, Anne Lauscher

发表机构 * University of Turin(都灵大学) IT University of Copenhagen(哥本哈根技术大学) Trustworthy AI Lab(可信人工智能实验室)

AI总结 通过审计预训练过滤器和推理时护栏,发现它们对边缘群体(如跨性别者、女性和中美洲人)的提及存在过度标记,导致认知抹除,而人类标注者会保留大部分被标记内容。

详情
AI中文摘要

现代语言模型依赖预训练过滤器从训练语料中移除不良内容,以及推理时护栏抑制部署期间的不良输出。在本文中,我们研究了这些过滤和审核决策如何产生认知抹除形式,并揭示了自动化系统之间以及这些系统与人类判断之间的紧张关系。我们审计了四个预训练过滤器和三个推理时护栏,针对包含性别和地域来源提及的Common Crawl句子,以及一个手动标注的500句子子集。我们的分析表明,过滤和护栏决策与基于黑名单的词汇线索强相关,同时经常未能标记包含私人信息或明确仇恨言论的内容。与此同时,边缘群体,特别是跨性别者、女性和中美洲人,在各个系统中被显著过度标记。相比之下,人类标注者会保留88.5%的过滤器标记内容和91.3%的护栏标记内容,通常能识别出当前系统未能捕捉到的、由内容移除紧张关系产生的表征性伤害。综合来看,我们的研究结果记录了一种认知抹除形式,其中对边缘群体的提及在预训练前被不成比例地移除,并在推理时再次被抑制。

英文摘要

Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5\% of filter-flagged and 91.3\% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.